linux-kernel.vger.kernel.org archive mirror
* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
@ 2004-12-10 17:49 Mark_H_Johnson
  2004-12-10 21:09 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Mark_H_Johnson @ 2004-12-10 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Fernando Pablo Lopez-Lezcano, Lee Revell,
	Rui Nuno Capela, Shane Shrybman, Esben Nielsen, Thomas Gleixner,
	Michal Schmidt

>The -32-15 kernel can be downloaded from the
>usual place:
>
> http://redhat.com/~mingo/realtime-preempt/

By the time I started today, the directory had -18 so that is what I built
with. Here are some initial results from running cpu_delay (the simple
RT test / user tracing) on a -18 PREEMPT_RT kernel. Have not started
any of the stress tests yet.

To recap, all IRQ # tasks, ksoftirqd/# and events/# tasks are RT FIFO
priority 99. The test program runs at RT FIFO 30 and should use about
70% of the CPU time on the two-CPU SMP system under test.

[1] I still do not get traces where cpu_delay switches CPUs. I only
get trace output if it starts and ends on a single CPU. I also had
several cases where I "triggered" a trace but no output - I assume
those are related symptoms. For example:

# ./cpu_delay 0.000100
Delay limit set to 0.00010000 seconds
calibrating loop ....
time diff= 0.504598 or 396354830 loops/sec.
Trace activated with 0.000100 second delay.
Trace triggered with 0.000102 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000164 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000132 second delay. [00]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000128 second delay. [01]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000144 second delay. [not recorded]
Trace triggered with 0.000355 second delay. [02]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000101 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000126 second delay. [not recorded]
Trace triggered with 0.000205 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000147 second delay. [03]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000135 second delay. [04]
Trace triggered with 0.000110 second delay. [not recorded]
Trace triggered with 0.000247 second delay. [05]
Trace triggered with 0.000120 second delay. [06]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000107 second delay. [07]
Trace triggered with 0.000104 second delay. [not recorded]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000100 second delay. [08]
Trace activated with 0.000100 second delay.
Trace triggered with 0.000201 second delay. [09]

# chrt -f 1 ./get_ltrace.sh 50
Current Maximum is 4965280, limit will be 50.
Resetting max latency from 4965280 to 50.
No new latency samples at Fri Dec 10 11:01:22 CST 2004.
No new latency samples at Fri Dec 10 11:01:32 CST 2004.
No new latency samples at Fri Dec 10 11:01:42 CST 2004.
No new latency samples at Fri Dec 10 11:01:53 CST 2004.
No new latency samples at Fri Dec 10 11:02:03 CST 2004.
No new latency samples at Fri Dec 10 11:02:13 CST 2004.
New trace 0 w/ 117 usec latency.
Resetting max latency from 117 to 50.
No new latency samples at Fri Dec 10 11:02:35 CST 2004.
No new latency samples at Fri Dec 10 11:02:45 CST 2004.
New trace 1 w/ 99 usec latency.
Resetting max latency from 99 to 50.
New trace 2 w/ 248 usec latency.
Resetting max latency from 248 to 50.
New trace 3 w/ 146 usec latency.
Resetting max latency from 146 to 50.
New trace 4 w/ 134 usec latency.
Resetting max latency from 134 to 50.
New trace 5 w/ 250 usec latency.
Resetting max latency from 250 to 50.
New trace 6 w/ 120 usec latency.
Resetting max latency from 120 to 50.
New trace 7 w/ 105 usec latency.
Resetting max latency from 105 to 50.
New trace 8 w/ 91 usec latency.
Resetting max latency from 91 to 50.
New trace 9 w/ 200 usec latency.

For the most part, the elapsed times in the traces closely match what
was measured by the application.

[2] The all CPU traces appear to show some cases where both ksoftirqd
tasks are active during a preemption of the RT task. That is expected
with my priority settings. [though the first trace shows BOTH getting
activated within 2 usec!]

[3] Some traces show information on both CPUs and then a long period
with no traces from the other. Here is an example...

preemption latency trace v1.1.4 on 2.6.10-rc2-mm3-V0.7.32-18RT
--------------------------------------------------------------------
 latency: 99 us, #275/275, CPU#0 | (M:rt VP:0, KP:1, SP:1 HP:1 #P:2)
    -----------------
    | task: cpu_delay-3556 (uid:0 nice:0 policy:1 rt_prio:30)
    -----------------
 => started at: <00000000>
 => ended at:   <00000000>

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
<unknown-2847  1d...    0µs : smp_apic_timer_interrupt (86a89bf 0 0)
cpu_dela-3556  0d.h1    0µs : update_process_times
(smp_apic_timer_interrupt)
cpu_dela-3556  0d.h1    0µs : account_system_time (update_process_times)
cpu_dela-3556  0d.h1    0µs : check_rlimit (account_system_time)
<unknown-2847  1d.h.    0µs : update_process_times
(smp_apic_timer_interrupt)
cpu_dela-3556  0d.h1    0µs : account_it_prof (account_system_time)
...
<unknown-2847  1d.h1    4µs : _raw_spin_unlock (scheduler_tick)
cpu_dela-3556  0d.h1    4µs : irq_exit (apic_timer_interrupt)
<unknown-2847  1d.h.    4µs : rebalance_tick (scheduler_tick)
cpu_dela-3556  0d..2    4µs : do_softirq (irq_exit)
cpu_dela-3556  0d..2    4µs : __do_softirq (do_softirq)
<unknown-2847  1d.h.    5µs : irq_exit (apic_timer_interrupt)
cpu_dela-3556  0d..2    5µs : wake_up_process (do_softirq)
cpu_dela-3556  0d..2    5µs : try_to_wake_up (wake_up_process)
cpu_dela-3556  0d..2    5µs : task_rq_lock (try_to_wake_up)
<unknown-2847  1d...    5µs < (0)
cpu_dela-3556  0d..2    5µs : _raw_spin_lock (task_rq_lock)
cpu_dela-3556  0d..3    6µs : activate_task (try_to_wake_up)
cpu_dela-3556  0d..3    6µs : sched_clock (activate_task)
cpu_dela-3556  0d..3    7µs : recalc_task_prio (activate_task)
cpu_dela-3556  0d..3    7µs : effective_prio (recalc_task_prio)
cpu_dela-3556  0d..3    7µs : activate_task <ksoftirq-4> (0 1):
cpu_dela-3556  0d..3    7µs : enqueue_task (activate_task)
cpu_dela-3556  0d..3    8µs : resched_task (try_to_wake_up)
cpu_dela-3556  0dn.3    8µs : __trace_start_sched_wakeup (try_to_wake_up)
cpu_dela-3556  0dn.3    8µs : try_to_wake_up <ksoftirq-4> (0 45):
cpu_dela-3556  0dn.3    9µs : _raw_spin_unlock_irqrestore (try_to_wake_up)
cpu_dela-3556  0dn.2    9µs : preempt_schedule (try_to_wake_up)
cpu_dela-3556  0dn.2    9µs : wake_up_process (do_softirq)
cpu_dela-3556  0dn.1   10µs < (0)
cpu_dela-3556  0.n.1   10µs : preempt_schedule (up)
cpu_dela-3556  0.n..   10µs : preempt_schedule (user_trace_start)
cpu_dela-3556  0dn..   11µs : __sched_text_start (preempt_schedule)
cpu_dela-3556  0dn.1   11µs : sched_clock (__sched_text_start)
cpu_dela-3556  0dn.1   11µs : _raw_spin_lock_irq (__sched_text_start)
cpu_dela-3556  0dn.1   12µs : _raw_spin_lock_irqsave (__sched_text_start)
cpu_dela-3556  0dn.2   12µs : pull_rt_tasks (__sched_text_start)
cpu_dela-3556  0dn.2   12µs : find_next_bit (pull_rt_tasks)
cpu_dela-3556  0dn.2   13µs : find_next_bit (pull_rt_tasks)
cpu_dela-3556  0d..2   13µs : trace_array (__sched_text_start)
cpu_dela-3556  0d..2   13µs : trace_array <ksoftirq-4> (0 1):
cpu_dela-3556  0d..2   15µs : trace_array <cpu_dela-3556> (45 46):
cpu_dela-3556  0d..2   16µs+: trace_array (__sched_text_start)
ksoftirq-4     0d..2   19µs : __switch_to (__sched_text_start)
ksoftirq-4     0d..2   20µs : __sched_text_start <cpu_dela-3556> (45 0):
ksoftirq-4     0d..2   20µs : finish_task_switch (__sched_text_start)
ksoftirq-4     0d..2   20µs : smp_send_reschedule_allbutself
(finish_task_switch)
ksoftirq-4     0d..2   20µs : send_IPI_allbutself
(smp_send_reschedule_allbutself)
ksoftirq-4     0d..2   21µs : __bitmap_weight (send_IPI_allbutself)
ksoftirq-4     0d..2   21µs : __send_IPI_shortcut (send_IPI_allbutself)
ksoftirq-4     0d..2   21µs : _raw_spin_unlock (finish_task_switch)
ksoftirq-4     0d..1   22µs : trace_stop_sched_switched
(finish_task_switch)
ksoftirq-4     0....   23µs : _do_softirq (ksoftirqd)
ksoftirq-4     0d...   23µs : ___do_softirq (_do_softirq)
ksoftirq-4     0....   23µs : run_timer_softirq (___do_softirq)
ksoftirq-4     0....   24µs : _spin_lock (run_timer_softirq)
ksoftirq-4     0....   24µs : __spin_lock (_spin_lock)
ksoftirq-4     0....   24µs : __might_sleep (__spin_lock)
<unknown-2847  1d...   24µs : smp_reschedule_interrupt (86a8bd8 0 0)
ksoftirq-4     0....   24µs : _down_mutex (__spin_lock)
<unknown-2847  1d...   25µs < (0)
ksoftirq-4     0....   25µs : __down_mutex (__spin_lock)
ksoftirq-4     0....   25µs : __might_sleep (__down_mutex)
ksoftirq-4     0d...   25µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..1   25µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..2   26µs : _raw_spin_lock (__down_mutex)
ksoftirq-4     0d..3   26µs : set_new_owner (__down_mutex)
ksoftirq-4     0d..3   26µs : set_new_owner <ksoftirq-4> (0 0):
ksoftirq-4     0d..3   27µs : _raw_spin_unlock (__down_mutex)
ksoftirq-4     0d..2   27µs : _raw_spin_unlock (__down_mutex)
ksoftirq-4     0d..1   27µs : _raw_spin_unlock (__down_mutex)
... no more traces from CPU 1 ...
ksoftirq-4     0....   77µs : rcu_check_quiescent_state
(__rcu_process_callbacks)
ksoftirq-4     0....   77µs : cond_resched_all (___do_softirq)
ksoftirq-4     0....   77µs : cond_resched (___do_softirq)
ksoftirq-4     0....   78µs : cond_resched (ksoftirqd)
ksoftirq-4     0....   78µs : kthread_should_stop (ksoftirqd)
ksoftirq-4     0....   78µs : schedule (ksoftirqd)
ksoftirq-4     0....   78µs : __sched_text_start (schedule)
ksoftirq-4     0...1   79µs : sched_clock (__sched_text_start)
ksoftirq-4     0...1   79µs : _raw_spin_lock_irq (__sched_text_start)
ksoftirq-4     0...1   79µs : _raw_spin_lock_irqsave (__sched_text_start)
ksoftirq-4     0d..2   80µs : deactivate_task (__sched_text_start)
ksoftirq-4     0d..2   80µs : dequeue_task (deactivate_task)
ksoftirq-4     0d..2   81µs : trace_array (__sched_text_start)
ksoftirq-4     0d..2   82µs : trace_array <cpu_dela-3556> (45 46):
ksoftirq-4     0d..2   84µs+: trace_array (__sched_text_start)
cpu_dela-3556  0d..2   86µs : __switch_to (__sched_text_start)
cpu_dela-3556  0d..2   87µs : __sched_text_start <ksoftirq-4> (0 45):
cpu_dela-3556  0d..2   87µs : finish_task_switch (__sched_text_start)
cpu_dela-3556  0d..2   87µs : _raw_spin_unlock (finish_task_switch)
cpu_dela-3556  0d..1   88µs : trace_stop_sched_switched
(finish_task_switch)
cpu_dela-3556  0d...   89µs+< (0)
cpu_dela-3556  0d...   92µs : math_state_restore (device_not_available)
cpu_dela-3556  0d...   92µs : restore_fpu (math_state_restore)
cpu_dela-3556  0d...   93µs < (0)
cpu_dela-3556  0....   93µs > sys_gettimeofday (00000000 00000000 0000007b)
cpu_dela-3556  0....   93µs : sys_gettimeofday (sysenter_past_esp)
cpu_dela-3556  0....   94µs : user_trace_stop (sys_gettimeofday)
cpu_dela-3556  0...1   94µs : user_trace_stop (sys_gettimeofday)
cpu_dela-3556  0...1   95µs : _raw_spin_lock_irqsave (user_trace_stop)
cpu_dela-3556  0d..2   95µs : _raw_spin_unlock_irqrestore (user_trace_stop)

If I read this right, we tried to reschedule cpu_delay on CPU #1 (at
24 usec) but it never happened and cpu_delay was still "ready to run"
on CPU #0 70 usec later.

[4] I have a trace where cpu_delay was bumped off CPU #1 at 20 usec
while the X server (not RT) was the active process on CPU #0 for another
130 usec (several trace entries with preempt-depth == 0) until it finally
got bumped by IRQ 0.

[5] More of a cosmetic problem: several traces still show the
application name as "unknown" - even for long-lived processes like
ksoftirqd/0 and X.

Due to the file size, I will send the traces separately.
  --Mark


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
@ 2004-12-10 21:09 ` Ingo Molnar
  2004-12-10 21:12 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2004-12-10 21:09 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> [1] I still do not get traces where cpu_delay switches CPUs. I only
> get trace output if it starts and ends on a single CPU. [...]

lt001.18RT/lt.02 is such a trace. It starts on CPU#1:

 <unknown-3556  1...1    0µs : find_next_bit (user_trace_start)

and ends on CPU#0:

 <unknown-3556  1...1  247µs : _raw_spin_lock_irqsave (user_trace_stop)

the trace shows a typical migration of an RT task.

(but ... i have to say the debugging overhead is horrible. Please try a
completely-non-debug-non-tracing kernel just to see the difference.)

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
  2004-12-10 21:09 ` Ingo Molnar
@ 2004-12-10 21:12 ` Ingo Molnar
  2004-12-10 21:24 ` Ingo Molnar
  2004-12-13  0:16 ` Fernando Lopez-Lezcano
  3 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2004-12-10 21:12 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> [3] Some traces show information on both CPUs and then a long period
> with no traces from the other. Here is an example...

> <unknown-2847  1d.h.    4µs : rebalance_tick (scheduler_tick)
> <unknown-2847  1d.h.    5µs : irq_exit (apic_timer_interrupt)
> <unknown-2847  1d...    5µs < (0)

> ... no more traces from CPU 1 ...

PID 2847 returned to userspace at timestamp 5µs. Userspace can then take
an arbitrary amount of time before it calls the kernel again.

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
  2004-12-10 21:09 ` Ingo Molnar
  2004-12-10 21:12 ` Ingo Molnar
@ 2004-12-10 21:24 ` Ingo Molnar
  2004-12-13  0:16 ` Fernando Lopez-Lezcano
  3 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2004-12-10 21:24 UTC (permalink / raw)
  To: Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Fernando Pablo Lopez-Lezcano, Lee Revell, Rui Nuno Capela,
	Shane Shrybman, Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Mark_H_Johnson@raytheon.com <Mark_H_Johnson@raytheon.com> wrote:

> [...] I also had several cases where I "triggered" a trace but no
> output - I assume those are related symptoms. For example:
> 
> # ./cpu_delay 0.000100
> Delay limit set to 0.00010000 seconds
> calibrating loop ....
> time diff= 0.504598 or 396354830 loops/sec.
> Trace activated with 0.000100 second delay.
> Trace triggered with 0.000102 second delay. [not recorded]
> Trace activated with 0.000100 second delay.
> Trace triggered with 0.000164 second delay. [not recorded]

is the userspace delay measurement nested inside the kernel-based
method? I.e. is it something like:

	gettimeofday(0,1);
	timestamp1 = cycles();

	... loop some ...

	timestamp2 = cycles();
	gettimeofday(0,0);

and do you get 'unreported' latencies in such a case too? If yes then
that would indeed indicate a tracer bug. But if the measurement is done
like this:

	gettimeofday(0,1);
	timestamp1 = cycles();

	... loop some ...

	gettimeofday(0,0);		// [1]
	timestamp2 = cycles();		// [2]

then a delay could get in between [1] and [2].

OTOH if the 'loop some' time is long enough then the [1]-[2] window is
too small to be statistically significant, while your logs show a near
50% 'miss rate'.

	Ingo


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
                   ` (2 preceding siblings ...)
  2004-12-10 21:24 ` Ingo Molnar
@ 2004-12-13  0:16 ` Fernando Lopez-Lezcano
  2004-12-13  6:47   ` Ingo Molnar
  3 siblings, 1 reply; 30+ messages in thread
From: Fernando Lopez-Lezcano @ 2004-12-13  0:16 UTC (permalink / raw)
  To: Ingo Molnar, Mark_H_Johnson
  Cc: Amit Shah, Karsten Wiese, Bill Huey, Adam Heath, emann,
	Gunther Persoons, K.R. Foley, linux-kernel, Florian Schmidt,
	Lee Revell, Rui Nuno Capela, Shane Shrybman, Esben Nielsen,
	Thomas Gleixner, Michal Schmidt

On Fri, 2004-12-10 at 09:49, Mark_H_Johnson@raytheon.com wrote:
> >The -32-15 kernel can be downloaded from the
> >usual place:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> 
> By the time I started today, the directory had -18 so that is what I built
> with. Here are some initial results from running cpu_delay (the simple
> RT test / user tracing) on a -18 PREEMPT_RT kernel. Have not started
> any of the stress tests yet.

Something that just happened to me: running 0.7.32-14 (PREEMPT_DESKTOP)
and trying to install 0.7.32-19 from a custom built rpm package
completely hangs the machine (p4 laptop - I tried twice). No clues left
behind. If I boot into 0.7.32-9 I can install 0.7.32-19 with no
problems. 

-- Fernando




* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-13  0:16 ` Fernando Lopez-Lezcano
@ 2004-12-13  6:47   ` Ingo Molnar
  2004-12-14  0:46     ` Fernando Lopez-Lezcano
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2004-12-13  6:47 UTC (permalink / raw)
  To: Fernando Lopez-Lezcano
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Lee Revell, Rui Nuno Capela, Shane Shrybman,
	Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> Something that just happened to me: running 0.7.32-14
> (PREEMPT_DESKTOP) and trying to install 0.7.32-19 from a custom built
> rpm package completely hangs the machine (p4 laptop - I tried twice).
> No clues left behind. If I boot into 0.7.32-9 I can install 0.7.32-19
> with no problems. 

does 0.7.32-19 work better if you reverse (patch -R) the loop.h and
loop.c bits (see below)?

	Ingo

--- linux/drivers/block/loop.c.orig
+++ linux/drivers/block/loop.c
@@ -378,7 +378,7 @@ static void loop_add_bio(struct loop_dev
 		lo->lo_bio = lo->lo_biotail = bio;
 	spin_unlock_irqrestore(&lo->lo_lock, flags);
 
-	up(&lo->lo_bh_mutex);
+	complete(&lo->lo_bh_done);
 }
 
 /*
@@ -427,7 +427,7 @@ static int loop_make_request(request_que
 	return 0;
 err:
 	if (atomic_dec_and_test(&lo->lo_pending))
-		up(&lo->lo_bh_mutex);
+		complete(&lo->lo_bh_done);
 out:
 	bio_io_error(old_bio, old_bio->bi_size);
 	return 0;
@@ -495,12 +495,12 @@ static int loop_thread(void *data)
 	/*
 	 * up sem, we are running
 	 */
-	up(&lo->lo_sem);
+	complete(&lo->lo_done);
 
 	for (;;) {
-		down_interruptible(&lo->lo_bh_mutex);
+		wait_for_completion_interruptible(&lo->lo_bh_done);
 		/*
-		 * could be upped because of tear-down, not because of
+		 * could be completed because of tear-down, not because of
 		 * pending work
 		 */
 		if (!atomic_read(&lo->lo_pending))
@@ -521,7 +521,7 @@ static int loop_thread(void *data)
 			break;
 	}
 
-	up(&lo->lo_sem);
+	complete(&lo->lo_done);
 	return 0;
 }
 
@@ -708,7 +708,7 @@ static int loop_set_fd(struct loop_devic
 	set_blocksize(bdev, lo_blocksize);
 
 	kernel_thread(loop_thread, lo, CLONE_KERNEL);
-	down(&lo->lo_sem);
+	wait_for_completion(&lo->lo_done);
 	return 0;
 
  out_putf:
@@ -773,10 +773,10 @@ static int loop_clr_fd(struct loop_devic
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_state = Lo_rundown;
 	if (atomic_dec_and_test(&lo->lo_pending))
-		up(&lo->lo_bh_mutex);
+		complete(&lo->lo_bh_done);
 	spin_unlock_irq(&lo->lo_lock);
 
-	down(&lo->lo_sem);
+	wait_for_completion(&lo->lo_done);
 
 	lo->lo_backing_file = NULL;
 
@@ -1153,8 +1153,8 @@ int __init loop_init(void)
 		if (!lo->lo_queue)
 			goto out_mem4;
 		init_MUTEX(&lo->lo_ctl_mutex);
-		init_MUTEX_LOCKED(&lo->lo_sem);
-		init_MUTEX_LOCKED(&lo->lo_bh_mutex);
+		init_completion(&lo->lo_done);
+		init_completion(&lo->lo_bh_done);
 		lo->lo_number = i;
 		spin_lock_init(&lo->lo_lock);
 		disk->major = LOOP_MAJOR;
--- linux/include/linux/loop.h.orig
+++ linux/include/linux/loop.h
@@ -58,9 +58,9 @@ struct loop_device {
 	struct bio 		*lo_bio;
 	struct bio		*lo_biotail;
 	int			lo_state;
-	struct semaphore	lo_sem;
+	struct completion	lo_done;
+	struct completion	lo_bh_done;
 	struct semaphore	lo_ctl_mutex;
-	struct semaphore	lo_bh_mutex;
 	atomic_t		lo_pending;
 
 	request_queue_t		*lo_queue;


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-13  6:47   ` Ingo Molnar
@ 2004-12-14  0:46     ` Fernando Lopez-Lezcano
  2004-12-14  4:42       ` K.R. Foley
  0 siblings, 1 reply; 30+ messages in thread
From: Fernando Lopez-Lezcano @ 2004-12-14  0:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, K.R. Foley, linux-kernel,
	Florian Schmidt, Lee Revell, Rui Nuno Capela, Shane Shrybman,
	Esben Nielsen, Thomas Gleixner, Michal Schmidt

On Sun, 2004-12-12 at 22:47, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> 
> > Something that just happened to me: running 0.7.32-14
> > (PREEMPT_DESKTOP) and trying to install 0.7.32-19 from a custom built
> > rpm package completely hangs the machine (p4 laptop - I tried twice).
> > No clues left behind. If I boot into 0.7.32-9 I can install 0.7.32-19
> > with no problems. 
> 
> does 0.7.32-19 work better if you reverse (patch -R) the loop.h and
> loop.c bits (see below)?

Running 0.7.32-19 (no changes) I managed to install 0.7.32-20 with no
problems... probably a problem in -14 that was somehow fixed in later
releases. 

-- Fernando




* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-14  0:46     ` Fernando Lopez-Lezcano
@ 2004-12-14  4:42       ` K.R. Foley
  2004-12-14  8:47         ` Rui Nuno Capela
  0 siblings, 1 reply; 30+ messages in thread
From: K.R. Foley @ 2004-12-14  4:42 UTC (permalink / raw)
  To: Fernando Lopez-Lezcano
  Cc: Ingo Molnar, Mark_H_Johnson, Amit Shah, Karsten Wiese, Bill Huey,
	Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Rui Nuno Capela, Shane Shrybman,
	Esben Nielsen, Thomas Gleixner, Michal Schmidt

Fernando Lopez-Lezcano wrote:
> On Sun, 2004-12-12 at 22:47, Ingo Molnar wrote:
> 
>>* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>
>>
>>>Something that just happened to me: running 0.7.32-14
>>>(PREEMPT_DESKTOP) and trying to install 0.7.32-19 from a custom built
>>>rpm package completely hangs the machine (p4 laptop - I tried twice).
>>>No clues left behind. If I boot into 0.7.32-9 I can install 0.7.32-19
>>>with no problems. 
>>
>>does 0.7.32-19 work better if you reverse (patch -R) the loop.h and
>>loop.c bits (see below)?
> 
> 
> Running 0.7.32-19 (no changes) I managed to install 0.7.32-20 with no
> problems... probably a problem in -14 that was somehow fixed in later
> releases. 
> 
> -- Fernando

Possibly. I have had the occasional problem with running make install 
locking one of my systems. Rebooting and running make install again 
works fine in my case. It is by no means a regular occurrence, even 
installing 2 or 3 new kernels daily on 3 different machines.

kr


* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-14  4:42       ` K.R. Foley
@ 2004-12-14  8:47         ` Rui Nuno Capela
  2004-12-14 11:35           ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Rui Nuno Capela @ 2004-12-14  8:47 UTC (permalink / raw)
  To: K.R. Foley
  Cc: Fernando Lopez-Lezcano, Ingo Molnar, mark_h_johnson, Amit Shah,
	Karsten Wiese, Bill Huey, Adam Heath, emann, Gunther Persoons,
	linux-kernel, Florian Schmidt, Lee Revell, Shane Shrybman,
	Esben Nielsen, Thomas Gleixner, Michal Schmidt

K.R. Foley wrote:
> Fernando Lopez-Lezcano wrote:
>> On Sun, 2004-12-12 at 22:47, Ingo Molnar wrote:
>>
>>>* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>>
>>>
>>>>Something that just happened to me: running 0.7.32-14
>>>>(PREEMPT_DESKTOP) and trying to install 0.7.32-19 from a custom built
>>>>rpm package completely hangs the machine (p4 laptop - I tried twice).
>>>>No clues left behind. If I boot into 0.7.32-9 I can install 0.7.32-19
>>>>with no problems.
>>>
>>>does 0.7.32-19 work better if you reverse (patch -R) the loop.h and
>>>loop.c bits (see below)?
>>
>>
>> Running 0.7.32-19 (no changes) I managed to install 0.7.32-20 with no
>> problems... probably a problem in -14 that was somehow fixed in later
>> releases.
>>
>> -- Fernando
>
> Possibly. I have had the occasional problem with running make install
> locking one of my systems. Rebooting and running make install again
> works fine in my case. It is by no means a regular occurrence, even
> installing 2 or 3 new kernels daily on 3 different machines.
>

Isn't this tightly related to mkinitrd sometimes hanging on mount -o
loop, which I've reported a couple of times before? It used to hang
every other time I did a new kernel install, but lately it seems to be OK
(RT-V0.9.32-19 and -20).
-- 
rncbc aka Rui Nuno Capela
rncbc@rncbc.org



* Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15
  2004-12-14  8:47         ` Rui Nuno Capela
@ 2004-12-14 11:35           ` Ingo Molnar
  2004-12-27 14:35             ` Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15) Esben Nielsen
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2004-12-14 11:35 UTC (permalink / raw)
  To: Rui Nuno Capela
  Cc: K.R. Foley, Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah,
	Karsten Wiese, Bill Huey, Adam Heath, emann, Gunther Persoons,
	linux-kernel, Florian Schmidt, Lee Revell, Shane Shrybman,
	Esben Nielsen, Thomas Gleixner, Michal Schmidt


* Rui Nuno Capela <rncbc@rncbc.org> wrote:

> Isn't this tightly related to mkinitrd sometimes hanging while on
> mount -o loop, that I've been reporting a couple of times before? It
> used to hang on any other time I do a new kernel install, but lately
> it seems to be OK (RT-V0.9.32-19 and -20).

yeah, i've added Thomas Gleixner's earlier semaphore->completion
conversion to the loop device, to -19 or -18.

	Ingo


* Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-14 11:35           ` Ingo Molnar
@ 2004-12-27 14:35             ` Esben Nielsen
  2004-12-27 15:27               ` Steven Rostedt
  2005-01-28  7:38               ` Ingo Molnar
  0 siblings, 2 replies; 30+ messages in thread
From: Esben Nielsen @ 2004-12-27 14:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	mark_h_johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Lee Revell, Shane Shrybman, Thomas Gleixner, Michal Schmidt

I haven't seen much traffic on real-time preemption lately. Is it due
to Christmas or lost interest?

I noticed that you changed rw-locks to behave quite differently under
real-time preemption: they basically work like normal locks now. I.e. there
can only be one reader task within each region. That task can however lock
the region recursively. I wanted to start looking at fixing that because
it ought to hurt scalability quite a bit - and even on UP create a few
unneeded task-switches. However, the more I think about it, the bigger the
problem looks:

First, let me describe how I see a read-write lock. It has 3 states:
a) unlocked
b) locked by n readers
c) locked by 1 writer
There can either be 1 writer within the protected region or n>=0
readers within the region. When a writer wants to take the lock,
calling down_write(), it has to wait until the read count is 0. When a
reader wants to take the lock, calling down_read(), it only has to wait
until the writer is done - there is no need to wait for the other
readers.

Now in a real-time system down_X() ought to have a deterministic
blocking time. It should be easy to make down_read() deterministic: if
there is a writer, let it inherit the calling reader's priority.
However, down_write() is hard to make deterministic. Even if we assume
that the lock not only keeps track of the number of readers but keeps a
list of all the reader threads within the region, it can traverse the list
and boost the priority of all those threads. If there are n readers when
down_write() is called, the blocking time would be
 O(ceil(n/#cpus))
- which is unbounded, as n is not known!

Having a rw-lock with a deterministic down_read() but non-deterministic
down_write() would be very useful in a lot of cases. The characteristic is
that the protected data structure is relatively static, is going
to be used by a lot of RT readers, and the updates don't have to be done
with any real-time requirements.
However, there is no way to know in general which locks in the kernel can
be allowed to work like that and which can't. A good compromise would be to
limit the number of readers in a lock to the number of CPUs on the
system. That would make the system scale over several CPUs without hitting
unneeded contention on read-locks and still have a deterministic
down_write().

down_write() shall then do the following: boost the priority of all the
active readers to the priority of the caller. This will in turn distribute
the readers over the CPUs of the system, assuming no higher priority RT
tasks are running. All the reader tasks will then run to up_read() in
O(1) time, as they can all run in parallel - assuming there is no ugly
nested locking, of course!
down_read() should first check if there is a writer. If there is,
boost it and wait. If there isn't, but there is no room for another reader,
boost one of the readers so it will run to up_read().

An extra bonus of having the number of readers bounded: the various
structures needed for making the list of readers can be allocated once.
There is no need to call kmalloc() from within down_read() to get a list
element for the lock's list of readers.

I don't know whether I will have time for coding this soon. In any
case, I do not have an SMP system, so I can't really test it even if I
get time to code it :-(

Esben



On Tue, 14 Dec 2004, Ingo Molnar wrote:

> 
> * Rui Nuno Capela <rncbc@rncbc.org> wrote:
> 
> > Isn't this tightly related to mkinitrd sometimes hanging while on
> > mount -o loop, that I've been reporting a couple of times before? It
> > used to hang on any other time I do a new kernel install, but latetly
> > it seems to be OK (RT-V0.9.32-19 and -20).
> 
> yeah, i've added Thomas Gleixner's earlier semaphore->completion
> conversion to the loop device, to -19 or -18.
> 
> 	Ingo
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 14:35             ` Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15) Esben Nielsen
@ 2004-12-27 15:27               ` Steven Rostedt
  2004-12-27 16:23                 ` Esben Nielsen
  2005-01-28  7:38               ` Ingo Molnar
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2004-12-27 15:27 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	Mark Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, LKML, Florian Schmidt, Lee Revell,
	Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 2004-12-27 at 15:35 +0100, Esben Nielsen wrote:
> I haven't seen much traffic on real-time preemption lately. Is it due
> to Christmas or lost interest?
> 

I think they are on vacation :-)

> I noticed that you changed rw-locks to behave quite differently under
> real-time preemption: they basically work like normal locks now. I.e. there
> can only be one reader task within each region. It can, however, lock
> the region recursively. I wanted to start looking at fixing that, because
> it ought to hurt scalability quite a bit - and even on UP create a few
> unneeded task switches. However, the more I think about it, the bigger the
> problem:
> 
> First, let me describe how I see a read-write lock. It has 3 states:
> a) unlocked
> b) locked by n readers
> c) locked by 1 writer
> There can either be 1 writer within the protected region or n>=0
> readers within the region. When a writer wants to take the lock,
> calling down_write(), it has to wait until the read count is 0. When a
> reader wants to take the lock, calling down_read(), it only has to wait
> until the writer is done - there is no need to wait for the other
> readers.
> 
> Now in a real-time system down_X() ought to have a deterministic
> blocking time. It should be easy to make down_read() deterministic: If
> there is a writer let it inherit the calling reader's priority.
> However, down_write() is hard to make deterministic. Even if we assume
> that the lock not only keeps track of the number of readers but keeps a
> list of all the reader threads within the region, so that it can traverse
> the list and boost the priority of all those threads, the blocking time
> with n readers at the moment down_write() is called would still be
>  O(ceil(n/#cpus))
> - which is unbounded, as n is not known!
> 
> Having a rw-lock with a deterministic down_read() but a non-deterministic
> down_write() would be very useful in a lot of cases. The characteristic is
> that the data structure being protected is relatively static, is going
> to be used by a lot of RT readers, and the updates don't have to be done
> under any real-time requirements.
> However, there is no way to know in general which locks in the kernel can
> be allowed to work like that and which can't. A good compromise would be
> to limit the number of readers in a lock to the number of CPUs on the
> system. That would let the system scale over several CPUs without hitting
> unneeded congestion on read-locks and still have a deterministic
> down_write().
> 

Why limit it to just the number of CPUs? Make it a configurable limit
instead. I would say the default could be 2*CPUs. The reason is that once
you limit the number of readers, you have bounded down_write(). Even if
the number of readers allowed is 100, down_write() is now bound to
O(ceil(n/#cpus)) as you said, but now n is known. Add a
CONFIG_ALLOWED_READERS or something to that effect, and it would be easy
to see what a good optimal configuration is (assuming you have the
proper tests).

> down_write() shall then do the following: boost the priority of all the
> active readers to the priority of the caller. This will in turn distribute
> the readers over the CPUs of the system, assuming no higher-priority RT
> tasks are running. All the reader tasks will then run to up_read() in
> O(1) time, as they can all run in parallel - assuming there is no ugly
> nested locking, of course!
> down_read() should first check if there is a writer. If there is, boost
> it and wait. If there isn't, but there is no room for another reader,
> boost one of the readers such that it will run to up_read().
> 
> An extra bonus of having the number of readers bounded: the various
> structures needed for making the list of readers can be allocated once.
> There is no need to call kmalloc() from within down_read() to get a list
> element for the lock's list of readers.
> 
> I don't know whether I will have time for coding this soon. In any case
> I do not have an SMP system, so I can't really test it even if I do
> find time to code it :-(
> 

I have two SMP machines that I can test on; unfortunately, they both
have NVIDIA cards, so I can't use them with X unless I go back to the
default driver. Which I would do, but I really like the 3d graphics ;-)


-- Steve

> Esben
> 
> 
> 
> On Tue, 14 Dec 2004, Ingo Molnar wrote:
> 
> > 
> > * Rui Nuno Capela <rncbc@rncbc.org> wrote:
> > 
> > > Isn't this tightly related to mkinitrd sometimes hanging while on
> > > mount -o loop, that I've been reporting a couple of times before? It
> > > used to hang on any other time I do a new kernel install, but latetly
> > > it seems to be OK (RT-V0.9.32-19 and -20).
> > 
> > yeah, i've added Thomas Gleixner's earlier semaphore->completion
> > conversion to the loop device, to -19 or -18.
> > 
> > 	Ingo
-- 
Steven Rostedt
Senior Engineer
Kihon Technologies



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 15:27               ` Steven Rostedt
@ 2004-12-27 16:23                 ` Esben Nielsen
  2004-12-27 16:39                   ` Steven Rostedt
  2004-12-28 21:42                   ` Lee Revell
  0 siblings, 2 replies; 30+ messages in thread
From: Esben Nielsen @ 2004-12-27 16:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	Mark Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, LKML, Florian Schmidt, Lee Revell,
	Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 27 Dec 2004, Steven Rostedt wrote:

> On Mon, 2004-12-27 at 15:35 +0100, Esben Nielsen wrote:
> > I haven't seen much traffic on real-time preemption lately. Is it due
> > to Christmas or lost interest?
> > 
> 
> I think they are on vacation :-)
>
Oh, I thought I would have fun with some kernel programming during my
vacation! :-)
 
> > [...]
> 
> Why limit it to just the number of CPUs? Make it a configurable limit
> instead. I would say the default could be 2*CPUs. The reason is that once
> you limit the number of readers, you have bounded down_write(). Even if
> the number of readers allowed is 100, down_write() is now bound to
> O(ceil(n/#cpus)) as you said, but now n is known. Add a
> CONFIG_ALLOWED_READERS or something to that effect, and it would be easy
> to see what a good optimal configuration is (assuming you have the
> proper tests).
> 
I agree with you that it ought to be configurable. But if you set it to
something like 100, it is _not_ deterministic in practice. I.e. during your
tests you have a really hard time getting 100 readers to the critical
point. Most likely you only have a handful. Then you deploy the system,
a webserver presenting data read under a rw-lock gets overloaded and
spawns 100 worker tasks. Bang! Your RT application doesn't run!
It doesn't help you to have something bounded if the bound is so insanely
high that you never reach it in tests. And if you try to calculate the
worst-case scenarios for such a system, you will find it doesn't
schedule. I.e. you have to buy a bigger CPU or do some other drastic
thing!

A limit like 1 reader per CPU, on the other hand, behaves like a
normal mutex with priority inheritance wrt. determinism. And it scales like
the stock Linux rw-lock, which in practice is also limited to 1 task per
CPU, as preemption is switched off :-)

> [...] 
> I have two SMP machines that I can test on, unfortunately, they both
> have NVIDIA cards, so I cant use them with X, unless I go back to the
> default driver. Which I would do, but I really like the 3d graphics ;-)
> 

Well, this kind of thing isn't something you want to combine with 3d
graphics right away anyway!
I have even had crashes on 2.4.xx with an NVidia card and
hyperthreading on a machine I helped install :-(
I am afraid NVidia cards plus preempt realtime is far in the future :-(

> 
> -- Steve
> 

Esben



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 16:23                 ` Esben Nielsen
@ 2004-12-27 16:39                   ` Steven Rostedt
  2004-12-27 21:06                     ` Bill Huey
  2004-12-28 21:42                   ` Lee Revell
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2004-12-27 16:39 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	Mark Johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, LKML, Florian Schmidt, Lee Revell,
	Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 2004-12-27 at 17:23 +0100, Esben Nielsen wrote:


> > > [...]
> > 
> > Why limit it to just the number of CPUs? Make it a configurable limit
> > instead. I would say the default could be 2*CPUs. The reason is that once
> > you limit the number of readers, you have bounded down_write(). Even if
> > the number of readers allowed is 100, down_write() is now bound to
> > O(ceil(n/#cpus)) as you said, but now n is known. Add a
> > CONFIG_ALLOWED_READERS or something to that effect, and it would be easy
> > to see what a good optimal configuration is (assuming you have the
> > proper tests).
> > 
> I agree with you that it ought to be configurable. But if you set it to
> something like 100, it is _not_ deterministic in practice. I.e. during your
> tests you have a really hard time getting 100 readers to the critical
> point. Most likely you only have a handful. Then you deploy the system,
> a webserver presenting data read under a rw-lock gets overloaded and
> spawns 100 worker tasks. Bang! Your RT application doesn't run!
> It doesn't help you to have something bounded if the bound is so insanely
> high that you never reach it in tests. And if you try to calculate the
> worst-case scenarios for such a system, you will find it doesn't
> schedule. I.e. you have to buy a bigger CPU or do some other drastic
> thing!
> 

I wasn't saying that 100 was a GOOD number; it was just an example. I'm
one of those who don't like the program, application, or kernel to limit
how you want to shoot yourself in the foot. But always have the default
set to something that prevents the idiot from doing something too
idiotic.

> A limit like 1 reader per CPU, on the other hand, behaves like a
> normal mutex with priority inheritance wrt. determinism. And it scales like
> the stock Linux rw-lock, which in practice is also limited to 1 task per
> CPU, as preemption is switched off :-)
> 

Do you think 2 per CPU is too high? I'd figure that reads usually happen
way more often than writes, and writes can usually last longer than reads,
but with the default set to two you can test your code on a UP machine.

> > [...] 
> > I have two SMP machines that I can test on; unfortunately, they both
> > have NVIDIA cards, so I can't use them with X unless I go back to the
> > default driver. Which I would do, but I really like the 3d graphics ;-)
> > 
> 
> Well, this kind of thing isn't something you want to combine with 3d
> graphics right away anyway!
> I have even had crashes on 2.4.xx with an NVidia card and
> hyperthreading on a machine I helped install :-(
> I am afraid NVidia cards plus preempt realtime is far in the future :-(
> 

Actually, I've had some success with NVIDIA on all my kernels (except of
course the realtime ones). Unfortunately, there are little crashes
here and there, but those usually happen with screen savers, so I don't
lose data. I just come back to the machine and say WTF, I'm at a login
prompt! I am one of those who save everything within seconds of
modifying it, and use subversion commits like crazy :-)

-- Steve



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 16:39                   ` Steven Rostedt
@ 2004-12-27 21:06                     ` Bill Huey
  2004-12-27 21:48                       ` Valdis.Kletnieks
                                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Bill Huey @ 2004-12-27 21:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Esben Nielsen, Ingo Molnar, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, Mark Johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, LKML,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt

On Mon, Dec 27, 2004 at 11:39:20AM -0500, Steven Rostedt wrote:
> Actually, I've had some success with NVIDIA on all my kernels (except of

Doesn't the NVidia driver use its own version of DRM/DRI?
If so, did you tell it to use the Linux kernel version of that
driver?

> course the realtime ones). Unfortunately, there are the little crashes
> here and there, but those usually happen with screen savers so I don't

I was just having a discussion about this last night with a friend
of mine, and I'm going to pose this question to you and others.

Is a real-time enabled kernel still relevant for high-performance
video, even with GPUs being as fast as they are these days?

The context that I'm working with is that I was told (I've been out of
gaming for a long time now) that GPUs are so fast these days that
shortage of frame rate isn't a problem any more. An RTOS would be
able to deliver data/instructions to the GPU within a much tighter
time period and could deliver better, more consistent frame rates.

Does this assertion still apply or not? Why? (for either answer)

bill



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 21:06                     ` Bill Huey
@ 2004-12-27 21:48                       ` Valdis.Kletnieks
  2004-12-28 21:59                       ` Lee Revell
  2005-01-04 15:25                       ` Andrew McGregor
  2 siblings, 0 replies; 30+ messages in thread
From: Valdis.Kletnieks @ 2004-12-27 21:48 UTC (permalink / raw)
  To: Bill Huey
  Cc: Steven Rostedt, Esben Nielsen, Ingo Molnar, Rui Nuno Capela,
	K.R. Foley, Fernando Lopez-Lezcano, Mark Johnson, Amit Shah,
	Karsten Wiese, Adam Heath, emann, Gunther Persoons, LKML,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt


On Mon, 27 Dec 2004 13:06:14 PST, Bill Huey said:

> Is a real-time enabled kernel still relevant for high-performance
> video, even with GPUs being as fast as they are these days?

More to the point - can an RT kernel help *last* year's model?  My
laptop only has a GeForce4 440 Go (which is closer to a GeForce2 in
reality) and a 1.6GHz Pentium 4.  So it isn't any problem at all
to find xscreensaver GL hacks that bring it to its knees.

Even the venerable 'glxgears' drops down to about 40 FPS in an 800x600
window.  I'm sure the average game has a *lot* more polygons in it than
glxgears does.  xscreensaver's 'sierpinski3d' drops down to 18 FPS when it
gets up to 16K polygons.

Linux has long been renowned for its ability to Get Stuff Done on much older
and less capable hardware than the stuff from Redmond requires.  Got an old
box that crawled under W2K and that Win/XP won't even install on?  Toss the
current RedHat or SuSE on it, and it goes...

Would be *really* nice if we could find similar tricks on the graphics side. ;)

> The context that I'm working with is that I was told (I've been out of
> gaming for a long time now) that GPUs are so fast these days that
> shortage of frame rate isn't a problem any more. An RTOS would be
> able to deliver data/instructions to the GPU within a much tighter
> time period and could deliver better, more consistent frame rates.
> 
> Does this assertion still apply or not? Why? (for either answer)

Shortage of frame rate is *always* a problem.  No matter how many
millions of polygons/sec you can push out the door, somebody will want
to do N+25% per second.  Ask yourself why SGI was *EVER* able to sell
a machine with more than one InfiniteReality graphics engine on it, and
then ask yourself what the people who were using 3 IR pipes 5-6 years
ago are looking to use for graphics *this* year.



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 16:23                 ` Esben Nielsen
  2004-12-27 16:39                   ` Steven Rostedt
@ 2004-12-28 21:42                   ` Lee Revell
  1 sibling, 0 replies; 30+ messages in thread
From: Lee Revell @ 2004-12-28 21:42 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Steven Rostedt, Ingo Molnar, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, Mark Johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, LKML,
	Florian Schmidt, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 2004-12-27 at 17:23 +0100, Esben Nielsen wrote:
> I am afraid NVidia cards and preempt realtime is far in the future :-(

Not necessarily.  I am sure many of Nvidia's potential customers want to
do high-end simulation, and that requires an RT-capable OS and good
graphics hardware.  If many people end up using RT preempt for high-end
simulator applications, like several who have posted on this list, then
it seems like it would be a good fit.

Lee



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 21:06                     ` Bill Huey
  2004-12-27 21:48                       ` Valdis.Kletnieks
@ 2004-12-28 21:59                       ` Lee Revell
  2005-01-04 15:25                       ` Andrew McGregor
  2 siblings, 0 replies; 30+ messages in thread
From: Lee Revell @ 2004-12-28 21:59 UTC (permalink / raw)
  To: Bill Huey
  Cc: Steven Rostedt, Esben Nielsen, Ingo Molnar, Rui Nuno Capela,
	K.R. Foley, Fernando Lopez-Lezcano, Mark Johnson, Amit Shah,
	Karsten Wiese, Adam Heath, emann, Gunther Persoons, LKML,
	Florian Schmidt, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Mon, 2004-12-27 at 13:06 -0800, Bill Huey wrote:
> I was just having a discussion about this last night with a friend
> of mine, and I'm going to pose this question to you and others.
> 
> Is a real-time enabled kernel still relevant for high-performance
> video, even with GPUs being as fast as they are these days?
>
> The context that I'm working with is that I was told (I've been out of
> gaming for a long time now) that GPUs are so fast these days that
> shortage of frame rate isn't a problem any more. An RTOS would be
> able to deliver data/instructions to the GPU within a much tighter
> time period and could deliver better, more consistent frame rates.
> 
> Does this assertion still apply or not? Why? (for either answer)

Yes, an RTOS certainly helps.  Otherwise you cannot guarantee a minimum
frame rate - if a long-running disk ISR fires then you are screwed,
because just as with low-latency audio you have a SCHED_FIFO userspace
process that is feeding data to the GPU at a constant rate.  It's a worse
problem with audio, because you absolutely cannot drop a frame or you
will hear it.  With video it's likely to be imperceptible.

Lee



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 21:06                     ` Bill Huey
  2004-12-27 21:48                       ` Valdis.Kletnieks
  2004-12-28 21:59                       ` Lee Revell
@ 2005-01-04 15:25                       ` Andrew McGregor
  2 siblings, 0 replies; 30+ messages in thread
From: Andrew McGregor @ 2005-01-04 15:25 UTC (permalink / raw)
  To: Bill Huey
  Cc: Rui Nuno Capela, LKML, Ingo Molnar, Gunther Persoons,
	Thomas Gleixner, Karsten Wiese, Michal Schmidt, Shane Shrybman,
	Mark Johnson, emann, Adam Heath, Steven Rostedt, K.R. Foley,
	Lee Revell, Esben Nielsen, Amit Shah, Fernando Lopez-Lezcano,
	Florian Schmidt


On 28/12/2004, at 10:06 AM, Bill Huey (hui) wrote:

> On Mon, Dec 27, 2004 at 11:39:20AM -0500, Steven Rostedt wrote:
>> Actually, I've had some success with NVIDIA on all my kernels (except 
>> of
>
> Doesn't the NVidia driver use its own version of DRM/DRI?
> If so, did you tell it to use the Linux kernel version of that
> driver?
>
>> course the realtime ones). Unfortunately, there are the little crashes
>> here and there, but those usually happen with screen savers so I don't
>
> I was just having a discussion about this last night with a friend
> of mine and I'm going to pose this question to you and others.
>
> Is a real-time enabled kernel still relevant for high-performance
> video, even with GPUs being as fast as they are these days?

It is if you want to do any audio at the same time, as you usually do.

Andrew



* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2004-12-27 14:35             ` Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15) Esben Nielsen
  2004-12-27 15:27               ` Steven Rostedt
@ 2005-01-28  7:38               ` Ingo Molnar
  2005-01-28 11:56                 ` William Lee Irwin III
                                   ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-01-28  7:38 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	mark_h_johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Lee Revell, Shane Shrybman, Thomas Gleixner, Michal Schmidt


* Esben Nielsen <simlo@phys.au.dk> wrote:

> I noticed that you changed rw-locks to behave quite differently under
> real-time preemption: they basically work like normal locks now. I.e.
> there can only be one reader task within each region. It can, however,
> lock the region recursively. [...]

correct.

> [...] I wanted to start looking at fixing that, because it ought to
> hurt scalability quite a bit - and even on UP create a few unneeded
> task switches. [...]

no, it's not a big scalability problem. rwlocks are really a mistake -
if you want scalability and spinlocks/semaphores are not enough, then one
should either use per-CPU locks or lockless structures. rwlocks/rwsems
are very unlikely to help much.

> However, the more I think about it the bigger the problem:

yes, the complexity needed to make it perform in a deterministic manner
is why i introduced this (major!) simplification of locking. It turns out
that most of the time the actual use of rwlocks matches this simplified
'owner-recursive exclusive lock' semantics, so we are lucky.

look at what kind of worst-case scenarios there may already be with
multiple spinlocks (blocker.c). With rwlocks that just gets insane.

	Ingo


* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28  7:38               ` Ingo Molnar
@ 2005-01-28 11:56                 ` William Lee Irwin III
  2005-01-28 15:28                   ` Ingo Molnar
  2005-01-28 19:18                 ` Trond Myklebust
  2005-01-30 22:03                 ` Esben Nielsen
  2 siblings, 1 reply; 30+ messages in thread
From: William Lee Irwin III @ 2005-01-28 11:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt

* Esben Nielsen <simlo@phys.au.dk> wrote:
>> [...] I wanted to start looking at fixing that, because it ought to
>> hurt scalability quite a bit - and even on UP create a few unneeded
>> task switches. [...]

On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough, then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
> are very unlikely to help much.

I wouldn't be so sure about that. SGI is already implicitly relying on
the parallel holding of rwsems for the lockless pagefaulting, and
Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
while, both for large performance gains.


* Esben Nielsen <simlo@phys.au.dk> wrote:
>> However, the more I think about it the bigger the problem:

On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> yes, the complexity needed to make it perform in a deterministic manner
> is why i introduced this (major!) simplification of locking. It turns out
> that most of the time the actual use of rwlocks matches this simplified
> 'owner-recursive exclusive lock' semantics, so we are lucky.
> look at what kind of worst-case scenarios there may already be with
> multiple spinlocks (blocker.c). With rwlocks that just gets insane.

tasklist_lock is one large exception; it's meant for concurrency there,
and it even gets sufficient concurrency to starve the write side.

Try test_remap.c on mainline vs. -mm to get a microbenchmark-level
notion of the importance of mapping->tree_lock being an rwlock (IIRC
you were cc:'d in at least some of those threads).

net/ has numerous rwlocks, which appear to frequently be associated
with hashtables, and at least some have some relevance to performance.

Are you suggesting that lockless alternatives to mapping->tree_lock,
mm->mmap_sem, and tasklist_lock should be pursued now?


-- wli


* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 11:56                 ` William Lee Irwin III
@ 2005-01-28 15:28                   ` Ingo Molnar
  2005-01-28 15:55                     ` William Lee Irwin III
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-01-28 15:28 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Jan 28, 2005 at 08:38:56AM +0100, Ingo Molnar wrote:
> > no, it's not a big scalability problem. rwlocks are really a mistake -
> > if you want scalability and spinlocks/semaphores are not enough, then one
> > should either use per-CPU locks or lockless structures. rwlocks/rwsems
> > are very unlikely to help much.
> 
> I wouldn't be so sure about that. SGI is already implicitly relying on
> the parallel holding of rwsems for the lockless pagefaulting, and
> Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
> while, both for large performance gains.

i don't really buy it. Any rwlock type of locking causes global cacheline
bounces. It can make a positive scalability difference only if the
read-lock hold time is large, at which point RCU would likely have
significantly higher performance. There _may_ be an intermediate locking
pattern, long-held but with a higher mix of write-locking, where
rwlocks/rwsems have a performance advantage over RCU or
spinlocks.

Also this is about PREEMPT_RT, mainly aimed towards embedded systems,
and at most aimed towards small (dual-CPU) SMP systems, not the really
big systems.

But the main argument wrt. PREEMPT_RT stands and is independent of any
scalability properties: rwlocks/rwsems have such bad determinism
that they are almost impossible to implement in a sane way.

	Ingo


* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 15:28                   ` Ingo Molnar
@ 2005-01-28 15:55                     ` William Lee Irwin III
  2005-01-28 16:16                       ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: William Lee Irwin III @ 2005-01-28 15:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> I wouldn't be so sure about that. SGI is already implicitly relying on
>> the parallel holding of rwsems for the lockless pagefaulting, and
>> Oracle has been pushing on mapping->tree_lock becoming an rwlock for a
>> while, both for large performance gains.

On Fri, Jan 28, 2005 at 04:28:02PM +0100, Ingo Molnar wrote:
> i don't really buy it. Any rwlock type of locking causes global cacheline
> bounces. It can make a positive scalability difference only if the
> read-lock hold time is large, at which point RCU would likely have
> significantly higher performance. There _may_ be an intermediate locking
> pattern, long-held but with a higher mix of write-locking, where
> rwlocks/rwsems have a performance advantage over RCU or
> spinlocks.

The performance relative to mutual exclusion is quantifiable and very
reproducible. These results have left people making arguments similar to
yours above baffled. Systems as small as 4 logical CPUs feel these
effects strongly, and the effect appears to scale almost linearly with the
number of CPUs. It may be worth consulting an x86 processor architect
or similar to get an idea of why the counterargument fails. I'm rather
interested in hearing why as well, as I believed the cacheline-bounce
argument until presented with incontrovertible evidence to the contrary.

As far as performance relative to RCU goes, I suspect cases where
write-side latency is important will arise for these. Other lockless
methods are probably more appropriate, and are more likely to dominate
rwlocks as expected. For instance, a reimplementation of the radix
trees for lockless insertion and traversal (c.f. lockless pagetable
patches for examples of how that's carried out) is plausible, where RCU
memory overhead in struct page is not.


On Fri, Jan 28, 2005 at 04:28:02PM +0100, Ingo Molnar wrote:
> Also this is about PREEMPT_RT, mainly aimed towards embedded systems,
> and at most aimed towards small (dual-CPU) SMP systems, not the really
> big systems.
> But, the main argument wrt. PREEMPT_RT stands and is independent of any
> scalability properties: rwlocks/rwsems have such bad deterministic
> behavior that they are almost impossible to implement in a sane way.

I suppose if it's not headed toward mainline the counterexamples don't
really matter. I don't have much to say about the RT-related issues,
though I'm aware of priority inheritance's infeasibility for the
rwlock/rwsem case.


-- wli

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 15:55                     ` William Lee Irwin III
@ 2005-01-28 16:16                       ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2005-01-28 16:16 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt


* William Lee Irwin III <wli@holomorphy.com> wrote:

> The performance relative to mutual exclusion is quantifiable and very
> reproducible. [...]

yes, i don't doubt the results - my point is that it's not proven that
the other, more read-friendly types of locking underperform rwlocks. 
Obviously spinlocks and rwlocks have the same cache-bounce properties,
so rwlocks can outperform spinlocks if the read path overhead is higher
than that of a bounce, and reads are dominant. But it's still a poor
form of scalability. In fact, when the read path is really expensive
(larger than say 10-20 usecs) an rwlock can produce the appearance of
linear scalability, when compared to spinlocks.

> As far as performance relative to RCU goes, I suspect cases where
> write-side latency is important will arise for these. Other lockless
> methods are probably more appropriate, and are more likely to dominate
> rwlocks as expected. For instance, a reimplementation of the radix
> trees for lockless insertion and traversal (c.f. lockless pagetable
> patches for examples of how that's carried out) is plausible, where
> RCU memory overhead in struct page is not.

yeah.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28  7:38               ` Ingo Molnar
  2005-01-28 11:56                 ` William Lee Irwin III
@ 2005-01-28 19:18                 ` Trond Myklebust
  2005-01-28 19:45                   ` Ingo Molnar
  2005-01-28 21:13                   ` Lee Revell
  2005-01-30 22:03                 ` Esben Nielsen
  2 siblings, 2 replies; 30+ messages in thread
From: Trond Myklebust @ 2005-01-28 19:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt

On Fri, 28.01.2005 at 08:38 (+0100), Ingo Molnar wrote:

> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
are very unlikely to help much.

If you do have a highest interrupt case that causes all activity to
block, then rwsems may indeed fit the bill.

In the NFS client code we may use rwsems in order to protect stateful
operations against the (very infrequently used) server reboot recovery
code. The point is that when the server reboots, the server forces us to
block *all* requests that involve adding new state (e.g. opening an
NFSv4 file, or setting up a lock) while our client and others are
re-establishing their existing state on the server.

IOW: If you are planning on converting rwsems into a semaphore, you will
screw us over most royally, by converting the currently highly
infrequent scenario of a single task being able to access the server
into the common case.

Cheers,
  Trond
-- 
Trond Myklebust <trond.myklebust@fys.uio.no>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 19:18                 ` Trond Myklebust
@ 2005-01-28 19:45                   ` Ingo Molnar
  2005-01-28 23:25                     ` Bill Huey
  2005-01-28 21:13                   ` Lee Revell
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-01-28 19:45 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt


* Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> If you do have a highest interrupt case that causes all activity to
> block, then rwsems may indeed fit the bill.
> 
> In the NFS client code we may use rwsems in order to protect stateful
> operations against the (very infrequently used) server reboot recovery
> code. The point is that when the server reboots, the server forces us
> to block *all* requests that involve adding new state (e.g. opening an
> NFSv4 file, or setting up a lock) while our client and others are
> re-establishing their existing state on the server.

it seems the most scalable solution for this would be a global flag plus
per-CPU spinlocks (or per-CPU mutexes), which would still support the
requirements of this rare event. An rwsem really bounces around on SMP,
and it seems very unnecessary in the case you described.

possibly this could be formalised as an rwlock/rwsem implementation
that scales better. brlocks were such an attempt.

> IOW: If you are planning on converting rwsems into a semaphore, you
> will screw us over most royally, by converting the currently highly
> infrequent scenario of a single task being able to access the server
> into the common case.

nono, i have no such plans.

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 19:18                 ` Trond Myklebust
  2005-01-28 19:45                   ` Ingo Molnar
@ 2005-01-28 21:13                   ` Lee Revell
  1 sibling, 0 replies; 30+ messages in thread
From: Lee Revell @ 2005-01-28 21:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Ingo Molnar, Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Fri, 2005-01-28 at 11:18 -0800, Trond Myklebust wrote:
> In the NFS client code we may use rwsems in order to protect stateful
> operations against the (very infrequently used) server reboot recovery
> code. The point is that when the server reboots, the server forces us to
> block *all* requests that involve adding new state (e.g. opening an
> NFSv4 file, or setting up a lock) while our client and others are
> re-establishing their existing state on the server.

Hmm, when I was an ISP sysadmin I used to use this all the time.  NFS
mounts from the BSD/OS clients would start to act up under heavy web
server load and the cleanest way to get them to recover was to simulate
a reboot on the NetApp.  Of course Linux clients were unaffected, they
were just along for the ride ;-)

Lee


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28 19:45                   ` Ingo Molnar
@ 2005-01-28 23:25                     ` Bill Huey
  0 siblings, 0 replies; 30+ messages in thread
From: Bill Huey @ 2005-01-28 23:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Trond Myklebust, Esben Nielsen, Rui Nuno Capela, K.R. Foley,
	Fernando Lopez-Lezcano, mark_h_johnson, Amit Shah, Karsten Wiese,
	Bill Huey, Adam Heath, emann, Gunther Persoons, linux-kernel,
	Florian Schmidt, Lee Revell, Shane Shrybman, Thomas Gleixner,
	Michal Schmidt, Igor Manyilov (auriga)

On Fri, Jan 28, 2005 at 08:45:46PM +0100, Ingo Molnar wrote:
> * Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > If you do have a highest interrupt case that causes all activity to
> > block, then rwsems may indeed fit the bill.
> > 
> > In the NFS client code we may use rwsems in order to protect stateful
> > operations against the (very infrequently used) server reboot recovery
> > code. The point is that when the server reboots, the server forces us
> > to block *all* requests that involve adding new state (e.g. opening an
> > NFSv4 file, or setting up a lock) while our client and others are
> > re-establishing their existing state on the server.
> 
> it seems the most scalable solution for this would be a global flag plus
> per-CPU spinlocks (or per-CPU mutexes), which would still support the
> requirements of this rare event. An rwsem really bounces around on SMP,
> and it seems very unnecessary in the case you described.
> 
> possibly this could be formalised as an rwlock/rwsem implementation
> that scales better. brlocks were such an attempt.

From how I understand it, you'll have to have a global structure to
denote an exclusive operation and then take some additional cpumask_t
representing the spinlock set and use it to iterate over when doing a
PI chain operation.

Locking each individual parametric typed spinlock might require a
raw_spinlock to manipulate list structures, which, added up, is rather
heavyweight.

Not only that, you'd have to introduce a notion of it being counted,
since it could also be acquired/preempted by another higher priority
thread on that same processor.  Not having this semantic would make the
thread in that specific circumstance effectively non-preemptible (PI
scheduler indeterminacy), where the multiple-readers portion of a real
read/write (shared-exclusive) lock would have permitted this.

	http://people.lynuxworks.com/~bhuey/rt-share-exclusive-lock/rtsem.tgz.1208

The above is our attempt at getting real shared-exclusive lock semantics
in a blocking lock; it may still be incomplete and buggy. Igor is still
working on this and this is the latest that I have of his work. Getting
comments on this approach would be a good thing, as I/we (me/Igor)
believed from the start that this approach is correct.

Assuming that this is possible with the current approach, optimizing
it to avoid CPU ping-ponging is an important next step.

bill


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-28  7:38               ` Ingo Molnar
  2005-01-28 11:56                 ` William Lee Irwin III
  2005-01-28 19:18                 ` Trond Myklebust
@ 2005-01-30 22:03                 ` Esben Nielsen
  2005-01-30 23:59                   ` Kyle Moffett
  2 siblings, 1 reply; 30+ messages in thread
From: Esben Nielsen @ 2005-01-30 22:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rui Nuno Capela, K.R. Foley, Fernando Lopez-Lezcano,
	mark_h_johnson, Amit Shah, Karsten Wiese, Bill Huey, Adam Heath,
	emann, Gunther Persoons, linux-kernel, Florian Schmidt,
	Lee Revell, Shane Shrybman, Thomas Gleixner, Michal Schmidt

On Fri, 28 Jan 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > I noticed that you changed rw-locks to behave quite differently under
> > real-time preemption: they basically work like normal locks now, i.e.
> > there can only be one reader task within each region. This task can,
> > however, lock the region recursively. [...]
> 
> correct.
> 
> > [...] I wanted to start looking at fixing that because it ought to
> > hurt scalability quite a bit - and even on UP create a few unneeded
> > task switches. [...]
> 
> no, it's not a big scalability problem. rwlocks are really a mistake -
> if you want scalability and spinlocks/semaphores are not enough then one
> should either use per-CPU locks or lockless structures. rwlocks/rwsems
> are very unlikely to help much.
>
I agree that RCU ought to do the trick in a lot of cases. Unfortunately,
in a lot of code people have used an rwlock rather than RCU. I also like
the idea of rwlocks: many readers or just one writer. I don't see the need
to take that away from people. Here is an example which even on UP will
give problems without it:
You have a shared data structure, rarely updated, with many readers. A low
priority task is reading it. That is preempted by a high priority task
which finds out it can't read it -> priority inheritance, task switch. The
low priority task finishes the job -> priority set back, task switch. If
it was done with an rwlock, two task switches would have been saved.

 
> > However, the more I think about it the bigger the problem:
> 
> yes, that complexity to get it to perform in a deterministic manner is
> why i introduced this (major!) simplification of locking. It turns out
> that most of the time the actual use of rwlocks matches this simplified
> 'owner-recursive exclusive lock' semantics, so we are lucky.
> 
> look at what kind of worst-case scenarios there may already be with
> multiple spinlocks (blocker.c). With rwlocks that just gets insane.
> 
Yes it does. But one could make a compromise: the up_write() should _not_
be deterministic. In that case it would be very simple to implement.
up_read() could still be deterministic, as it would only involve boosting
one writer in the rare case such exists. That kind of locking would be
very useful in many real-time systems. Of course, RCU can do the job as
well, but it puts a lot of constraints on the code.

However, as Linux is a general OS there is no way to know whether a
specific lock needs to be deterministic wrt. writing or not, as the actual
application is not known at the time the lock type is specified.

> 	Ingo
> 

Esben


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
  2005-01-30 22:03                 ` Esben Nielsen
@ 2005-01-30 23:59                   ` Kyle Moffett
  0 siblings, 0 replies; 30+ messages in thread
From: Kyle Moffett @ 2005-01-30 23:59 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Shane Shrybman, K.R. Foley, Thomas Gleixner, emann,
	Rui Nuno Capela, Adam Heath, Gunther Persoons, Florian Schmidt,
	mark_h_johnson, linux-kernel, Fernando Lopez-Lezcano,
	Ingo Molnar, Karsten Wiese, Bill Huey, Amit Shah, Lee Revell,
	Michal Schmidt

For anybody who wants a good executive summary of RCU, see these:
http://lse.sourceforge.net/locking/rcupdate.html
http://lse.sourceforge.net/locking/rcu/HOWTO/intro.html#WHATIS

On Jan 30, 2005, at 17:03, Esben Nielsen wrote:
> I agree that RCU ought to do the trick in a lot of cases. Unfortunately,
> in a lot of code people have used an rwlock rather than RCU. I also like
> the idea of rwlocks: many readers or just one writer.

Well, RCU is nice because as long as there are no processes attempting
to modify the data, the performance is as though there were no locking
at all, which is better than the cacheline bouncing for rwlock
read-acquires, which must modify the rwlock data every time you acquire.
It's only when you need to modify the data that readers or other writers
must repeat their calculations when they find out that the data has
changed.  In the case of a reader and a writer, the performance
reduction is the same as a cmpxchg and the reader redoing their
calculations (if necessary).

> You have a shared data structure, rarely updated, with many readers. A
> low priority task is reading it. That is preempted by a high priority
> task which finds out it can't read it -> priority inheritance, task
> switch. The low priority task finishes the job -> priority set back,
> task switch. If it was done with an rwlock, two task switches would
> have been saved.

With RCU the high priority task (unlikely to be preempted) gets to run
all the way through with its calculation, and any low priority tasks are
the ones that will probably need to redo their calculations.

> Yes it does. But one could make a compromise: the up_write() should
> _not_ be deterministic. In that case it would be very simple to
> implement. up_read() could still be deterministic, as it would only
> involve boosting one writer in the rare case such exists. That kind of
> locking would be very useful in many real-time systems. Of course, RCU
> can do the job as well, but it puts a lot of constraints on the code.

Yeah, unfortunately it's harder to write good reliable RCU code than
good reliable rwlock code, because the semantics of RCU WRT memory
access are much more difficult, so more people write rwlock code that
needs to be cleaned up.  It's not like normal locking is easily
comprehensible either. :-\

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r !y?(-)
------END GEEK CODE BLOCK------



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2005-01-30 23:59 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-10 17:49 [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15 Mark_H_Johnson
2004-12-10 21:09 ` Ingo Molnar
2004-12-10 21:12 ` Ingo Molnar
2004-12-10 21:24 ` Ingo Molnar
2004-12-13  0:16 ` Fernando Lopez-Lezcano
2004-12-13  6:47   ` Ingo Molnar
2004-12-14  0:46     ` Fernando Lopez-Lezcano
2004-12-14  4:42       ` K.R. Foley
2004-12-14  8:47         ` Rui Nuno Capela
2004-12-14 11:35           ` Ingo Molnar
2004-12-27 14:35             ` Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15) Esben Nielsen
2004-12-27 15:27               ` Steven Rostedt
2004-12-27 16:23                 ` Esben Nielsen
2004-12-27 16:39                   ` Steven Rostedt
2004-12-27 21:06                     ` Bill Huey
2004-12-27 21:48                       ` Valdis.Kletnieks
2004-12-28 21:59                       ` Lee Revell
2005-01-04 15:25                       ` Andrew McGregor
2004-12-28 21:42                   ` Lee Revell
2005-01-28  7:38               ` Ingo Molnar
2005-01-28 11:56                 ` William Lee Irwin III
2005-01-28 15:28                   ` Ingo Molnar
2005-01-28 15:55                     ` William Lee Irwin III
2005-01-28 16:16                       ` Ingo Molnar
2005-01-28 19:18                 ` Trond Myklebust
2005-01-28 19:45                   ` Ingo Molnar
2005-01-28 23:25                     ` Bill Huey
2005-01-28 21:13                   ` Lee Revell
2005-01-30 22:03                 ` Esben Nielsen
2005-01-30 23:59                   ` Kyle Moffett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).