* 2.6.19-rt14 slowdown compared to 2.6.19
@ 2006-12-22 21:39 Chen, Tim C
  2006-12-23  0:38 ` Daniel Walker
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Chen, Tim C @ 2006-12-22 21:39 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, Siddha, Suresh B

Ingo,
 
We did some benchmarking on 2.6.19-rt14, compared it with the 2.6.19 
kernel and noticed several slowdowns.  The test machine is a 2-socket
Woodcrest machine with your default configuration.
 
Netperf TCP Streaming was slower by 40% (1 server and 1 client 
each bound to separate cpu cores on different sockets; network
loopback mode was used).  

Volanomark was slower by 80% (Server and Clients communicate with 
network loopback mode. Idle time goes from 1% to 60%)

Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)

Wonder if you have any suggestions on what could cause the slowdown.  
We've tried disabling CONFIG_NO_HZ and it didn't help much.

Thanks.

Tim


* Re: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-22 21:39 2.6.19-rt14 slowdown compared to 2.6.19 Chen, Tim C
@ 2006-12-23  0:38 ` Daniel Walker
  2006-12-23  2:43 ` K.R. Foley
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Daniel Walker @ 2006-12-23  0:38 UTC (permalink / raw)
  To: Chen, Tim C; +Cc: mingo, linux-kernel, Siddha, Suresh B

On Fri, 2006-12-22 at 13:39 -0800, Chen, Tim C wrote:

> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.

Did you compare PREEMPT_RT with vanilla PREEMPT ? Or one version of
PREEMPT_RT vs. another ?

Daniel



* Re: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-22 21:39 2.6.19-rt14 slowdown compared to 2.6.19 Chen, Tim C
  2006-12-23  0:38 ` Daniel Walker
@ 2006-12-23  2:43 ` K.R. Foley
  2006-12-23  3:00 ` K.R. Foley
  2006-12-23 10:59 ` Ingo Molnar
  3 siblings, 0 replies; 9+ messages in thread
From: K.R. Foley @ 2006-12-23  2:43 UTC (permalink / raw)
  To: Chen, Tim C; +Cc: mingo, linux-kernel, Siddha, Suresh B

Chen, Tim C wrote:
> Ingo,
>  
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19 
> kernel and noticed several slowdowns.  The test machine is a 2 socket
> woodcrest machine with your default configuration.
>  
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client 
> each bound to separate cpu cores on different socket, network
> loopback mode was used).  
> 
> Volanomark was slower by 80% (Server and Clients communicate with 
> network loopback mode. Idle time goes from 1% to 60%)
> 
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
> 
> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
>
> Thanks.
> 
> Tim

Take a look at this. Not sure if this is the same problem but it looks
like a candidate. I can definitely say that some latencies I have seen
in my recent testing have gone away in the last few versions
2.6.20-rc1-rt{3,4}.

-- 
	kr


* Re: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-22 21:39 2.6.19-rt14 slowdown compared to 2.6.19 Chen, Tim C
  2006-12-23  0:38 ` Daniel Walker
  2006-12-23  2:43 ` K.R. Foley
@ 2006-12-23  3:00 ` K.R. Foley
  2006-12-23 10:59 ` Ingo Molnar
  3 siblings, 0 replies; 9+ messages in thread
From: K.R. Foley @ 2006-12-23  3:00 UTC (permalink / raw)
  To: Chen, Tim C; +Cc: mingo, linux-kernel, Siddha, Suresh B

Chen, Tim C wrote:
> Ingo,
>  
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19 
> kernel and noticed several slowdowns.  The test machine is a 2 socket
> woodcrest machine with your default configuration.
>  
> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client 
> each bound to separate cpu cores on different socket, network
> loopback mode was used).  
> 
> Volanomark was slower by 80% (Server and Clients communicate with 
> network loopback mode. Idle time goes from 1% to 60%)
> 
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
> 
> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.
>
> Thanks.
> 
> Tim

Take a look at this. Not sure if this is the same problem but it looks
like a candidate. I can definitely say that some latencies I have seen
in my recent testing have gone away in the last few versions
2.6.20-rc1-rt{3,4}.

Sorry I missed the URL first time around.
http://marc.theaimsgroup.com/?l=linux-kernel&m=116670388527349&w=2

-- 
	kr



* Re: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-22 21:39 2.6.19-rt14 slowdown compared to 2.6.19 Chen, Tim C
                   ` (2 preceding siblings ...)
  2006-12-23  3:00 ` K.R. Foley
@ 2006-12-23 10:59 ` Ingo Molnar
  2006-12-29 23:52   ` Chen, Tim C
  3 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2006-12-23 10:59 UTC (permalink / raw)
  To: Chen, Tim C; +Cc: linux-kernel, Siddha, Suresh B, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 4472 bytes --]


* Chen, Tim C <tim.c.chen@intel.com> wrote:

> Ingo,
>  
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19 
> kernel and noticed several slowdowns.  The test machine is a 2 socket 
> woodcrest machine with your default configuration.

cool - thanks for the feedback! Running the 64-bit kernel, right?

> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client each 
> bound to separate cpu cores on different socket, network loopback mode 
> was used).
> 
> Volanomark was slower by 80% (Server and Clients communicate with 
> network loopback mode. Idle time goes from 1% to 60%)
> 
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
> 
> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.

while some slowdown is to be expected, did idle time increase 
significantly in each case? If yes then this is the effect of lock 
contention. Lock contention effects are 'magnified' by PREEMPT_RT. For 
example, if you run a 128-thread workload where all threads use the same 
lock, then the -rt kernel can act as if it were a 128-way box (!). This 
way, by running -rt you'll see scalability problems a lot sooner than on 
real hardware. In other words: PREEMPT_RT in essence simulates the 
scalability behavior of up to an infinite number of CPUs. (with the 
exception of cachemiss emulation ;) [the effect is not this precise, but 
that's the rough trend]
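
(a rough userspace illustration of why heavy contention on a sleeping 
lock shows up as idle time - this is not kernel code, and the thread and 
iteration counts are arbitrary:)

/*
 * contention-demo.c - userspace illustration only, not kernel code:
 * NTHREADS threads doing independent work keep all CPUs busy, but if
 * every iteration funnels through one shared sleeping lock they
 * serialize and CPU utilization drops well below NTHREADS cores,
 * i.e. the contention shows up as idle time.
 *
 * Build:  gcc -O2 -o contention-demo contention-demo.c -lpthread
 * Run:    time ./contention-demo 0    (independent work)
 *         time ./contention-demo 1    (everything under one lock)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define ITERS    5000000

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total;
static int use_lock;

static void *worker(void *arg)
{
	unsigned long local = 0;
	int i;

	for (i = 0; i < ITERS; i++) {
		if (use_lock) {
			/* every iteration takes the one shared lock */
			pthread_mutex_lock(&shared_lock);
			total++;
			pthread_mutex_unlock(&shared_lock);
		} else {
			/* independent work: no shared state in the hot loop */
			local++;
		}
	}
	/* fold the private count in once at the end */
	pthread_mutex_lock(&shared_lock);
	total += local;
	pthread_mutex_unlock(&shared_lock);
	return arg;
}

int main(int argc, char **argv)
{
	pthread_t threads[NTHREADS];
	int i;

	use_lock = (argc > 1) && atoi(argv[1]);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	printf("total=%lu use_lock=%d\n", total, use_lock);
	return 0;
}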

the good news: any fix for such problems in -rt is useful for the 
upstream kernel too, because it improves scalability on real hardware as 
well. The bad news: this work hasn't been finished yet ;-)

We've fixed a couple of contention points discovered by -rt this way, in 
particular Peter's excellent scalability patches at:

  http://programming.kicks-ass.net/kernel-patches/

were done based on lock-contention profiling feedback. I've included the 
lockless pagecache and the s_files scalability patches which were very 
useful already. The lru_cache_add one is the next one for me to try, 
plus on i686 -rt kernels highmem locking hurts, hence the 
highmem_rewrite.patch (not in -rt yet either). Plus there's the really 
exciting concurrent-pagecache patch-queue from Peter, which will be 
interesting to try too.

If you'd like to profile this yourself then the lowest-cost way of 
profiling lock contention on -rt is to use the yum kernel and run the 
attached trace-it-lock-prof.c code on the box while your workload is in 
'steady state' (and is showing those extended idle times):

  ./trace-it-lock-prof > trace.txt

this captures up to 1 second worth of system activity, on the current 
CPU. Then you can construct the histogram via:

  grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
                                  uniq -c | sort -n > prof.txt
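
(the two steps can also be wrapped into one small helper - a rough 
sketch only, assuming the attached trace-it-lock-prof binary sits in the 
current directory and tracing is enabled; the file names and the tail 
cutoff are arbitrary:)

/*
 * lock-prof.c - convenience wrapper, sketch only: run the capture step
 * and the histogram step in one go.  Assumes the attached
 * trace-it-lock-prof binary is in the current directory and the -rt
 * tracing interface is enabled; file names and the tail cutoff are
 * arbitrary.
 *
 * Build:  gcc -O2 -o lock-prof lock-prof.c
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* step 1: capture ~1 second of system activity on the current CPU */
	if (system("./trace-it-lock-prof > trace.txt")) {
		fprintf(stderr, "capture failed - tracing not enabled?\n");
		return 1;
	}

	/* step 2: fold the backtrace lines into a sorted histogram */
	if (system("grep -A 1 ' __schedule()<-' trace.txt | "
		   "cut -d: -f2- | sort | uniq -c | sort -n > prof.txt")) {
		fprintf(stderr, "post-processing failed\n");
		return 1;
	}

	/* show the most frequent contention/rescheduling points */
	system("tail -20 prof.txt");
	return 0;
}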

i just did this for "hackbench 20", and it gave me:

     38  kfree()<-skb_release_data()<-kfree_skbmem()<-(-1)()
     40  kmem_cache_free()<-kfree_skbmem()<-__kfree_skb()<-(-1)()
     47  __schedule()<-preempt_schedule_irq()<-retint_kernel()<-__spin_unlock_irqrestore()
    173  rt_read_lock()<-sock_def_readable()<-unix_stream_sendmsg()<-(-1)()
    173  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
    274  __schedule()<-schedule()<-posix_cpu_timers_thread()<-kthread()
    277  __schedule()<-schedule()<-ksoftirqd()<-kthread()
    365  kmem_cache_alloc_node()<-__alloc_skb()<-sock_alloc_send_skb()<-(-1)()
    366  kmem_cache_alloc_node()<-__kmalloc_node()<-__alloc_skb()<-(-1)()
    814  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-rt_spin_lock()
   1819  __schedule()<-schedule()<-int_careful()<-<ffffe405>()
   2346  __schedule()<-schedule()<-schedule_timeout()<-unix_stream_recvmsg()
   2346  sock_aio_read()<-do_sync_read()<-vfs_read()<-(-1)()

(note the output mixes the two lines of symbolic stack backtrace entries 
that are generated by the tracer, so the last two lines for example are 
for the same contention/preemption point.)

this means that the biggest contention points (or natural rescheduling 
points) in this workload are sock_aio_read()->unix_stream_recvmsg() - 
not unexpected. But there's also ~10% of contention in the kmalloc code, 
meaning that for this workload the per-CPU slabs are probably set a bit 
too low. There's also ~5% of contention from sock_def_readable()'s use 
of an rwlock.

with more CPUs and more tasks you'll see these effects in a more drastic 
way.

	Ingo

[-- Attachment #2: trace-it-lock-prof.c --]
[-- Type: text/plain, Size: 2530 bytes --]


/*
 * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 * 
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example of how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_WAKEUP_TIMING and
 * CONFIG_LATENCY_TRACE in the .config, rebuild the kernel and boot
 * into it. Note that the tracer can have significant runtime overhead,
 * so you don't want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}
	system("echo 1 > /proc/sys/kernel/trace_enabled");
	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_at_crash");
	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("echo 1 > /proc/sys/kernel/stackframe_tracing");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency");
	system("echo 0 > /proc/sys/kernel/preempt_thresh");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 1 > /proc/sys/kernel/wakeup_timing");
	system("echo 0 > /proc/sys/kernel/mcount_enabled");

	// start tracing
	if (prctl(0, 1)) {
		fprintf(stderr, "trace-it: couldn't start tracing!\n");
		return 1;
	}
	// let the workload run for ~1 second while the tracer records
	usleep(1000000);
	// stop tracing
	if (prctl(0, 0)) {
		fprintf(stderr, "trace-it: couldn't stop tracing!\n");
		return 1;
	}

	system("cat /proc/latency_trace");

	return 0;
}




* RE: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-23 10:59 ` Ingo Molnar
@ 2006-12-29 23:52   ` Chen, Tim C
  2006-12-30 10:04     ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Tim C @ 2006-12-29 23:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Siddha, Suresh B, Peter Zijlstra

Ingo Molnar wrote:
> 
> If you'd like to profile this yourself then the lowest-cost way of
> profiling lock contention on -rt is to use the yum kernel and run the
> attached trace-it-lock-prof.c code on the box while your workload is
> in 'steady state' (and is showing those extended idle times):
> 
>   ./trace-it-lock-prof > trace.txt
> 
> this captures up to 1 second worth of system activity, on the current
> CPU. Then you can construct the histogram via:
> 
>   grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
>                                   uniq -c | sort -n > prof.txt
> 

I did lock profiling on Volanomark as suggested and obtained the 
profile that is listed below. 

    246  __sched_text_start()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
    264  rt_mutex_slowunlock()<-rt_mutex_unlock()<-rt_up_read()<-(-1)()
    334  __sched_text_start()<-schedule()<-posix_cpu_timers_thread()<-kthread()
    437  __sched_text_start()<-schedule()<-do_futex()<-sys_futex()
    467  (-1)()<-(0)()<-(0)()<-(0)()
    495  __sched_text_start()<-preempt_schedule()<-__spin_unlock_irqrestore()<-rt_mutex_adjust_prio()
    497  __netif_rx_schedule()<-netif_rx()<-loopback_xmit()<-(-1)()
    499  __sched_text_start()<-schedule()<-schedule_timeout()<-sk_wait_data()
    500  tcp_recvmsg()<-sock_common_recvmsg()<-sock_recvmsg()<-(-1)()
    503  __rt_down_read()<-rt_down_read()<-do_futex()<-(-1)()
   1160  __sched_text_start()<-schedule()<-ksoftirqd()<-kthread()
   1433  __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
   1497  child_rip()<-(-1)()<-(0)()<-(0)()
   1936  __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()

Looks like the idle time I saw was due to lock contention during the call
to futex_wake, which requires acquisition of current->mm->mmap_sem.
Many of the java threads share the same mm, which results in concurrent
access to a common mm.
Looks like under the rt case there is no special treatment for read
locking, so the read-lock accesses contend with each other under
__rt_down_read.  For the non-rt case, __down_read makes the distinction
for read-lock access and the read lockings do not contend.
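
(Purely as a userspace illustration, not the kernel code: in the sketch 
below the reader threads run concurrently under a pthread rwlock, but 
serialize completely when the same path is forced through one exclusive 
lock, which is roughly what happens to the down_read() path here under 
-rt.  The thread count and hold time are arbitrary.)

/*
 * rdlock-vs-mutex.c - illustrative sketch only, not the kernel
 * implementation: NREADERS "reader" threads each hold a read lock for
 * a while.  With a pthread rwlock they all run concurrently (finishes
 * in ~1 second); forced through one exclusive lock they serialize
 * (finishes in ~NREADERS seconds) and the waiting readers sit idle,
 * which is the same shape as the idle time above.
 *
 * Build:  gcc -O2 -o rdlock-vs-mutex rdlock-vs-mutex.c -lpthread
 * Run:    time ./rdlock-vs-mutex 0    (rwlock readers, concurrent)
 *         time ./rdlock-vs-mutex 1    (exclusive lock, serialized)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NREADERS 4

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int exclusive;		/* 0: rwlock readers, 1: exclusive lock */

static void *reader(void *arg)
{
	long id = (long)arg;

	if (exclusive)
		pthread_mutex_lock(&mutex);
	else
		pthread_rwlock_rdlock(&rwlock);

	printf("reader %ld in critical section\n", id);
	sleep(1);	/* pretend to use the shared structure for a while */

	if (exclusive)
		pthread_mutex_unlock(&mutex);
	else
		pthread_rwlock_unlock(&rwlock);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t[NREADERS];
	long i;

	exclusive = (argc > 1) && atoi(argv[1]);
	for (i = 0; i < NREADERS; i++)
		pthread_create(&t[i], NULL, reader, (void *)i);
	for (i = 0; i < NREADERS; i++)
		pthread_join(t[i], NULL);
	return 0;
}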

Things are made worse here, as this delayed waking up the processes
blocked on the futex.
See also a snippet of the latency_trace below. 

  <idle>-0     2D..2 5821us!: thread_return <softirq--31> (150 20)
  <idle>-0     2DN.1 6278us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
  <idle>-0     2DN.1 6278us : (0)()<-(0)()<-(0)()<-(0)()
    java-6648  2D..2 6280us+: thread_return <<idle>-0> (20 -4)
    java-6648  2D..1 6296us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
    java-6648  2D..1 6296us : rt_mutex_unlock()<-rt_up_read()<-do_futex()<-(-1)()
    java-6648  2D..2 6297us : effective_prio <<...>-6673> (-4 -4)
    java-6648  2D..2 6297us : __activate_task <<...>-6673> (-4 1)
    java-6648  2.... 6297us < (-11)
    java-6648  2.... 6298us+> sys_futex (0000000000afaf50 0000000000000001 0000000000000001)
    java-6648  2...1 6315us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
    java-6648  2...1 6315us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
    java-6648  2D..2 6316us+: deactivate_task <java-6648> (-4 1)
  <idle>-0     2D..2 6318us+: thread_return <java-6648> (-4 20)
  <idle>-0     2DN.1 6327us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
  <idle>-0     2DN.1 6328us+: (0)()<-(0)()<-(0)()<-(0)()
    java-6629  2D..2 6330us+: thread_return <<idle>-0> (20 -4)
    java-6629  2D..1 6347us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
    java-6629  2D..1 6347us : rt_mutex_unlock()<-rt_up_read()<-futex_wake()<-(-1)()
    java-6629  2D..2 6348us : effective_prio <java-6235> (-4 -4)
    java-6629  2D..2 6349us : __activate_task <java-6235> (-4 1)
    java-6629  2.... 6350us+< (0)
    java-6629  2.... 6352us+> sys_futex (0000000000afc1dc 0000000000000001 0000000000000001)
    java-6629  2...1 6368us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
    java-6629  2...1 6368us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
    java-6629  2D..2 6369us+: deactivate_task <java-6629> (-4 1)
  <idle>-0     2D..2 6404us!: thread_return <java-6629> (-4 20)
  <idle>-0     2DN.1 6584us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()

Thanks.

Tim


* Re: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-29 23:52   ` Chen, Tim C
@ 2006-12-30 10:04     ` Ingo Molnar
  2007-01-02 18:07       ` Chen, Tim C
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2006-12-30 10:04 UTC (permalink / raw)
  To: Chen, Tim C; +Cc: linux-kernel, Siddha, Suresh B, Peter Zijlstra


* Chen, Tim C <tim.c.chen@intel.com> wrote:

> Ingo Molnar wrote:
> > 
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> > 
> >   ./trace-it-lock-prof > trace.txt
> > 
> > this captures up to 1 second worth of system activity, on the current
> > CPU. Then you can construct the histogram via:
> > 
> >   grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
> >                                   uniq -c | sort -n > prof.txt
> > 
> 
> I did lock profiling on Volanomark as suggested and obtained the 
> profile that is listed below.

thanks - this is really useful!

>     246  __sched_text_start()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
>     264  rt_mutex_slowunlock()<-rt_mutex_unlock()<-rt_up_read()<-(-1)()
>     334  __sched_text_start()<-schedule()<-posix_cpu_timers_thread()<-kthread()
>     437  __sched_text_start()<-schedule()<-do_futex()<-sys_futex()
>     467  (-1)()<-(0)()<-(0)()<-(0)()
>     495  __sched_text_start()<-preempt_schedule()<-__spin_unlock_irqrestore()<-rt_mutex_adjust_prio()
>     497  __netif_rx_schedule()<-netif_rx()<-loopback_xmit()<-(-1)()
>     499  __sched_text_start()<-schedule()<-schedule_timeout()<-sk_wait_data()
>     500  tcp_recvmsg()<-sock_common_recvmsg()<-sock_recvmsg()<-(-1)()
>     503  __rt_down_read()<-rt_down_read()<-do_futex()<-(-1)()
>    1160  __sched_text_start()<-schedule()<-ksoftirqd()<-kthread()
>    1433  __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
>    1497  child_rip()<-(-1)()<-(0)()<-(0)()
>    1936  __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
> 
> Looks like the idle time I saw was due to lock contention during call 
> to futex_wake, which requires acquisition of current->mm->mmap_sem. 
> Many of the java threads share mm and result in concurrent access to 
> common mm. [...]

ah. This explains why i'm not seeing this bad contention in a comparable 
workload (hackbench.c): because hackbench uses processes, not threads.

> [...] Looks like under rt case there is no special treatment to read 
> locking so the read lock accesses are contended under __rt_down_read.  
> For non rt case, __down_read makes the distinction for read lock 
> access and the read lockings do not contend.

yeah, makes sense. I'll do something about this.

> Things are made worse here as this delayed waking up processes locked 
> by the futex. See also a snippet of the latency_trace below.
> 
>   <idle>-0     2D..2 5821us!: thread_return <softirq--31> (150 20)
>   <idle>-0     2DN.1 6278us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
>   <idle>-0     2DN.1 6278us : (0)()<-(0)()<-(0)()<-(0)()
>     java-6648  2D..2 6280us+: thread_return <<idle>-0> (20 -4)
>     java-6648  2D..1 6296us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
>     java-6648  2D..1 6296us : rt_mutex_unlock()<-rt_up_read()<-do_futex()<-(-1)()
>     java-6648  2D..2 6297us : effective_prio <<...>-6673> (-4 -4)
>     java-6648  2D..2 6297us : __activate_task <<...>-6673> (-4 1)
>     java-6648  2.... 6297us < (-11)
>     java-6648  2.... 6298us+> sys_futex (0000000000afaf50 0000000000000001 0000000000000001)
>     java-6648  2...1 6315us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
>     java-6648  2...1 6315us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
>     java-6648  2D..2 6316us+: deactivate_task <java-6648> (-4 1)
>   <idle>-0     2D..2 6318us+: thread_return <java-6648> (-4 20)
>   <idle>-0     2DN.1 6327us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()
>   <idle>-0     2DN.1 6328us+: (0)()<-(0)()<-(0)()<-(0)()
>     java-6629  2D..2 6330us+: thread_return <<idle>-0> (20 -4)
>     java-6629  2D..1 6347us : try_to_wake_up()<-wake_up_process()<-wakeup_next_waiter()<-rt_mutex_slowunlock()
>     java-6629  2D..1 6347us : rt_mutex_unlock()<-rt_up_read()<-futex_wake()<-(-1)()
>     java-6629  2D..2 6348us : effective_prio <java-6235> (-4 -4)
>     java-6629  2D..2 6349us : __activate_task <java-6235> (-4 1)
>     java-6629  2.... 6350us+< (0)
>     java-6629  2.... 6352us+> sys_futex (0000000000afc1dc 0000000000000001 0000000000000001)
>     java-6629  2...1 6368us : __sched_text_start()<-schedule()<-rt_mutex_slowlock()<-rt_mutex_lock()
>     java-6629  2...1 6368us : __rt_down_read()<-rt_down_read()<-futex_wake()<-(-1)()
>     java-6629  2D..2 6369us+: deactivate_task <java-6629> (-4 1)
>   <idle>-0     2D..2 6404us!: thread_return <java-6629> (-4 20)
>   <idle>-0     2DN.1 6584us : __sched_text_start()<-cpu_idle()<-start_secondary()<-(-1)()

indeed - basically the mm semaphore is a common resource here. I suspect 
you'll see somewhat better numbers by using idle=poll or idle=mwait (or 
are you using those options already?).

(could you send me the whole trace if you still have it? It would be 
interesting to see a broader snippet from the life of individual java 
threads.)

	Ingo


* RE: 2.6.19-rt14 slowdown compared to 2.6.19
  2006-12-30 10:04     ` Ingo Molnar
@ 2007-01-02 18:07       ` Chen, Tim C
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Tim C @ 2007-01-02 18:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Siddha, Suresh B, Peter Zijlstra

Ingo Molnar wrote:
> 
> (could you send me the whole trace if you still have it? It would be
> interesting to see a broader snippet from the life of individual java
> threads.)
> 
> 	Ingo

Sure, I'll send it to you separately due to the size of the complete
trace.

Tim


* RE: 2.6.19-rt14 slowdown compared to 2.6.19
@ 2006-12-27  0:51 Chen, Tim C
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Tim C @ 2006-12-27  0:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Siddha, Suresh B, Peter Zijlstra

Ingo Molnar wrote:
> 
> cool - thanks for the feedback! Running the 64-bit kernel, right?
> 

Yes, 64-bit kernel was used.

> 
> while some slowdown is to be expected, did in each case idle time
> increase significantly? 

Volanomark and Re-Aim7 ran at close to 0% idle time with the 2.6.19
kernel.  Idle time increased significantly for Volanomark (to 60% idle)
and Re-Aim7 (to 20% idle) with the rt kernel.  For netperf, the system
was 60% idle for both the 2.6.19 and rt kernels, and the change in idle
time was not significant.

> If yes then this is the effect of lock
> contention. Lock contention effects are 'magnified' by PREEMPT_RT. For
> example if you run 128 threads workload that all use the same lock
> then 
> the -rt kernel can act as if it were a 128-way box (!). This way by
> running -rt you'll see scalability problems alot sooner than on real
> hardware. In other words: PREEMPT_RT in essence simulates the
> scalability behavior of up to an infinite amount of CPUs. (with the
> exception of cachemiss emulation ;) [the effect is not this precise,
> but 
> that's the rough trend]

Turning off PREEMPT_RT for the 2.6.20-rc2-rt0 kernel restored most of the
performance of Volanomark and Re-Aim7.  Idle time is close to 0%.  So the
benchmarks with a large number of threads are affected more by PREEMPT_RT.

For netperf TCP streaming, the slowdown improved from 40% to 20% relative
to the 2.6.20-rc2 kernel.  There is only one server and one client process
for netperf.  The underlying reason for the change in performance is
probably different.

> 
> If you'd like to profile this yourself then the lowest-cost way of
> profiling lock contention on -rt is to use the yum kernel and run the
> attached trace-it-lock-prof.c code on the box while your workload is
> in 'steady state' (and is showing those extended idle times):
> 
>   ./trace-it-lock-prof > trace.txt
> 

Thanks for the pointer.  Will let you know of any relevant traces.

Thanks.

Tim


Thread overview: 9+ messages
2006-12-22 21:39 2.6.19-rt14 slowdown compared to 2.6.19 Chen, Tim C
2006-12-23  0:38 ` Daniel Walker
2006-12-23  2:43 ` K.R. Foley
2006-12-23  3:00 ` K.R. Foley
2006-12-23 10:59 ` Ingo Molnar
2006-12-29 23:52   ` Chen, Tim C
2006-12-30 10:04     ` Ingo Molnar
2007-01-02 18:07       ` Chen, Tim C
2006-12-27  0:51 Chen, Tim C
