* 2.5.36-mm1 dbench 512 profiles
@ 2002-09-19 22:30 William Lee Irwin III
  2002-09-19 23:18 ` Hanna Linder
  0 siblings, 1 reply; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-19 22:30 UTC (permalink / raw)
  To: linux-kernel

Well, judging from some private responses, this appears to be of more
general interest than just linux-mm, so I'm reposting here.

I'll follow up with some suggested patches addressing a few of the
performance issues encountered.

From dbench 512 on a 32x NUMA-Q with 32GB of RAM running 2.5.36-mm1:

c01053a4 14040139 35.542      default_idle
c0114ab8 4436882  11.2318     load_balance
c015c5c6 4243413  10.742      .text.lock.dcache
c01317f4 2229431  5.64371     generic_file_write_nolock
c0130d10 2182906  5.52593     file_read_actor
c0114f30 2126191  5.38236     scheduler_tick
c0154b83 1905648  4.82407     .text.lock.namei
c011749d 1344623  3.40386     .text.lock.sched
c019f8ab 1102566  2.7911      .text.lock.dec_and_lock
c01066a8 612167   1.54968     .text.lock.semaphore
c015ba5c 440889   1.11609     d_lookup
c013f81c 314222   0.79544     blk_queue_bounce
c0111798 310317   0.785554    smp_apic_timer_interrupt
c013fac4 228103   0.577433    .text.lock.highmem
c01523b8 206811   0.523533    path_lookup
c0115274 164177   0.415607    do_schedule
c019f830 143365   0.362922    atomic_dec_and_lock
c0114628 136075   0.344468    try_to_wake_up
c01062dc 125245   0.317052    __down
c010d9d8 121864   0.308494    timer_interrupt
c015ae30 114653   0.290239    prune_dcache
c0144e00 102093   0.258444    generic_file_llseek
c015b714 83273    0.210802    d_instantiate

with akpm's removal of lock section directives:

c01053a4 31781009 38.3441     default_idle
c0114968 13184373 15.9071     load_balance
c0114de0 6545861  7.89765     scheduler_tick
c0151718 4514372  5.44664     path_lookup
c015ac4c 3314721  3.99924     d_lookup
c0130560 3153290  3.80448     file_read_actor
c0131044 2816477  3.39811     generic_file_write_nolock
c015a8e4 1980809  2.38987     d_instantiate
c019e1b0 1959187  2.36378     atomic_dec_and_lock
c0111668 1447604  1.74655     smp_apic_timer_interrupt
c0159fc0 1291884  1.55867     prune_dcache
c015a714 1089696  1.31473     d_alloc
c01062cc 1030194  1.24294     __down
c015b0dc 625279   0.754405    d_rehash
c013edac 554017   0.668427    blk_queue_bounce
c0115128 508229   0.613183    do_schedule
c01144c8 441818   0.533058    try_to_wake_up
c010d8f8 403607   0.486956    timer_interrupt
c01229a4 333023   0.401796    update_one_process
c015af70 322781   0.389439    d_delete
c01508a0 248442   0.299748    do_lookup
c01155f4 213738   0.257877    __wake_up
c013e63c 185472   0.223774    kmap_high



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 22:30 2.5.36-mm1 dbench 512 profiles William Lee Irwin III
@ 2002-09-19 23:18 ` Hanna Linder
  2002-09-19 23:38   ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Hanna Linder @ 2002-09-19 23:18 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel; +Cc: Hanna Linder, viro

--On Thursday, September 19, 2002 15:30:07 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:

> From dbench 512 on a 32x NUMA-Q with 32GB of RAM running 2.5.36-mm1:
> 
> c015c5c6 4243413  10.742      .text.lock.dcache
> c01317f4 2229431  5.64371     generic_file_write_nolock
> c0130d10 2182906  5.52593     file_read_actor
> c0114f30 2126191  5.38236     scheduler_tick
> c0154b83 1905648  4.82407     .text.lock.namei
> c011749d 1344623  3.40386     .text.lock.sched
> c019f8ab 1102566  2.7911      .text.lock.dec_and_lock
> c01066a8 612167   1.54968     .text.lock.semaphore
> c015ba5c 440889   1.11609     d_lookup

> 
> with akpm's removal of lock section directives:
> 
> c0114de0 6545861  7.89765     scheduler_tick
> c0151718 4514372  5.44664     path_lookup
> c015ac4c 3314721  3.99924     d_lookup
> c0130560 3153290  3.80448     file_read_actor
> c0131044 2816477  3.39811     generic_file_write_nolock
> c015a8e4 1980809  2.38987     d_instantiate
> c019e1b0 1959187  2.36378     atomic_dec_and_lock
> c0111668 1447604  1.74655     smp_apic_timer_interrupt
> c0159fc0 1291884  1.55867     prune_dcache
> c015a714 1089696  1.31473     d_alloc
> c01062cc 1030194  1.24294     __down

	So akpm's removal of lock section directives breaks down the
functions holding locks that previously were reported under the 
.text.lock.filename?  Looks like fastwalk might not behave so well
on this 32 cpu numa system...

Hanna



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 23:18 ` Hanna Linder
@ 2002-09-19 23:38   ` Andrew Morton
  2002-09-19 23:45     ` Hanna Linder
                       ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Andrew Morton @ 2002-09-19 23:38 UTC (permalink / raw)
  To: Hanna Linder; +Cc: William Lee Irwin III, linux-kernel, viro

Hanna Linder wrote:
> 
> ...
>         So akpm's removal of lock section directives breaks down the
> functions holding locks that previously were reported under the
> .text.lock.filename?

Yup.  It makes the profiler report the spinlock cost at the
actual callsite.  Patch below.
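
(For anyone wondering where the .text.lock.* symbols in these profiles come
from: LOCK_SECTION_START/END push the contended spin loop out of line into a
per-file lock section. Roughly, from memory of the 2.5-era <linux/spinlock.h>
-- possibly not byte-exact:)

#define LOCK_SECTION_NAME \
	".text.lock." __stringify(KBUILD_BASENAME)

#define LOCK_SECTION_START(extra) \
	".subsection 1\n\t" \
	extra \
	".ifndef " LOCK_SECTION_NAME "\n\t" \
	".def " LOCK_SECTION_NAME "; .endef\n\t" \
	LOCK_SECTION_NAME ":\n\t" \
	".endif\n\t"

#define LOCK_SECTION_END \
	".previous\n\t"

All of a file's out-of-line spin loops therefore land under a single
.text.lock.<basename> symbol and the profiler charges the spinning there;
removing the directives keeps the loop at the callsite, so the cost shows up
against the real function in the profile.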

> Looks like fastwalk might not behave so well
> on this 32 cpu numa system...

I've rather lost the plot.  Have any of the dcache speedup
patches been merged into 2.5?

It would be interesting to know the context switch rate
during this test, and to see what things look like with HZ=100.



--- 2.5.24/include/asm-i386/spinlock.h~spinlock-inline	Fri Jun 21 13:12:01 2002
+++ 2.5.24-akpm/include/asm-i386/spinlock.h	Fri Jun 21 13:18:12 2002
@@ -46,13 +46,13 @@ typedef struct {
 	"\n1:\t" \
 	"lock ; decb %0\n\t" \
 	"js 2f\n" \
-	LOCK_SECTION_START("") \
+	"jmp 3f\n" \
 	"2:\t" \
 	"cmpb $0,%0\n\t" \
 	"rep;nop\n\t" \
 	"jle 2b\n\t" \
 	"jmp 1b\n" \
-	LOCK_SECTION_END
+	"3:\t" \
 
 /*
  * This works. Despite all the confusion.
--- 2.5.24/include/asm-i386/rwlock.h~spinlock-inline	Fri Jun 21 13:18:33 2002
+++ 2.5.24-akpm/include/asm-i386/rwlock.h	Fri Jun 21 13:22:09 2002
@@ -22,25 +22,19 @@
 
 #define __build_read_lock_ptr(rw, helper)   \
 	asm volatile(LOCK "subl $1,(%0)\n\t" \
-		     "js 2f\n" \
-		     "1:\n" \
-		     LOCK_SECTION_START("") \
-		     "2:\tcall " helper "\n\t" \
-		     "jmp 1b\n" \
-		     LOCK_SECTION_END \
+		     "jns 1f\n\t" \
+		     "call " helper "\n\t" \
+		     "1:\t" \
 		     ::"a" (rw) : "memory")
 
 #define __build_read_lock_const(rw, helper)   \
 	asm volatile(LOCK "subl $1,%0\n\t" \
-		     "js 2f\n" \
-		     "1:\n" \
-		     LOCK_SECTION_START("") \
-		     "2:\tpushl %%eax\n\t" \
+		     "jns 1f\n\t" \
+		     "pushl %%eax\n\t" \
 		     "leal %0,%%eax\n\t" \
 		     "call " helper "\n\t" \
 		     "popl %%eax\n\t" \
-		     "jmp 1b\n" \
-		     LOCK_SECTION_END \
+		     "1:\t" \
 		     :"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_read_lock(rw, helper)	do { \
@@ -52,25 +46,19 @@
 
 #define __build_write_lock_ptr(rw, helper) \
 	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
-		     "jnz 2f\n" \
+		     "jz 1f\n\t" \
+		     "call " helper "\n\t" \
 		     "1:\n" \
-		     LOCK_SECTION_START("") \
-		     "2:\tcall " helper "\n\t" \
-		     "jmp 1b\n" \
-		     LOCK_SECTION_END \
 		     ::"a" (rw) : "memory")
 
 #define __build_write_lock_const(rw, helper) \
 	asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
-		     "jnz 2f\n" \
-		     "1:\n" \
-		     LOCK_SECTION_START("") \
-		     "2:\tpushl %%eax\n\t" \
+		     "jz 1f\n\t" \
+		     "pushl %%eax\n\t" \
 		     "leal %0,%%eax\n\t" \
 		     "call " helper "\n\t" \
 		     "popl %%eax\n\t" \
-		     "jmp 1b\n" \
-		     LOCK_SECTION_END \
+		     "1:\n" \
 		     :"=m" (*(volatile int *)rw) : : "memory")
 
 #define __build_write_lock(rw, helper)	do { \



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 23:38   ` Andrew Morton
@ 2002-09-19 23:45     ` Hanna Linder
  2002-09-20  0:08     ` William Lee Irwin III
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Hanna Linder @ 2002-09-19 23:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, linux-kernel, viro, Hanna Linder


--On Thursday, September 19, 2002 16:38:14 -0700 Andrew Morton <akpm@digeo.com> wrote:

> Hanna Linder wrote:
>> 
>> ...
>>         So akpm's removal of lock section directives breaks down the
>> functions holding locks that previously were reported under the
>> .text.lock.filename?
> 
> Yup.  It makes the profiler report the spinlock cost at the
> actual callsite.  Patch below.

	Thanks. We've needed that for quite some time.
> 
>> Looks like fastwalk might not behave so well
>> on this 32 cpu numa system...
> 
> I've rather lost the plot.  Have any of the dcache speedup
> patches been merged into 2.5?

	Yes, starting with 2.5.11. Al Viro made some changes to
it and it went in. Haven't heard anything about it since...

Hanna



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 23:38   ` Andrew Morton
  2002-09-19 23:45     ` Hanna Linder
@ 2002-09-20  0:08     ` William Lee Irwin III
  2002-09-20  4:02       ` William Lee Irwin III
  2002-09-20  7:59       ` Maneesh Soni
  2002-09-20  5:14     ` William Lee Irwin III
  2002-09-20  6:59     ` William Lee Irwin III
  3 siblings, 2 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20  0:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hanna Linder, linux-kernel, viro

Hanna Linder wrote:
>> Looks like fastwalk might not behave so well
>> on this 32 cpu numa system...

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> I've rather lost the plot.  Have any of the dcache speedup
> patches been merged into 2.5?

As far as the dcache goes, I'll stick to observing and reporting.
I'll rerun with dcache patches applied, though.


On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

The context switch rate was 60 or 70 cs/sec. during the steady
state of the test, and around 10K cs/sec for ramp-up and ramp-down.
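
(One easy way to watch that rate, assuming the usual procps tools -- a
sketch, not necessarily how these numbers were collected:)

	vmstat 1	# the "cs" column is context switches per second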

I've already prepared a kernel with a lowered HZ, but stopped briefly to
debug a calibrate_delay() oops and chat with folks around the workplace.


Thanks,
Bill


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20  0:08     ` William Lee Irwin III
@ 2002-09-20  4:02       ` William Lee Irwin III
  2002-09-20  7:59       ` Maneesh Soni
  1 sibling, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20  4:02 UTC (permalink / raw)
  To: Andrew Morton, Hanna Linder, linux-kernel, viro

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>> It would be interesting to know the context switch rate
>> during this test, and to see what things look like with HZ=100.

On Thu, Sep 19, 2002 at 05:08:15PM -0700, William Lee Irwin III wrote:
> The context switch rate was 60 or 70 cs/sec. during the steady
> state of the test, and around 10K cs/sec for ramp-up and ramp-down.
> I've already prepared a kernel with a lowered HZ, but stopped briefly to
> debug a calibrate_delay() oops and chat with folks around the workplace.

Okay, figured that one out (cf. the x86_udelay_tsc thread). I'll grind out
another one in about 90-120 minutes or thereabouts with HZ == 100. I'm
going to take a wild guess that param.h should have an #ifdef there for
NR_CPUS >= WLI_SAW_EXCESS_TIMER_INTS_HERE or something. It's probably
possible to work out what the service-time vs. arrival-rate math says,
but it's too easy to fix to be worth analyzing, and we don't exist to
process timer ticks anyway. Hrm, yet another cry for i386 subarches?
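
To make that concrete, a purely hypothetical sketch -- the config symbol and
threshold are illustrative, not a real proposal:

	/* include/asm-i386/param.h -- hypothetical illustration only */
	#if defined(CONFIG_X86_NUMAQ) && (NR_CPUS >= 32)
	#define HZ 100
	#else
	#define HZ 1000
	#endif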

Hanna, did you have a particular dcache patch in mind? ISTR there were
several flavors. (I can of course sift through them myself as well.)

Cheers,
Bill


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 23:38   ` Andrew Morton
  2002-09-19 23:45     ` Hanna Linder
  2002-09-20  0:08     ` William Lee Irwin III
@ 2002-09-20  5:14     ` William Lee Irwin III
  2002-09-20  6:59     ` William Lee Irwin III
  3 siblings, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20  5:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hanna Linder, linux-kernel, viro

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

Ow! Throughput went down by something like 25% just from bumping HZ down
to 100 (or so it appears).

One of the odd things here is that the profiler is definitely not sampling
times when the test is not running. So default_idle is some kind of genuine
scheduling artifact, though I'm not entirely sure when the machine is
sitting idle. It may just be the most heavily sampled spot in the kernel
despite ~98% non-idle system time, with HZ == 1000 not being the issue.
Not sure.


Cheers,
Bill

Out-of-line lock version:

c01053a4 20974286 42.2901     default_idle
c015c586 5759482  11.6127     .text.lock.dcache
c0154b43 5747223  11.588      .text.lock.namei
c01317e4 4653534  9.38284     generic_file_write_nolock
c0130d00 1861383  3.75308     file_read_actor
c0114a98 1230049  2.48013     load_balance
c019f6bb 866076   1.74625     .text.lock.dec_and_lock
c01066a8 796042   1.60505     .text.lock.semaphore
c013f7fc 749976   1.51216     blk_queue_bounce
c019f640 497122   1.00234     atomic_dec_and_lock
c0114f10 321897   0.649036    scheduler_tick
c0152378 262290   0.528851    path_lookup
c0144dc0 223189   0.450012    generic_file_llseek
c015adf0 207285   0.417945    prune_dcache
c011748d 185193   0.373402    .text.lock.sched
c0115258 184852   0.372714    do_schedule
c0114628 171719   0.346234    try_to_wake_up
c013676c 170725   0.34423     .text.lock.slab
c013faa4 143586   0.28951     .text.lock.highmem
c01062dc 142571   0.287464    __down
c015ba1c 140737   0.283766    d_lookup
c014675c 130760   0.263649    __fput
c0152a30 126688   0.255439    open_namei



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-19 23:38   ` Andrew Morton
                       ` (2 preceding siblings ...)
  2002-09-20  5:14     ` William Lee Irwin III
@ 2002-09-20  6:59     ` William Lee Irwin III
  3 siblings, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20  6:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hanna Linder, linux-kernel, viro

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

There is no obvious time when the machine appears idle, but regardless:
HZ == 100 with idle=poll. The numbers are still down, and I'm not 100%
sure why, so I'm backing out HZ = 100 and trying something else.

c01053dc 296114302 79.5237     poll_idle
c01053a4 20974286 5.6328      default_idle
c015c586 11098020 2.98046     .text.lock.dcache
c0154b43 10046044 2.69794     .text.lock.namei
c01317e4 9712304  2.60831     generic_file_write_nolock
c0130d00 3133118  0.841423    file_read_actor
c0114a98 2300611  0.617847    load_balance
c013f7fc 1615733  0.433917    blk_queue_bounce
c019f6bb 1591350  0.427369    .text.lock.dec_and_lock
c01066a8 1554624  0.417506    .text.lock.semaphore
c019f640 1000989  0.268823    atomic_dec_and_lock
c0114f10 851084   0.228565    scheduler_tick
c0152378 639577   0.171763    path_lookup
c0144dc0 456594   0.122622    generic_file_llseek
c0114628 407284   0.109379    try_to_wake_up
c0115258 399893   0.107394    do_schedule
c015adf0 370906   0.0996096   prune_dcache
c011748d 366470   0.0984183   .text.lock.sched
c015b6d4 306140   0.0822162   d_instantiate
c013faa4 292987   0.0786839   .text.lock.highmem
c013676c 291664   0.0783286   .text.lock.slab
c01062dc 282983   0.0759972   __down
c014675c 281106   0.0754932   __fput


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20  0:08     ` William Lee Irwin III
  2002-09-20  4:02       ` William Lee Irwin III
@ 2002-09-20  7:59       ` Maneesh Soni
  2002-09-20  8:06         ` William Lee Irwin III
  2002-09-20 14:34         ` Dave Hansen
  1 sibling, 2 replies; 24+ messages in thread
From: Maneesh Soni @ 2002-09-20  7:59 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, linux-kernel, viro

On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:

> Hanna Linder wrote:
>>> Looks like fastwalk might not behave so well on this 32 cpu numa
>>> system...
> 
> On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>> I've rather lost the plot.  Have any of the dcache speedup patches been
>> merged into 2.5?
> 
> As far as the dcache goes, I'll stick to observing and reporting. I'll
> rerun with dcache patches applied, though.
> 
..
> Thanks,
> Bill

For a 32-way system fastwalk will perform badly from dcache_lock point of 
view, basically due to increased lock hold time. dcache_rcu-12 should reduce
dcache_lock contention and hold time. The patch uses RCU infrastructure patch and
read_barrier_depends patch. The patches are available in Read-Copy-Update
section on lse site at

http://sourceforge.net/projects/lse
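
(The core idea, sketched in roughly the shape of today's RCU list API rather
than the exact dcache_rcu-12 code; the helper names, locking details and
struct fields below are simplified assumptions, not the real patch.)

	struct dentry *d_lookup_rcu_sketch(struct dentry *parent, struct qstr *name)
	{
		struct hlist_head *head = d_hash_bucket(parent, name->hash); /* assumed helper */
		struct dentry *dentry;

		rcu_read_lock();			/* read side: no dcache_lock at all */
		hlist_for_each_entry_rcu(dentry, head, d_hash) {
			if (dentry->d_name.hash != name->hash ||
			    dentry->d_parent != parent ||
			    dentry->d_name.len != name->len ||
			    memcmp(dentry->d_name.name, name->name, name->len))
				continue;
			spin_lock(&dentry->d_lock);	/* pin only this dentry */
			dentry->d_count++;		/* simplified refcounting */
			spin_unlock(&dentry->d_lock);
			rcu_read_unlock();
			return dentry;
		}
		rcu_read_unlock();
		return NULL;
	}

Updaters still serialize (and free dentries via call_rcu()), so readers never
see a dentry that has actually been freed; the win is that the hot lookup
path above no longer touches the global lock's cacheline at all.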

Regards
Maneesh

-- 
Maneesh Soni
IBM Linux Technology Center, 
IBM India Software Lab, Bangalore.
Phone: +91-80-5044999 email: maneesh@in.ibm.com
http://lse.sourceforge.net/


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20  7:59       ` Maneesh Soni
@ 2002-09-20  8:06         ` William Lee Irwin III
  2002-09-20 12:03           ` William Lee Irwin III
  2002-09-20 14:34         ` Dave Hansen
  1 sibling, 1 reply; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20  8:06 UTC (permalink / raw)
  To: Maneesh Soni; +Cc: Andrew Morton, linux-kernel, viro

On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:
>> As far as the dcache goes, I'll stick to observing and reporting. I'll
>> rerun with dcache patches applied, though.

On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
> For a 32-way system fastwalk will perform badly from dcache_lock
> point of view, basically due to increased lock hold time.
> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
> patch uses RCU infrastructure patch and read_barrier_depends patch.
> The patches are available in Read-Copy-Update section on lse site at
> http://sourceforge.net/projects/lse

ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
me. I'm doing some runs with this to see if it fixes the problem.


Cheers,
Bill


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20  8:06         ` William Lee Irwin III
@ 2002-09-20 12:03           ` William Lee Irwin III
  2002-09-20 18:51             ` Hanna Linder
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20 12:03 UTC (permalink / raw)
  To: Maneesh Soni, Andrew Morton, linux-kernel, viro

On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
>> For a 32-way system fastwalk will perform badly from dcache_lock
>> point of view, basically due to increased lock hold time.
>> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
>> patch uses RCU infrastructure patch and read_barrier_depends patch.
>> The patches are available in Read-Copy-Update section on lse site at
>> http://sourceforge.net/projects/lse

On Fri, Sep 20, 2002 at 01:06:28AM -0700, William Lee Irwin III wrote:
> ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
> me. I'm doing some runs with this to see if it fixes the problem.

AFAICT, with one bottleneck out of the way, a new one merely arises to
take its place. Ugly. OTOH the qualitative difference is striking. The
interactive responsiveness of the machine, even when entirely unloaded,
is drastically improved, along with such nice things as init scripts
and kernel compiles also markedly faster. I suspect this is just the
wrong benchmark to show throughput benefits with.

Also notable is that the system time was significantly reduced though
I didn't log it. Essentially a long period of 100% system time is
entered after a certain point in the benchmark, during which there are
few (around 60 or 70) context switches in a second, and the duration
of this period was shortened.

The results here contradict my prior conclusions wrt. HZ 100 vs. 1000.

IMHO this worked, and the stuff around generic_file_write_nolock(),
file_read_actor(), whatever is hammering semaphore.c, and reducing
blk_queue_bounce() traffic are the next issues to address. Any ideas?
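
(On the blk_queue_bounce() side: if the controllers can in fact DMA to
highmem pages, the standard fix is for the driver to raise its bounce limit
so no bounce buffers get allocated -- a sketch, assuming the 2.5 block-layer
API I remember:)

	/* in the driver's queue setup code; q is the device's request_queue */
	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);	/* device can DMA anywhere */

Whether these particular controllers can actually do that is the real question.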


dcache_rcu, HZ == 1000:
Throughput 36.5059 MB/sec (NB=45.6324 MB/sec  365.059 MBit/sec)  512 procs
---------------------------------------------------------------------------
c01053dc 320521015 90.6236     poll_idle
c0114ab8 13559139 3.83369     load_balance
c0114f30 3146028  0.889502    scheduler_tick
c011751d 2702819  0.76419     .text.lock.sched
c0131110 2534516  0.716605    file_read_actor
c0131bf4 1307874  0.369786    generic_file_write_nolock
c0111798 1243507  0.351587    smp_apic_timer_interrupt
c01066a8 1108969  0.313548    .text.lock.semaphore
c013fc0c 772807   0.218502    blk_queue_bounce
c01152e4 559869   0.158296    do_schedule
c0114628 323975   0.0916001   try_to_wake_up
c010d9d8 304144   0.0859931   timer_interrupt
c01062dc 271440   0.0767465   __down
c013feb4 240824   0.0680902   .text.lock.highmem
c01450b0 224874   0.0635805   generic_file_llseek
c019f55b 214729   0.0607121   .text.lock.dec_and_lock
c0136b7c 208790   0.0590329   .text.lock.slab
c0122ef4 185013   0.0523103   update_one_process
c0146dee 135391   0.0382802   .text.lock.file_table
c01472dc 127782   0.0361289   __find_get_block_slow
c015ba4c 122577   0.0346572   d_lookup
c0173cd0 115446   0.032641    ext2_new_block
c0132958 114472   0.0323656   generic_file_write


dcache_rcu, HZ == 100:
Throughput 39.1471 MB/sec (NB=48.9339 MB/sec  391.471 MBit/sec)  512 procs
--------------------------------------------------------------------------
c01053dc 331775731 95.9799     poll_idle
c0131be4 3310063  0.957573    generic_file_write_nolock
c0131100 1552058  0.448997    file_read_actor
c0114a98 1491802  0.431565    load_balance
c01066a8 1048138  0.303217    .text.lock.semaphore
c013fbec 570986   0.165181    blk_queue_bounce
c0114f10 532451   0.154033    scheduler_tick
c01152c8 311667   0.0901626   do_schedule
c013fe94 239497   0.0692844   .text.lock.highmem
c0114628 222569   0.0643873   try_to_wake_up
c019f36b 220632   0.0638269   .text.lock.dec_and_lock
c01062dc 191477   0.0553926   __down
c0136b6c 164682   0.0476411   .text.lock.slab
c011750d 160221   0.0463506   .text.lock.sched
c014729c 123385   0.0356942   __find_get_block_slow
c0173b00 120967   0.0349947   ext2_new_block
c01387f0 111699   0.0323136   __free_pages_ok
c0146dae 104794   0.030316    .text.lock.file_table
c019f2f0 102715   0.0297146   atomic_dec_and_lock
c0145070 96505    0.0279181   generic_file_llseek
c01367c4 95436    0.0276088   s_show
c0138b24 91321    0.0264184   rmqueue
c01523a8 87421    0.0252901   path_lookup

mm1, HZ == 1000:
Throughput 36.3452 MB/sec (NB=45.4315 MB/sec  363.452 MBit/sec)  512 procs
--------------------------------------------------------------------------
c01053dc 291824934 78.5936     poll_idle
c0114ab8 15361229 4.13705     load_balance
c01053a4 14040139 3.78126     default_idle
c015c5c6 7489522  2.01706     .text.lock.dcache
c01317f4 5707336  1.53709     generic_file_write_nolock
c0114f30 5425740  1.46125     scheduler_tick
c0130d10 5397721  1.4537      file_read_actor
c0154b83 3917278  1.05499     .text.lock.namei
c011749d 3508427  0.944882    .text.lock.sched
c019f8ab 2415903  0.650646    .text.lock.dec_and_lock
c01066a8 1615952  0.435205    .text.lock.semaphore
c0111798 1461670  0.393654    smp_apic_timer_interrupt
c013f81c 1330609  0.358357    blk_queue_bounce
c015ba5c 780847   0.210296    d_lookup
c013fac4 578235   0.155729    .text.lock.highmem
c0115274 542453   0.146092    do_schedule
c0114628 441528   0.118911    try_to_wake_up
c010d9d8 437417   0.117804    timer_interrupt
c01523b8 399484   0.107588    path_lookup
c01062dc 362925   0.0977422   __down
c019f830 275515   0.0742011   atomic_dec_and_lock
c0122e94 271817   0.0732051   update_one_process
c0144e00 260097   0.0700487   generic_file_llseek

mm1, HZ == 100:
Throughput 39.0368 MB/sec (NB=48.796 MB/sec  390.368 MBit/sec)  512 procs
-------------------------------------------------------------------------
c01053dc 572091962 84.309      poll_idle
c01053a4 20974286 3.09097     default_idle
c015c586 17014849 2.50747     .text.lock.dcache
c0154b43 16074116 2.36884     .text.lock.namei
c01317e4 14653053 2.15942     generic_file_write_nolock
c0130d00 5295158  0.780346    file_read_actor
c0114a98 3437483  0.506581    load_balance
c019f6bb 2455126  0.361811    .text.lock.dec_and_lock
c013f7fc 2428344  0.357864    blk_queue_bounce
c01066a8 2379650  0.350688    .text.lock.semaphore
c019f640 1525996  0.224886    atomic_dec_and_lock
c0114f10 1328712  0.195812    scheduler_tick
c0152378 923439   0.136087    path_lookup
c0144dc0 692727   0.102087    generic_file_llseek
c0115258 599269   0.0883141   do_schedule
c0114628 593380   0.0874462   try_to_wake_up
c011748d 574637   0.0846841   .text.lock.sched
c015adf0 516917   0.0761779   prune_dcache
c013676c 496571   0.0731795   .text.lock.slab
c013faa4 471971   0.0695542   .text.lock.highmem
c015b6d4 444406   0.065492    d_instantiate
c01062dc 436983   0.064398    __down
c014675c 420142   0.0619162   __fput



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20  7:59       ` Maneesh Soni
  2002-09-20  8:06         ` William Lee Irwin III
@ 2002-09-20 14:34         ` Dave Hansen
  2002-09-20 16:07           ` Martin J. Bligh
  2002-09-20 17:40           ` Dipankar Sarma
  1 sibling, 2 replies; 24+ messages in thread
From: Dave Hansen @ 2002-09-20 14:34 UTC (permalink / raw)
  To: maneesh
  Cc: William Lee Irwin III, Andrew Morton, linux-kernel, viro, Hanna Linder

Maneesh Soni wrote:
> On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:
> 
>>Hanna Linder wrote:
>>
>>>>Looks like fastwalk might not behave so well on this 32 cpu numa
>>>>system...
>>
>>On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>>
>>>I've rather lost the plot.  Have any of the dcache speedup patches been
>>>merged into 2.5?
>>
>>As far as the dcache goes, I'll stick to observing and reporting. I'll
>>rerun with dcache patches applied, though.
>>
> For a 32-way system fastwalk will perform badly from dcache_lock point of 
> view, basically due to increased lock hold time. dcache_rcu-12 should reduce
> dcache_lock contention and hold time.

Isn't increased hold time _good_ on NUMA-Q?  I thought that the really 
costly operation was bouncing the lock around the interconnect, not 
holding it.  Has fastwalk ever been tested on NUMA-Q?

Remember when John Stultz tried MCS (fair) locks on NUMA-Q?  They 
sucked because low hold times, which result from fairness, aren't 
efficient.  It is actually faster to somewhat starve remote CPUs.

In any case, we all know often acquired global locks are a bad idea on 
a 32-way, and should be avoided like the plague.  I just wish we had a 
dcache solution that didn't even need locks as much... :)

-- 
Dave Hansen
haveblue@us.ibm.com



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 14:34         ` Dave Hansen
@ 2002-09-20 16:07           ` Martin J. Bligh
  2002-09-20 17:48             ` Dipankar Sarma
  2002-09-20 17:40           ` Dipankar Sarma
  1 sibling, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-09-20 16:07 UTC (permalink / raw)
  To: Dave Hansen, maneesh
  Cc: William Lee Irwin III, Andrew Morton, linux-kernel, viro, Hanna Linder

>> For a 32-way system fastwalk will perform badly from dcache_lock point of 
>> view, basically due to increased lock hold time. dcache_rcu-12 should reduce
>> dcache_lock contention and hold time.
> 
> Isn't increased hold time _good_ on NUMA-Q?  I thought that the 
> really costly operation was bouncing the lock around the interconnect, 
> not holding it. 

Depends what you get in return. The object of fastwalk was to stop the
cacheline bouncing on all the individual dentry counters, at the cost
of increased dcache_lock hold times. It's a tradeoff ... and in this
instance it wins. In general, long lock hold times are bad.

> Has fastwalk ever been tested on NUMA-Q?

Yes, in 2.4. Gave good results, I forget exactly what ... something
like 5-10% off kernel compile times.

> Remember when John Stultz tried MCS (fair) locks on NUMA-Q?  They
> sucked because low hold times, which result from fairness, aren't 
> efficient.  It is actually faster to somewhat starve remote CPUs.

Nothing to do with low hold times - it's to do with bouncing the 
lock between nodes.

> In any case, we all know often acquired global locks are a bad idea 
> on a 32-way, and should be avoided like the plague.  I just wish we 
> had a dcache solution that didn't even need locks as much... :)

Well, avoiding data corruption is a preferable goal too. The point of
RCU is not to have to take a lock for the common read case. I'd expect
good results from it on the NUMA machines - never been benchmarked, as
far as I recall.

M.



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 14:34         ` Dave Hansen
  2002-09-20 16:07           ` Martin J. Bligh
@ 2002-09-20 17:40           ` Dipankar Sarma
  2002-09-20 20:28             ` Dipankar Sarma
  1 sibling, 1 reply; 24+ messages in thread
From: Dipankar Sarma @ 2002-09-20 17:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: maneesh, William Lee Irwin III, Andrew Morton, linux-kernel,
	viro, Hanna Linder

On Fri, Sep 20, 2002 at 02:37:41PM +0000, Dave Hansen wrote:
> Isn't increased hold time _good_ on NUMA-Q?  I thought that the really 
> costly operation was bouncing the lock around the interconnect, not 

Increased hold time isn't necessarily good. If you acquire the lock
often, your lock wait time will increase correspondingly. The ultimate
goal should be to decrease the total number of acquisitions.
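
(Rough intuition, assuming something like an M/M/1 model of the lock: with
utilisation rho = acquisition_rate * mean_hold_time, the mean wait per
acquisition scales like rho/(1 - rho) * mean_hold_time, so cutting the number
of acquisitions attacks rho directly, whereas shortening each hold while
acquiring more often can leave rho -- and the waiting -- unchanged.)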

> holding it.  Has fastwalk ever been tested on NUMA-Q?

Fastwalk is in 2.5. You can see wli's profile numbers for dbench 512
earlier in this thread.

> 
> Remember when John Stultz tried MCS (fair) locks on NUMA-Q?  They 
> sucked because low hold times, which result from fairness, aren't 
> efficient.  It is actually faster to somewhat starve remote CPUs.

One workaround is to keep scheduling the lock within the CPUs of
a node as much as possible and release it to a different node
only if there isn't any CPU available in the current node. Anyway
these are not real solutions, just band-aids.

> 
> In any case, we all know often acquired global locks are a bad idea on 
> a 32-way, and should be avoided like the plague.  I just wish we had a 
> dcache solution that didn't even need locks as much... :)

You have one - dcache_rcu. It reduces the dcache_lock acquisition
by about 65% over fastwalk.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 16:07           ` Martin J. Bligh
@ 2002-09-20 17:48             ` Dipankar Sarma
  0 siblings, 0 replies; 24+ messages in thread
From: Dipankar Sarma @ 2002-09-20 17:48 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Dave Hansen, maneesh, William Lee Irwin III, Andrew Morton,
	linux-kernel, viro, Hanna Linder

On Fri, Sep 20, 2002 at 04:17:10PM +0000, Martin J. Bligh wrote:
> > Isn't increased hold time _good_ on NUMA-Q?  I thought that the 
> > really costly operation was bouncing the lock around the interconnect, 
> > not holding it. 
> 
> Depends what you get in return. The object of fastwalk was to stop the
> cacheline bouncing on all the individual dentry counters, at the cost
> of increased dcache_lock hold times. It's a tradeoff ... and in this
> instance it wins. In general, long lock hold times are bad.

I don't think the individual dentry counters are as much of a problem as
the acquisition of dcache_lock for every path component lookup, as done
by the earlier path-walking algorithm. The big deal with fastwalk
is that it decreases the number of acquisitions of dcache_lock
for a webserver workload by 70% on an 8-CPU machine. That avoids
a lot of potential cacheline bouncing of dcache_lock.


> > In any case, we all know often acquired global locks are a bad idea 
> > on a 32-way, and should be avoided like the plague.  I just wish we 
> > had a dcache solution that didn't even need locks as much... :)
> 
> Well, avoiding data corruption is a preferable goal too. The point of
> RCU is not to have to take a lock for the common read case. I'd expect
> good results from it on the NUMA machines - never been benchmarked, as
> far as I recall.

You can see that in wli's dbench 512 results on his NUMA box.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 12:03           ` William Lee Irwin III
@ 2002-09-20 18:51             ` Hanna Linder
  2002-09-20 20:32               ` Hanna Linder
  2002-09-20 20:39               ` William Lee Irwin III
  2002-09-20 21:30             ` Martin J. Bligh
  2002-09-21  7:52             ` William Lee Irwin III
  2 siblings, 2 replies; 24+ messages in thread
From: Hanna Linder @ 2002-09-20 18:51 UTC (permalink / raw)
  To: William Lee Irwin III, Maneesh Soni, Andrew Morton, linux-kernel, viro
  Cc: Hanna Linder

--On Friday, September 20, 2002 05:03:58 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
>>> For a 32-way system fastwalk will perform badly from dcache_lock
>>> point of view, basically due to increased lock hold time.
>>> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
>>> patch uses RCU infrastructure patch and read_barrier_depends patch.
>>> The patches are available in Read-Copy-Update section on lse site at
>>> http://sourceforge.net/projects/lse
> 
> On Fri, Sep 20, 2002 at 01:06:28AM -0700, William Lee Irwin III wrote:
>> ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
>> me. I'm doing some runs with this to see if it fixes the problem.

	I mentioned it at OLS too. It was the point of my talk. Next
	time I will request a non 10am time slot!

> take its place. Ugly. OTOH the qualitative difference is striking. The
> interactive responsiveness of the machine, even when entirely unloaded,
> is drastically improved, along with such nice things as init scripts
> and kernel compiles also markedly faster. I suspect this is just the
> wrong benchmark to show throughput benefits with.
> 
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.

	Bill, you are saying that replacing fastwalk with dcache_rcu significantly
	improved system response time, among other things?

	Perhaps it is time to reconsider replacing fastwalk with dcache_rcu. 

	Viro? What are your objections?

Thanks.

Hanna



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 17:40           ` Dipankar Sarma
@ 2002-09-20 20:28             ` Dipankar Sarma
  0 siblings, 0 replies; 24+ messages in thread
From: Dipankar Sarma @ 2002-09-20 20:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: maneesh, William Lee Irwin III, Andrew Morton, linux-kernel,
	viro, Hanna Linder

On Fri, Sep 20, 2002 at 11:10:20PM +0530, Dipankar Sarma wrote:
> > 
> > In any case, we all know often acquired global locks are a bad idea on 
> > a 32-way, and should be avoided like the plague.  I just wish we had a 
> > dcache solution that didn't even need locks as much... :)
> 
> You have one - dcache_rcu. It reduces the dcache_lock acquisition
> by about 65% over fastwalk.

I should clarify, this was with a webserver benchmark.

For those who want to use them, Maneesh's dcache_rcu-12 patch and my
RCU "performance" infrastructure patches are in -

http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=111743

The latest release is 2.5.36-mm1.
rcu_ltimer and read_barrier_depends are pre-requisites for dcache_rcu.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 18:51             ` Hanna Linder
@ 2002-09-20 20:32               ` Hanna Linder
  2002-09-20 20:54                 ` Dipankar Sarma
  2002-09-20 20:39               ` William Lee Irwin III
  1 sibling, 1 reply; 24+ messages in thread
From: Hanna Linder @ 2002-09-20 20:32 UTC (permalink / raw)
  To: William Lee Irwin III, Maneesh Soni, Andrew Morton, linux-kernel, viro
  Cc: Hanna Linder

--On Friday, September 20, 2002 11:51:13 -0700 Hanna Linder <hannal@us.ibm.com> wrote:

> 
> 	Perhaps it is time to reconsider replacing fastwalk with dcache_rcu. 

These patches were written by Maneesh Soni. Since the Read-Copy Update
infrastructure has not been accepted into the mainline kernel yet (although
there have been murmurings that it is acceptable), you will need to apply
the infrastructure patches first. Here they are; apply them in this order
(a quick apply sketch follows the links). They are too big to post as
inline text. All are against 2.5.36-mm1.


http://prdownloads.sourceforge.net/lse/rcu_ltimer-2.5.36-mm1

http://prdownloads.sourceforge.net/lse/read_barrier_depends-2.5.36-mm1

http://prdownloads.sourceforge.net/lse/dcache_rcu-12-2.5.36-mm1
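
(Assuming the downloads are plain patch files, applying them would look
something like:)

	cd linux-2.5.36-mm1
	patch -p1 < rcu_ltimer-2.5.36-mm1
	patch -p1 < read_barrier_depends-2.5.36-mm1
	patch -p1 < dcache_rcu-12-2.5.36-mm1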

There has been quite a bit of testing done on this and it has proven
quite stable. If anyone wants to do any additional testing that would
be great.

Thanks.

Hanna



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 18:51             ` Hanna Linder
  2002-09-20 20:32               ` Hanna Linder
@ 2002-09-20 20:39               ` William Lee Irwin III
  1 sibling, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20 20:39 UTC (permalink / raw)
  To: Hanna Linder; +Cc: Maneesh Soni, Andrew Morton, linux-kernel, viro

On Fri, Sep 20, 2002 at 11:51:13AM -0700, Hanna Linder wrote:
> 	I mentioned it at OLS too. It was the point of my talk. Next
> 	time I will request a non 10am time slot!

10AM is relatively early in the morning for me. =)


On Friday, September 20, 2002 05:03:58 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
>> take its place. Ugly. OTOH the qualitative difference is striking. The
>> interactive responsiveness of the machine, even when entirely unloaded,
>> is drastically improved, along with such nice things as init scripts
>> and kernel compiles also markedly faster. I suspect this is just the
>> wrong benchmark to show throughput benefits with.
>> Also notable is that the system time was significantly reduced though
>> I didn't log it. Essentially a long period of 100% system time is
>> entered after a certain point in the benchmark, during which there are
>> few (around 60 or 70) context switches in a second, and the duration
>> of this period was shortened.

On Fri, Sep 20, 2002 at 11:51:13AM -0700, Hanna Linder wrote:
> 	Bill, you are saying that replacing fastwalk with dcache_rcu significantly
> 	improved system response time, among other things? 
> 	Perhaps it is time to reconsider replacing fastwalk with dcache_rcu. 
> 	Viro? What are your objections?

Basically, the big ones get laggy, and laggier with more CPUs. This fixed
a decent amount of that.

Another thing to note is that the max bandwidth of these disks is 40MB/s,
so we're running pretty close to peak anyway. I need to get an FC
cable or something before we'll see larger bandwidth gains.


Cheers,
Bill


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 20:32               ` Hanna Linder
@ 2002-09-20 20:54                 ` Dipankar Sarma
  0 siblings, 0 replies; 24+ messages in thread
From: Dipankar Sarma @ 2002-09-20 20:54 UTC (permalink / raw)
  To: Hanna Linder
  Cc: William Lee Irwin III, Maneesh Soni, Andrew Morton, linux-kernel, viro

On Fri, Sep 20, 2002 at 08:32:48PM +0000, Hanna Linder wrote:
> --On Friday, September 20, 2002 11:51:13 -0700 Hanna Linder <hannal@us.ibm.com> wrote:
> 
> > 
> > 	Perhaps it is time to reconsider replacing fastwalk with dcache_rcu. 
> 
> These patches were written by Maneesh Soni. Since the Read-Copy Update
> infrastructure has not been accepted into the mainline kernel yet (although
> there were murmurings of it being acceptable) you will need to apply
> those first. Here they are, apply in this order. Too big to post
> inline text though. These are provided against 2.5.36-mm1.
> 
> 
> http://prdownloads.sourceforge.net/lse/rcu_ltimer-2.5.36-mm1
> 
> http://prdownloads.sourceforge.net/lse/read_barrier_depends-2.5.36-mm1
> 
> http://prdownloads.sourceforge.net/lse/dcache_rcu-12-2.5.36-mm1
> 
> There has been quite a bit of testing done on this and it has proven
> quite stable. If anyone wants to do any additional testing that would
> be great.

Thanks for the vote of confidence :)

For some results (out of date, but they also include numbers from code
backported from 2.5), see http://lse.sf.net/locking/dcache/dcache.html.

Preliminary profiling of webserver benchmarks on 2.5.3X shows similar potential
for dcache_rcu. I will have actual results published when we can
get formal runs done.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 12:03           ` William Lee Irwin III
  2002-09-20 18:51             ` Hanna Linder
@ 2002-09-20 21:30             ` Martin J. Bligh
  2002-09-20 23:11               ` William Lee Irwin III
  2002-09-21  7:52             ` William Lee Irwin III
  2 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-09-20 21:30 UTC (permalink / raw)
  To: William Lee Irwin III, Maneesh Soni, Andrew Morton, linux-kernel, viro

> AFAICT, with one bottleneck out of the way, a new one merely arises to
> take its place. Ugly. OTOH the qualitative difference is striking. The
> interactive responsiveness of the machine, even when entirely unloaded,
> is drastically improved, along with such nice things as init scripts
> and kernel compiles also markedly faster. I suspect this is just the
> wrong benchmark to show throughput benefits with.
> 
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.
> 
> The results here contradict my prior conclusions wrt. HZ 100 vs. 1000.

Hmmm ... I think you need the NUMA aware scheduler ;-) 
On the plus side, that does look like RCU pretty much obliterated the dcache
problems ....

M.




* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 21:30             ` Martin J. Bligh
@ 2002-09-20 23:11               ` William Lee Irwin III
  2002-09-20 23:22                 ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-20 23:11 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Maneesh Soni, Andrew Morton, linux-kernel, viro

At some point in the past, I wrote:
>> AFAICT, with one bottleneck out of the way, a new one merely arises to
>> take its place. Ugly. OTOH the qualitative difference is striking. The
>> interactive responsiveness of the machine, even when entirely unloaded,
>> is drastically improved, along with such nice things as init scripts
>> and kernel compiles also markedly faster. I suspect this is just the
>> wrong benchmark to show throughput benefits with.

On Fri, Sep 20, 2002 at 02:30:23PM -0700, Martin J. Bligh wrote:
> Hmmm ... I think you need the NUMA aware scheduler ;-) 
> On the plus side, that does look like RCU pretty much obliterated the dcache
> problems ....

This sounds like a likely solution to the expense of load_balance().
Do you have a patch for it floating around?


Thanks,
Bill


* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 23:11               ` William Lee Irwin III
@ 2002-09-20 23:22                 ` Martin J. Bligh
  0 siblings, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-09-20 23:22 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Maneesh Soni, Andrew Morton, linux-kernel, viro

>>> AFAICT, with one bottleneck out of the way, a new one merely arises to
>>> take its place. Ugly. OTOH the qualitative difference is striking. The
>>> interactive responsiveness of the machine, even when entirely unloaded,
>>> is drastically improved, along with such nice things as init scripts
>>> and kernel compiles also markedly faster. I suspect this is just the
>>> wrong benchmark to show throughput benefits with.
> 
> On Fri, Sep 20, 2002 at 02:30:23PM -0700, Martin J. Bligh wrote:
>> Hmmm ... I think you need the NUMA aware scheduler ;-) 
>> On the plus side, that does look like RCU pretty much obliterated the dcache
>> problems ....
> 
> This sounds like a likely solution to the expense of load_balance().
> Do you have a patch for it floating around?

I have a really old hacky one from Mike Kravetz, and Michael Hohnbaum
is working on something new, but I don't think that's ready yet ....
I think Mike's will need some rework. Will send it to you ...

M.



* Re: 2.5.36-mm1 dbench 512 profiles
  2002-09-20 12:03           ` William Lee Irwin III
  2002-09-20 18:51             ` Hanna Linder
  2002-09-20 21:30             ` Martin J. Bligh
@ 2002-09-21  7:52             ` William Lee Irwin III
  2 siblings, 0 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-09-21  7:52 UTC (permalink / raw)
  To: Maneesh Soni, Andrew Morton, linux-kernel, viro

On Fri, Sep 20, 2002 at 05:03:58AM -0700, William Lee Irwin III wrote:
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.

A radical difference is present in 2.5.37: the long period of 100%
system time is instead a long period of idle time.

I don't have an oprofile run against 2.5.37 yet, but I'll report back when I do.


Cheers,
Bill

