* Re: Performance differences in recent kernels
@ 2002-09-12  3:45 rwhron
  2002-09-12 11:41 ` Hans Reiser
  0 siblings, 1 reply; 8+ messages in thread
From: rwhron @ 2002-09-12  3:45 UTC (permalink / raw)
  To: reiser; +Cc: linux-kernel

> Can you test on equal partitions too?

Tonight I updated the scripts so all filesystem types 
will use the same partition.
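
Roughly, the setup step now looks like this (a sketch only -- the
device, mountpoint, and run_tests driver below are placeholders,
not the actual script):

  DEV=/dev/sdb1      # same partition reused for every filesystem type
  MNT=/mnt/bench
  for fs in ext2 ext3 reiserfs; do
      umount $MNT 2>/dev/null
      case $fs in
          reiserfs) echo y | mkreiserfs $DEV ;;  # mkreiserfs prompts for confirmation
          *)        mkfs -t $fs $DEV ;;
      esac
      mount -t $fs $DEV $MNT && run_tests $fs
  done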

> We need to get Chris's patches into the tree

> Can we ask you to test again with these patches applied?

> ftp://ftp.suse.com/pub/people/mason/patches/data-logging

If you have patches against a current tree, I can apply
them before testing a kernel.  I'll start paying more
attention to reiserfs-list.  It's probably best to test
patches when the -rc series starts.  

> AIM is a proprietary benchmark, yes?

It's GPL.  I'm only running aim7 on ext2 at the moment.

> If we send you a copy of reiser4
> next month, would you be willing to give it a run?

Will there be a choice of mounting reiserfs
or reiser4 (like ext2 or ext3), or will it
be a complete departure?

-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html



* Re: Performance differences in recent kernels
  2002-09-12  3:45 Performance differences in recent kernels rwhron
@ 2002-09-12 11:41 ` Hans Reiser
  0 siblings, 0 replies; 8+ messages in thread
From: Hans Reiser @ 2002-09-12 11:41 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:

>Will there be a choice of mounting reiserfs
>or reiser4 (like ext2 or ext3), or will it
>be a complete departure?

reiser4 is a completely different filesystem type for mount.  It is 
possible someone may sponsor making reiser4 plugins that understand the 
reiser3 disk format, but I cannot ask DARPA to pay for that, because 
their mission is research, not code maintenance.

ReiserFS will become an old, stable filesystem, unchanging except in 
response to VFS changes.  This is a good thing.  Users need something 
that just works.  reiser3 is done.  It works.  After 2.4.21 ships it 
will be time to stop disturbing the users with changes to it.

reiser4, on the other hand, will be the focal point of efforts to counter 
Microsoft's OFS.  Performance will increase every month or so, new 
plugins will appear, and you'll have things like inheritance, encryption, 
compression, and eventually keyword indexing... but we'll increment the 
major version number when keyword indexing goes in.

The nice thing about reiser4 is that the disk format is so plugin-based 
that we can accommodate all future changes by just adding more plugins 
for it to understand.  Hmmm.  Maybe I should not put the 4 in the 
filesystem type name.  I'll have to think about that.



* Re: Performance differences in recent kernels
@ 2002-09-12  3:11 rwhron
  0 siblings, 0 replies; 8+ messages in thread
From: rwhron @ 2002-09-12  3:11 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

>> AIM7 database workload
>> kernel                   Tasks   Jobs/Min       Real    CPU
>> 2.5.33-mm5               256    763.0           1992.9 1020.8

> I assume that's seconds of CPU for the entire run?

Yes.

>> IRMAN - interactive response measurement.
>>
>>                    FILE_IO Response time measurements (milliseconds)
>>                            Max         Min         Avg       StdDev
>> 2.4.20-pre4-ac1          40.603       0.008       0.009       0.043
>> 2.4.20-pre5              52.405       0.009       0.011       0.080
>> 2.5.33-mm5                2.955       0.008       0.010       0.004

> For many things, 1/latency == throughput.  But the averages are
> the same here.  Be interesting to run it on the 384 megabyte machine.

This is from the 384 MB machine.  It doesn't show the really low
max response time on 2.5.33-mm5.

                      FILE_IO Response time measurements (milliseconds)
                              Max         Min         Avg       StdDev
2.4.19-rmap14              190.061       0.016       0.137       4.017
2.4.20-pre5                180.719       0.016       0.113       3.708
2.4.20-pre5-ac1            552.150       0.011       0.035       3.160
2.4.20-pre5aa1             450.191       0.012       0.030       2.772
2.5.33-mm1                 456.736       0.012       0.033       2.979
2.5.33-mm5                 456.733       0.016       0.047       3.633
2.5.33                     456.061       0.012       0.034       3.077

> It would be interesting to run irman in conjunction with tiobench
> or dbench.  On the same disk and on a different disk.

Hm, I'll think about ways to do that and keep the test repeatable.
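One repeatable way might be to pin a fixed background load to the
irman run, e.g. (a sketch; I'm assuming the irman binary runs
standalone and writes its measurements to stdout):

  dbench 64 > dbench.log &       # background load, same or different disk
  LOAD=$!
  irman > irman-under-load.log   # the interactive-response measurement
  kill $LOAD 2>/dev/null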

>> Time to build the kernel 12 times.  Not a lot of difference here.
>>
>> kernel                           seconds
>> 2.5.33                             728.2
>> 2.5.33-mm5                         736.8

> Should have been better than that.  Although the kmap work doesn't
> seem to affect these machines.  Is that a `make -j1', `-j6'???

That's make -j1 executed simultaneously on 4 different kernel
trees.
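I.e. roughly (tree paths are placeholders; three passes of four
concurrent -j1 builds gives the twelve, assuming the trees are
already configured):

  for pass in 1 2 3; do
      for t in linux-a linux-b linux-c linux-d; do
          ( cd /usr/src/$t && make -j1 bzImage > build.log 2>&1 ) &
      done
      wait
  done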

> It is useful to watch the amount of CPU which is consumed with dbench,
> btw.  That's one thing which tends to be fairly repeatable between runs.

Like "time dbench 64", or something else?

>> Sequential Writes ext2
>> There is a dramatic reduction in cpu utilization in 2.5.33-mm5 and increase in
>> throughput compared to 2.5.33 when thread count is high.
>>
>>                    Num                    Avg       Maximum      Lat%   CPU
>> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
>> ------------------ ---  ---------------------------------------------------
>> 2.4.19-rc5-aa1     128   37.40 45.99%    32.405    46333.30   0.00105    81
>> 2.4.20-pre4-ac1    128   34.01 36.94%    40.121    47331.57   0.00058    92
>> 2.4.20-pre5        128   32.98 49.33%    39.692    52093.19   0.01446    67
>> 2.5.33             128   12.17 222.9%   108.966   910455.61   0.19503     5
>> 2.5.33-mm5         128   30.78 30.03%    32.973   909931.81   0.07858   102

> This test is highly dependent upon the size of the request queues.  The
> queues have 128 slots and you're running 128 threads.  One would expect
> to see a lot of variability with that combo.  Would be interesting to also
> test 64 threads, and 256.

> -aa has the "merge even after latency has expired" logic in the elevator,
> which could make some difference at the 128 thread level.

On 2.5.33-mm1, the throughput and CPU are pretty comparable at 64, 128 and
256 threads.  CPU utilization on -aa is very different at 64 and 256
threads compared to 128.

Sequential Writes ext2
                    Num                    Avg       Maximum       Lat%    CPU
Kernel              Thr   Rate  (CPU%)   Latency     Latency       >10s    Eff
------------------- ---  -----------------------------------------------------
2.4.19-rc1-aa1       64   35.62 92.35%    19.063    23691.67    0.00000     39
2.4.20-pre4-ac1      64   34.17 37.34%    20.282    23123.27    0.00000     92
2.4.20-pre5          64   33.19 48.62%    20.235    28646.79    0.00003     68
2.5.33               64   14.06 191.6%    47.371   275718.71    0.09892      7
2.5.33-mm5           64   30.69 29.55%    19.323   277839.26    0.07670    104

2.4.19-rc1-aa1      256   35.29 91.34%    72.996    83127.21    0.16864     39
2.4.20-pre4-ac1     256   34.88 38.32%    76.128    91906.08    0.48809     91
2.4.20-pre5         256   33.20 49.30%    76.394    95761.85    0.50046     67
2.5.33              256   12.65 252.6%   197.722  1334605.09    0.31655      5
2.5.33-mm5          256   28.37 29.96%    67.484   699119.03    0.10001     95

>> Sequential Reads ext3
>> 2.5.33-mm5 has a more graceful degradation in throughput on ext3.
>> Fairness is better too.
>>
>>                    Num                    Avg       Maximum      Lat%   CPU
>> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
>> ------------------ ---  ---------------------------------------------------
>> 2.4.19-rc5-aa1       1   51.13 29.59%     0.227      460.92   0.00000   173
>> 2.4.20-pre4-ac1      1   34.12 17.37%     0.341     1019.65   0.00000   196
>> 2.4.20-pre5          1   33.28 20.62%     0.350      137.44   0.00000   161
>> 2.5.33-mm5           1   31.70 14.75%     0.367      581.89   0.00000   215

> 20% drop in CPU load is typical, but the reduced bandwidth on such
> a simple test is unexpected.  Presumably that's the driver thing.

>> 2.4.19-rc5-aa1      64    7.38  4.51%    98.947    20638.56   0.00000   164
>> 2.4.20-pre4-ac1     64    6.55  3.94%   110.432    14937.49   0.00000   166
>> 2.4.20-pre5         64    6.34  4.16%   111.299    14234.83   0.00000   152
>> 2.5.33-mm5          64   12.29  8.51%    55.372     8799.99   0.00000   144

> hm.  Don't know.  Something went right in the 2.5 elevator I guess.

It went right at 32 threads too.  128 and 256 are better than average 
on 2.5.33-mm5.

Sequential Reads ext3
                  Num                    Avg       Maximum       Lat%    CPU
Kernel            Thr   Rate  (CPU%)   Latency     Latency      >10s    Eff
----------------- ---  -----------------------------------------------------
2.4.19-rc1-aa1     32    7.19  5.13%    51.480    14981.42    0.00000    140
2.4.20-pre4-ac1    32    6.70  3.96%    54.502     6613.54    0.00000    169
2.4.20-pre5        32    7.25  4.72%    49.839     7344.23    0.00000    153
2.5.33-mm5         32   15.26  9.89%    23.999    10942.51    0.00000    154

2.4.19-rc1-aa1    128    6.70  4.86%   220.082    56700.25    0.91508    138
2.4.20-pre4-ac1   128    6.17  3.71%   227.375    26637.67    0.00000    166
2.4.20-pre5       128    6.41  4.21%   212.999    32363.01    0.00009    152
2.5.33-mm5        128    9.63  7.89%   128.288    16677.89    0.00000    122

2.4.19-rc1-aa1    256    6.51  4.70%   450.763   109737.74    2.23919    139
2.4.20-pre4-ac1   256    5.69  3.41%   475.816    54311.24    0.00703    167
2.4.20-pre5       256    5.74  3.90%   470.792    55091.26    0.05414    147
2.5.33-mm5        256    7.39  6.92%   305.286    46425.82    0.08529    107




-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html



* Re: Performance differences in recent kernels
  2002-09-11 14:50   ` Randy.Dunlap
@ 2002-09-11 16:57     ` Hans Reiser
  0 siblings, 0 replies; 8+ messages in thread
From: Hans Reiser @ 2002-09-11 16:57 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: rwhron, linux-kernel, mason, akpm

Randy.Dunlap wrote:

>On Wed, 11 Sep 2002, Hans Reiser wrote:
>
>| AIM is a proprietary benchmark, yes?  If we send you a copy of reiser4
>| next month, would you be willing to give it a run?
>
>No, it's now GPL and available at
>  http://caldera.com/developers/community/contrib/aim.html
Thanks much to Caldera for doing this.  I have wanted to try the 
benchmark for years, but it was too expensive for us.

We'll use it for debugging and benchmarking reiser4 also.

Hans



* Re: Performance differences in recent kernels
  2002-09-11 10:03 ` Hans Reiser
@ 2002-09-11 14:50   ` Randy.Dunlap
  2002-09-11 16:57     ` Hans Reiser
  0 siblings, 1 reply; 8+ messages in thread
From: Randy.Dunlap @ 2002-09-11 14:50 UTC (permalink / raw)
  To: Hans Reiser; +Cc: rwhron, linux-kernel, mason, akpm

On Wed, 11 Sep 2002, Hans Reiser wrote:

| AIM is a proprietary benchmark, yes?  If we send you a copy of reiser4
| next month, would you be willing to give it a run?

No, it's now GPL and available at
  http://caldera.com/developers/community/contrib/aim.html

-- 
~Randy
"Linux is not a research project. Never was, never will be."
  -- Linus, 2002-09-02



* Re: Performance differences in recent kernels
  2002-09-11  3:54 rwhron
  2002-09-11  5:59 ` Andrew Morton
@ 2002-09-11 10:03 ` Hans Reiser
  2002-09-11 14:50   ` Randy.Dunlap
  1 sibling, 1 reply; 8+ messages in thread
From: Hans Reiser @ 2002-09-11 10:03 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel, mason

We need to get Chris's patches into the tree, as they improve the write 
performance for reiserfs a lot.  (Chris!  Send them in! ;-) )

Can we ask you to test again with these patches applied?

ftp://ftp.suse.com/pub/people/mason/patches/data-logging

Can you test on equal partitions too?

AIM is a proprietary benchmark, yes?  If we send you a copy of reiser4 
next month, would you be willing to give it a run?

Hans


* Re: Performance differences in recent kernels
  2002-09-11  3:54 rwhron
@ 2002-09-11  5:59 ` Andrew Morton
  2002-09-11 10:03 ` Hans Reiser
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2002-09-11  5:59 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:
> 
> Just to note a few differences in recent benchmarks on quad xeon
> with 3.75 gb ram and qlogic 2200 -> raid 5 array.

Awesome.  Thanks.

> For AIM7, the outstanding metrics are jobs/min (high is good),
> and cpu time (in seconds).  The tasks column is equivalent to
> load average.
> 
> AIM7 database workload

I'll have to get my hands on this.
 
> Andrea's tree has the v6.0 qlogic driver which helps i/o a lot.
> It's the only tree with that driver atm.  The other trees look
> pretty similar at load averages of 32 and 256.
> 
> kernel                   Tasks   Jobs/Min       Real    CPU
> 2.4.19-rc5-aa1           32     555.4           342.2   146.1
> 2.4.20-pre5              32     470.7           403.8   147.2
> 2.4.20-pre4-ac1          32     472.0           402.7   142.4
> 2.5.33-mm5               32     474.4           400.7   144.2
> 
> 2.4.19-rc5-aa1           256    905.2           1679.9  931.9
> 2.4.20-pre5              256    769.1           1977.0 1048.5
> 2.4.20-pre4-ac1          256    766.4           1984.2  945.5
> 2.5.33-mm5               256    763.0           1992.9 1020.8

I assume that's seconds of CPU for the entire run?
 
> AIM7 file server workload
> 
> Interesting here to note that with low load averages,
> 2.5.33-mm5 is on top, but as load average increases, -aa is
> ahead.
> 
> kernel                   Tasks   Jobs/Min        Real    CPU
> 2.4.19-rc5-aa1           4      131.6           184.2   45.5
> 2.4.20-pre5              4      132.7           182.7   44.1
> 2.4.20-pre4-ac1          4      132.7           182.6   46.0
> 2.5.33-mm5               4      140.4           172.6   37.7
> 
> 2.4.19-rc5-aa1           32     264.8           732.3   219.1
> 2.4.20-pre5              32     230.5           841.5   265.7
> 2.4.20-pre4-ac1          32     227.7           851.6   257.6
> 2.5.33-mm5               32     229.8           843.7   224.7
> 
> AIM7 shared multiuser workload
> 
> This is more cpu intensive than the other aim7 workloads.
> 2.5.33-mm5 is using a lot more cpu time.  That may be a bug in
> the workload.  I'm investigating that.
> 
> kernel                   Tasks   Jobs/Min        Real    CPU
> 2.4.19-rc5-aa1           64     2319.6          160.6   163.8
> 2.4.20-pre4-ac1          64     1960.4          190.0   164.8
> 2.4.20-pre5              64     1980.3          188.1   185.1
> 2.5.33-mm5               64     1461.2          254.9   566.2
> 
> 2.4.19-rc5-aa1           256    2835.5          525.5   652.6
> 2.4.20-pre4-ac1          256    2444.2          609.6   656.6
> 2.4.20-pre5              256    2432.8          612.4   701.0
> 2.5.33-mm5               256    1890.5          788.1  2316.4

I think there may be a couple of spots where I'm spinning out of control
in there. Badari has reported a big regression in writeout speed to
sixty(!) disks.  Or possibly block-highmem got broken again; don't
know yet.
 
> IRMAN - interactive response measurement.
> 2.5.33-mm5 has much lower max response time for file io.
> The standard deviation is very low too (which is good).
> 
>                    FILE_IO Response time measurements (milliseconds)
>                            Max         Min         Avg       StdDev
> 2.4.20-pre4-ac1          40.603       0.008       0.009       0.043
> 2.4.20-pre5              52.405       0.009       0.011       0.080
> 2.5.33-mm5                2.955       0.008       0.010       0.004

For many things, 1/latency == throughput.  But the averages are
the same here.  Be interesting to run it on the 384 megabyte machine.

irman's io test doesn't seem to be very good, actually.  It creates
a megabyte of pagecache and seeks around it.  That's an uncommon
thing to do.

It would be interesting to run irman in conjunction with tiobench
or dbench.  On the same disk and on a different disk.

> autoconf-2.53 build (12 times) creates about 1.2 million processes.
> It's a good fork test.  rmap slows this one down.  There is a healthy
> difference between the rmap in 2.5.33-mm5 and 2.4.20-pre4-ac1.
> 
> kernel                           seconds (smaller is better)
> 2.4.20-pre4-ac1                    856.4
> 2.4.19-rc5-aa1                     727.2
> 2.4.20-pre5                        718.4
> 2.5.33                             799.2
> 2.5.33-mm5                         782.0

Yup.  We've pulled back half of the rmap CPU cost, and half of its
space consumption.
 
> Time to build the kernel 12 times.  Not a lot of difference here.
> 
> kernel                           seconds
> 2.4.19-rc5-aa1                     718.8
> 2.4.20-pre4-ac1                    735.8
> 2.4.20-pre5                        728.1
> 2.5.33                             728.2
> 2.5.33-mm5                         736.8

Should have been better than that.  Although the kmap work doesn't
seem to affect these machines.  Is that a `make -j1', `-j6'???

> The Open Source database benchmark doesn't vary much between trees.
> 
> dbench on various filesystems.   This isn't meant to compare
> filesystems because the disk geometry is different for each fs.
> 
> rmap has generally not done well on dbench when the process
> count is high, but 2.5.33* on ext2 and ext3 really smokes at
> 64 processes.
>
> dbench ext2 64 processes                Average (5 runs)
> 2.4.19-rc5-aa1                          179.61  MB/second
> 2.4.20-pre4-ac1                         140.63
> 2.4.20-pre5                             145.00
> 2.5.33                                  220.54
> 2.5.33-mm5                              214.78

At 64 processes you're pretty much CPU-bound.  The difference is due
to the buffer-layer rework.  -mm5 would have done more IO here.  The
lower dirty memory threshold costs 15% here.

> dbench ext2 192 processes               Average
> 2.4.19-rc5-aa1                          155.44
> 2.4.20-pre4-ac1                          79.16
> 2.4.20-pre5                             115.31
> 2.5.33                                  134.27
> 2.5.33-mm5                              174.17

-mm5 starts writeback much earlier than 2.5.33 and does more IO.  
It is faster because the VM does not wait on individual dirty pages.

> dbench ext3 64 processes                Average
> 2.4.19-rc5-aa1                           97.69
> 2.4.20-pre4-ac1                          59.42
> 2.4.20-pre5                              80.79
> 2.5.33-mm5                              112.20
> 
> dbench ext3 192 processes               Average
> 2.4.19-rc5-aa1                           77.06
> 2.4.20-pre4-ac1                          28.48
> 2.4.20-pre5                              58.66
> 2.5.33-mm5                               72.92
> 
> dbench reiserfs 64 processes            Average
> 2.4.19-rc5-aa1                           70.50
> 2.4.20-pre4-ac1                          57.30
> 2.4.20-pre5                              62.60
> 2.5.33-mm5                               77.22
> 
> dbench reiserfs 192 processes           Average
> 2.4.19-rc5-aa1                           55.37
> 2.4.20-pre4-ac1                          20.56
> 2.4.20-pre5                              44.14
> 2.5.33-mm5                               49.61

Odd.  I've been steadily winding down the dirty memory thresholds.
dbench 192 should have suffered more.

It is useful to watch the amount of CPU which is consumed with dbench,
btw.  That's one thing which tends to be fairly repeatable between runs.
 
> The O(1) scheduler helps tbench a lot when the process
> count is high.  The ac tree may not have the latest
> scheduler updates.
> 
> tbench 192 processes            Average
> 2.4.19-rc5-aa1                  116.76
> 2.4.20-pre4-ac1                 100.30
> 2.4.20-pre5                      27.98
> 2.5.33                          115.93
> 2.5.33-mm5                      117.91
> 
> LMbench latency running /bin/sh had a big regression in the
> -mm tree recently.
> 
>                       fork    execve  /bin/sh
> kernel              process  process  process
> ------------------  -------  -------  -------
> 2.4.19-rc5-aa1        186.8    883.1   3937.9
> 2.4.20-pre4-ac1       227.9    904.5   3866.0
> 2.4.20-pre5           310.0    990.9   4178.1
> 2.5.33-mm5            244.3    949.0  71588.2

Something went horridly wrong with that in 2.5.33.  I think it got
fixed up.
 
> Context switching with 32K - times in microseconds - smaller is better
> ----------------------------------------------------------------------
>                    32prc/32k  64prc/32k  96prc/32k
> kernel             ctx swtch  ctx swtch  ctx swtch
> ----------------   ---------  ---------  ---------
> 2.4.19-rc5-aa1        35.411     65.120     64.686
> 2.4.20-pre4-ac1       30.642     49.307     56.068
> 2.4.20-pre5           17.716     27.205     43.716
> 2.5.33-mm5            21.786     49.555     63.000
> 
> Context switching with 64K - times in microseconds - smaller is better
> ----------------------------------------------------------------------
>                    16prc/64k  32prc/64k  64prc/64k
> kernel             ctx swtch  ctx swtch  ctx swtch
> ----------------   ---------  ---------  ---------
> 2.4.19-rc5-aa1        50.523    111.320    137.383
> 2.4.20-pre4-ac1       50.691     92.204    122.261
> 2.4.20-pre5           36.763     44.498    111.952
> 2.5.33-mm5            27.113     42.679    124.907
> 
> File create/delete and VM system latencies in microseconds - smaller is better
> ----------------------------------------------------------------------------
> The -aa tree has higher latency for file creation.  File delete latency is
> similar for all trees.  2.4.20-pre5 has the lowest mmap latency, 2.5.33-mm5
> the highest.
> 
>                    0K         1K       10K      10K     Mmap     Page
> kernel           Create     Create    Create   Delete   Latency  Fault
> ---------------- -------    -------   -------  -------  -------  ------
> 2.4.19-rc5-aa1    126.57     174.70    256.64    62.50   3728.2    4.00
> 2.4.20-pre4-ac1    86.92     137.28    217.73    61.22   3557.2    3.00
> 2.4.20-pre5        90.24     140.22    219.17    61.38   2673.8    3.00
> 2.5.33-mm5         93.43     143.58    225.19    63.83   4634.7    4.00

Most of the create cost is in a bit of debug code in ext2.

> *Local* Communication latencies in microseconds - smaller is better
> -------------------------------------------------------------------
> 2.5.33-mm5 has significantly lower latency here, except for TCP connection.
> 
> kernel               Pipe   AF/Unix     UDP       TCP   RPC/TCP  TCPconn
> -----------------  -------  -------  -------   -------  -------  -------
> 2.4.19-rc5-aa1      36.697   48.436  55.3271   50.8352  80.8498   88.330
> 2.4.20-pre4-ac1     34.110   56.582  53.9643   54.7447  84.4660   86.195
> 2.4.20-pre5         10.819   25.379  38.4917   45.2661  79.1166   86.745
> 2.5.33-mm5           8.337   14.122  23.6442   35.4457  77.0814  111.252
> 
> *Local* Communication bandwidths in MB/s - bigger is better
> -----------------------------------------------------------
> 
> kernel               Pipe   AF/Unix    TCP
> -----------------  -------  -------  -------
> 2.4.19-rc5-aa1      541.56   253.43   166.08
> 2.4.20-pre4-ac1     552.99   240.54   168.34
> 2.4.20-pre5         462.82   273.55   161.28
> 2.5.33-mm5          515.64   543.57   171.01
> 
> tiobench-0.3.3 creates 12 gigabytes worth of files.
> 
> Unit information
> ================
> Rate      = megabytes per second
> CPU%      = percentage of CPU used during the test
> Latency   = milliseconds
> Lat%      = percent of requests that took longer than 10 seconds
> CPU Eff   = Rate divided by CPU% - throughput per cpu load
> 
> Sequential Reads ext2
> 2.5.33-mm5 has much lower max latency when the thread count is high for
> sequential reads.  The qlogic driver in -aa helps a lot here too.
> 
>                    Num                    Avg       Maximum      Lat%   CPU
> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
> ------------------ ---  ---------------------------------------------------
> 
> 2.4.19-rc5-aa1       1   51.21 28.87%     0.226      103.26   0.00000   177
> 2.4.20-pre4-ac1      1   34.14 17.25%     0.341      851.34   0.00000   198
> 2.4.20-pre5          1   33.68 20.36%     0.345      110.11   0.00000   165
> 2.5.33               1   25.36 13.67%     0.460     1512.99   0.00000   185
> 2.5.33-mm5           1   31.73 14.80%     0.367      853.99   0.00000   214
> 
> 2.4.19-rc5-aa1     256   40.68 25.39%    64.084   107977.97   0.36264   160
> 2.4.20-pre4-ac1    256   34.51 19.63%    51.031   845159.88   0.02919   176
> 2.4.20-pre5        256   31.89 22.95%    57.236   849792.70   0.03459   139
> 2.5.33             256   24.54 14.46%    94.422   449274.89   0.09794   170
> 2.5.33-mm5         256   22.39 18.56%   104.515    24623.21   0.00000   121

That's actually a bug.  There is some form of interaction between readahead
and request merging which results in single read streams not being able
to capture the disk head for a decent amount of time.
 
> Sequential Writes ext2
> There is a dramatic reduction in cpu utilization in 2.5.33-mm5 and increase in
> throughput compared to 2.5.33 when thread count is high.
> 
>                    Num                    Avg       Maximum      Lat%   CPU
> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
> ------------------ ---  ---------------------------------------------------
> 2.4.19-rc5-aa1     128   37.40 45.99%    32.405    46333.30   0.00105    81
> 2.4.20-pre4-ac1    128   34.01 36.94%    40.121    47331.57   0.00058    92
> 2.4.20-pre5        128   32.98 49.33%    39.692    52093.19   0.01446    67
> 2.5.33             128   12.17 222.9%   108.966   910455.61   0.19503     5
> 2.5.33-mm5         128   30.78 30.03%    32.973   909931.81   0.07858   102

This test is highly dependent upon the size of the request queues.  The
queues have 128 slots and you're running 128 threads.  One would expect
to see a lot of variability with that combo.  Would be interesting to also
test 64 threads, and 256.

-aa has the "merge even after latency has expired" logic in the elevator,
which could make some difference at the 128 thread level.

> Sequential Reads ext3
> 2.5.33-mm5 has a more graceful degradation in throughput on ext3.
> Fairness is better too.
> 
>                    Num                    Avg       Maximum      Lat%   CPU
> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
> ------------------ ---  ---------------------------------------------------
> 2.4.19-rc5-aa1       1   51.13 29.59%     0.227      460.92   0.00000   173
> 2.4.20-pre4-ac1      1   34.12 17.37%     0.341     1019.65   0.00000   196
> 2.4.20-pre5          1   33.28 20.62%     0.350      137.44   0.00000   161
> 2.5.33-mm5           1   31.70 14.75%     0.367      581.89   0.00000   215

20% drop in CPU load is typical, but the reduced bandwidth on such
a simple test is unexpected.  Presumably that's the driver thing.

> 2.4.19-rc5-aa1      64    7.38  4.51%    98.947    20638.56   0.00000   164
> 2.4.20-pre4-ac1     64    6.55  3.94%   110.432    14937.49   0.00000   166
> 2.4.20-pre5         64    6.34  4.16%   111.299    14234.83   0.00000   152
> 2.5.33-mm5          64   12.29  8.51%    55.372     8799.99   0.00000   144

Hm.  Don't know.  Something went right in the 2.5 elevator, I guess.

> Sequential Writes ext3
> Here 2.5.33-mm5 is great with 1 thread, but takes a hit at 32 threads.
> Latency is pretty high too.  Cpu utilization is quite low though.
> 
>                    Num                    Avg       Maximum      Lat%   CPU
> Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
> ------------------ ---  ---------------------------------------------------
> 2.4.19-rc5-aa1       1   44.23 53.01%     0.243     6084.88   0.00000    83
> 2.4.20-pre4-ac1      1   37.86 50.66%     0.300     4288.99   0.00000    75
> 2.4.20-pre5          1   37.58 55.38%     0.295    14659.06   0.00003    68
> 2.5.33-mm5           1   54.16 65.87%     0.211     5605.87   0.00000    82
> 
> 2.4.19-rc5-aa1      32   20.86 121.6%     8.861    13693.99   0.00000    17
> 2.4.20-pre4-ac1     32   28.33 156.6%    10.041    15724.46   0.00000    18
> 2.4.20-pre5         32   22.36 114.3%    10.382    12867.96   0.00000    20
> 2.5.33-mm5          32    5.90 11.67%    52.386  1150696.62   0.08252    50

There's a funny situation where ext3 ends up going all synchronous when
someone wants to modify an inode at a particular part of the commit
lifecycle.  Perhaps that happened.


* Performance differences in recent kernels
@ 2002-09-11  3:54 rwhron
  2002-09-11  5:59 ` Andrew Morton
  2002-09-11 10:03 ` Hans Reiser
  0 siblings, 2 replies; 8+ messages in thread
From: rwhron @ 2002-09-11  3:54 UTC (permalink / raw)
  To: linux-kernel

Just to note a few differences in recent benchmarks on quad xeon 
with 3.75 gb ram and qlogic 2200 -> raid 5 array.

For AIM7, the outstanding metrics are jobs/min (high is good),
and cpu time (in seconds).  The tasks column is equivalent to
load average.

AIM7 database workload

Andrea's tree has the v6.0 qlogic driver which helps i/o a lot.
It's the only tree with that driver atm.  The other trees look
pretty similar at load averages of 32 and 256.  


kernel                   Tasks   Jobs/Min       Real    CPU     
2.4.19-rc5-aa1           32	555.4		342.2	146.1	
2.4.20-pre5              32	470.7		403.8	147.2	
2.4.20-pre4-ac1          32	472.0		402.7	142.4	
2.5.33-mm5               32	474.4		400.7	144.2	

2.4.19-rc5-aa1           256	905.2		1679.9	931.9	
2.4.20-pre5              256	769.1		1977.0 1048.5	
2.4.20-pre4-ac1          256	766.4		1984.2	945.5	
2.5.33-mm5               256	763.0		1992.9 1020.8	


AIM7 file server workload

Interesting here to note that with low load averages, 
2.5.33-mm5 is on top, but as load average increases, -aa is
ahead.

kernel                   Tasks   Jobs/Min        Real    CPU    
2.4.19-rc5-aa1           4	131.6		184.2	45.5	
2.4.20-pre5              4	132.7		182.7	44.1	
2.4.20-pre4-ac1          4	132.7		182.6	46.0	
2.5.33-mm5               4	140.4		172.6	37.7	

2.4.19-rc5-aa1           32	264.8		732.3	219.1	
2.4.20-pre5              32	230.5		841.5	265.7	
2.4.20-pre4-ac1          32	227.7		851.6	257.6	
2.5.33-mm5               32	229.8		843.7	224.7	


AIM7 shared multiuser workload

This is more cpu intensive than the other aim7 workloads.
2.5.33-mm5 is using a lot more cpu time.  That may be a bug in
the workload.  I'm investigating that.


kernel                   Tasks   Jobs/Min        Real    CPU    
2.4.19-rc5-aa1           64	2319.6		160.6	163.8	
2.4.20-pre4-ac1          64	1960.4		190.0	164.8	
2.4.20-pre5              64	1980.3		188.1	185.1	
2.5.33-mm5               64	1461.2		254.9	566.2	

2.4.19-rc5-aa1           256	2835.5		525.5	652.6	
2.4.20-pre4-ac1          256	2444.2		609.6	656.6	
2.4.20-pre5              256	2432.8		612.4	701.0	
2.5.33-mm5               256	1890.5		788.1  2316.4	


IRMAN - interactive response measurement.
2.5.33-mm5 has much lower max response time for file io. 
The standard deviation is very low too (which is good).

                   FILE_IO Response time measurements (milliseconds)
                           Max         Min         Avg       StdDev
2.4.20-pre4-ac1          40.603       0.008       0.009       0.043
2.4.20-pre5              52.405       0.009       0.011       0.080
2.5.33-mm5                2.955       0.008       0.010       0.004


autoconf-2.53 build (12 times) creates about 1.2 million processes.
It's a good fork test.  rmap slows this one down.  There is a healthy
difference between the rmap in 2.5.33-mm5 and 2.4.20-pre4-ac1.
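
The loop is roughly this (a sketch, not the actual script; output is
thrown away since only the elapsed time matters):

  for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
      ( cd autoconf-2.53 || exit 1
        make distclean > /dev/null 2>&1
        ./configure > /dev/null 2>&1 && make > /dev/null 2>&1 )
  done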

kernel                   	 seconds (smaller is better)
2.4.20-pre4-ac1          	   856.4
2.4.19-rc5-aa1           	   727.2
2.4.20-pre5              	   718.4
2.5.33                   	   799.2
2.5.33-mm5               	   782.0


Time to build the kernel 12 times.  Not a lot of difference here.

kernel                   	 seconds
2.4.19-rc5-aa1           	   718.8
2.4.20-pre4-ac1          	   735.8
2.4.20-pre5              	   728.1
2.5.33                   	   728.2
2.5.33-mm5               	   736.8


The Open Source database benchmark doesn't vary much between trees.


dbench on various filesystems.   This isn't meant to compare
filesystems because the disk geometry is different for each fs.

rmap has generally not done well on dbench when the process
count is high, but 2.5.33* on ext2 and ext3 really smokes at 
64 processes.
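
Each dbench figure below is the Throughput line averaged over the
runs (5 in the first table), collected with something along these
lines (the awk is a sketch of the averaging, not the exact script):

  for i in 1 2 3 4 5; do dbench 64; done | \
      awk '/Throughput/ { sum += $2; n++ } END { print sum/n, "MB/sec" }'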

dbench ext2 64 processes		Average	(5 runs)
2.4.19-rc5-aa1           		179.61	MB/second
2.4.20-pre4-ac1          		140.63	
2.4.20-pre5              		145.00	
2.5.33                   		220.54	
2.5.33-mm5               		214.78	

dbench ext2 192 processes		Average	
2.4.19-rc5-aa1           		155.44	
2.4.20-pre4-ac1          		 79.16	
2.4.20-pre5              		115.31	
2.5.33                   		134.27	
2.5.33-mm5               		174.17	


dbench ext3 64 processes		Average	
2.4.19-rc5-aa1           		 97.69	
2.4.20-pre4-ac1          		 59.42	
2.4.20-pre5              		 80.79	
2.5.33-mm5               		112.20	

dbench ext3 192 processes		Average	
2.4.19-rc5-aa1           		 77.06	
2.4.20-pre4-ac1          		 28.48	
2.4.20-pre5              		 58.66	
2.5.33-mm5               		 72.92	


dbench reiserfs 64 processes		Average	
2.4.19-rc5-aa1           		 70.50	
2.4.20-pre4-ac1          		 57.30	
2.4.20-pre5              		 62.60	
2.5.33-mm5               		 77.22	

dbench reiserfs 192 processes		Average	
2.4.19-rc5-aa1           		 55.37	
2.4.20-pre4-ac1          		 20.56	
2.4.20-pre5              		 44.14	
2.5.33-mm5               		 49.61	


The O(1) scheduler helps tbench a lot when the process
count is high.  The ac tree may not have the latest 
scheduler updates.  

tbench 192 processes		Average	
2.4.19-rc5-aa1           	116.76	
2.4.20-pre4-ac1          	100.30	
2.4.20-pre5              	 27.98	
2.5.33                   	115.93	
2.5.33-mm5               	117.91	


LMbench latency running /bin/sh had a big regression in the
-mm tree recently.

                      fork    execve  /bin/sh
kernel              process  process  process
------------------  -------  -------  -------
2.4.19-rc5-aa1        186.8    883.1   3937.9
2.4.20-pre4-ac1       227.9    904.5   3866.0
2.4.20-pre5           310.0    990.9   4178.1
2.5.33-mm5            244.3    949.0  71588.2
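
For reference, these columns come from LMbench's lat_proc; a single
cell should be reproducible with something like this (my guess at the
exact invocation):

  lat_proc fork      # fork+exit latency
  lat_proc exec      # fork+execve latency
  lat_proc shell     # fork + /bin/sh -c latency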


Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
                   32prc/32k  64prc/32k  96prc/32k
kernel             ctx swtch  ctx swtch  ctx swtch
----------------   ---------  ---------  ---------
2.4.19-rc5-aa1        35.411     65.120     64.686
2.4.20-pre4-ac1       30.642     49.307     56.068
2.4.20-pre5           17.716     27.205     43.716
2.5.33-mm5            21.786     49.555     63.000

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
                   16prc/64k  32prc/64k  64prc/64k  
kernel             ctx swtch  ctx swtch  ctx swtch  
----------------   ---------  ---------  ---------  
2.4.19-rc5-aa1        50.523    111.320    137.383  
2.4.20-pre4-ac1       50.691     92.204    122.261  
2.4.20-pre5           36.763     44.498    111.952  
2.5.33-mm5            27.113     42.679    124.907  
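
These are lat_ctx numbers; a single row should be reproducible with
something like the following (my guess at the invocation; -s takes
the per-process working set, then the process counts):

  lat_ctx -s 32k 32 64 96     # 32k working set, 32/64/96 processes
  lat_ctx -s 64k 16 32 64     # 64k working set, 16/32/64 processes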

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
The -aa tree has higher latency for file creation.  File delete latency is
similar for all trees.  2.4.20-pre5 has the lowest mmap latency, 2.5.33-mm5 
the highest.

                   0K         1K       10K      10K     Mmap     Page
kernel           Create     Create    Create   Delete   Latency  Fault
---------------- -------    -------   -------  -------  -------  ------
2.4.19-rc5-aa1    126.57     174.70    256.64    62.50   3728.2    4.00
2.4.20-pre4-ac1    86.92     137.28    217.73    61.22   3557.2    3.00
2.4.20-pre5        90.24     140.22    219.17    61.38   2673.8    3.00
2.5.33-mm5         93.43     143.58    225.19    63.83   4634.7    4.00

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
2.5.33-mm5 has significantly lower latency here, except for TCP connection.

kernel               Pipe   AF/Unix     UDP       TCP   RPC/TCP  TCPconn
-----------------  -------  -------  -------   -------  -------  -------
2.4.19-rc5-aa1      36.697   48.436  55.3271   50.8352  80.8498   88.330
2.4.20-pre4-ac1     34.110   56.582  53.9643   54.7447  84.4660   86.195
2.4.20-pre5         10.819   25.379  38.4917   45.2661  79.1166   86.745
2.5.33-mm5           8.337   14.122  23.6442   35.4457  77.0814  111.252

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                            
kernel               Pipe   AF/Unix    TCP  
-----------------  -------  -------  -------
2.4.19-rc5-aa1      541.56   253.43   166.08
2.4.20-pre4-ac1     552.99   240.54   168.34
2.4.20-pre5         462.82   273.55   161.28
2.5.33-mm5          515.64   543.57   171.01


tiobench-0.3.3 creates 12 gigabytes worth of files.

Unit information
================
Rate      = megabytes per second
CPU%      = percentage of CPU used during the test
Latency   = milliseconds
Lat%      = percent of requests that took longer than 10 seconds
CPU Eff   = Rate divided by CPU% - throughput per cpu load
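
For example, in the first table below, 2.4.19-rc5-aa1 with 1 thread
reads 51.21 MB/sec at 28.87% CPU, so CPU Eff = 51.21 / 0.2887 ~= 177.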

Sequential Reads ext2
2.5.33-mm5 has much lower max latency when the thread count is high for 
sequential reads.  The qlogic driver in -aa helps a lot here too.

                   Num                    Avg       Maximum      Lat%   CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
------------------ ---  ---------------------------------------------------

2.4.19-rc5-aa1       1   51.21 28.87%     0.226      103.26   0.00000   177
2.4.20-pre4-ac1      1   34.14 17.25%     0.341      851.34   0.00000   198
2.4.20-pre5          1   33.68 20.36%     0.345      110.11   0.00000   165
2.5.33               1   25.36 13.67%     0.460     1512.99   0.00000   185
2.5.33-mm5           1   31.73 14.80%     0.367      853.99   0.00000   214

2.4.19-rc5-aa1     256   40.68 25.39%    64.084   107977.97   0.36264   160
2.4.20-pre4-ac1    256   34.51 19.63%    51.031   845159.88   0.02919   176
2.4.20-pre5        256   31.89 22.95%    57.236   849792.70   0.03459   139
2.5.33             256   24.54 14.46%    94.422   449274.89   0.09794   170
2.5.33-mm5         256   22.39 18.56%   104.515    24623.21   0.00000   121

Sequential Writes ext2
There is a dramatic reduction in cpu utilization in 2.5.33-mm5 and increase in 
throughput compared to 2.5.33 when thread count is high.

                   Num                    Avg       Maximum      Lat%   CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
------------------ ---  ---------------------------------------------------
2.4.19-rc5-aa1     128   37.40 45.99%    32.405    46333.30   0.00105    81
2.4.20-pre4-ac1    128   34.01 36.94%    40.121    47331.57   0.00058    92
2.4.20-pre5        128   32.98 49.33%    39.692    52093.19   0.01446    67
2.5.33             128   12.17 222.9%   108.966   910455.61   0.19503     5
2.5.33-mm5         128   30.78 30.03%    32.973   909931.81   0.07858   102


Sequential Reads ext3
2.5.33-mm5 has a more graceful degradation in throughput on ext3.  
Fairness is better too.

                   Num                    Avg       Maximum      Lat%   CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
------------------ ---  ---------------------------------------------------
2.4.19-rc5-aa1       1   51.13 29.59%     0.227      460.92   0.00000   173
2.4.20-pre4-ac1      1   34.12 17.37%     0.341     1019.65   0.00000   196
2.4.20-pre5          1   33.28 20.62%     0.350      137.44   0.00000   161
2.5.33-mm5           1   31.70 14.75%     0.367      581.89   0.00000   215

2.4.19-rc5-aa1      64    7.38  4.51%    98.947    20638.56   0.00000   164
2.4.20-pre4-ac1     64    6.55  3.94%   110.432    14937.49   0.00000   166
2.4.20-pre5         64    6.34  4.16%   111.299    14234.83   0.00000   152
2.5.33-mm5          64   12.29  8.51%    55.372     8799.99   0.00000   144



Sequential Writes ext3
Here 2.5.33-mm5 is great with 1 thread, but takes a hit at 32 threads.  
Latency is pretty high too.  Cpu utilization is quite low though.

                   Num                    Avg       Maximum      Lat%   CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
------------------ ---  ---------------------------------------------------
2.4.19-rc5-aa1       1   44.23 53.01%     0.243     6084.88   0.00000    83
2.4.20-pre4-ac1      1   37.86 50.66%     0.300     4288.99   0.00000    75
2.4.20-pre5          1   37.58 55.38%     0.295    14659.06   0.00003    68
2.5.33-mm5           1   54.16 65.87%     0.211     5605.87   0.00000    82

2.4.19-rc5-aa1      32   20.86 121.6%     8.861    13693.99   0.00000    17
2.4.20-pre4-ac1     32   28.33 156.6%    10.041    15724.46   0.00000    18
2.4.20-pre5         32   22.36 114.3%    10.382    12867.96   0.00000    20
2.5.33-mm5          32    5.90 11.67%    52.386  1150696.62   0.08252    50


Sequential Reads on reiserfs
Don't know what happened to the 2.5 numbers here.
-aa has much higher throughput at high thread count,
but I believe that's a reiserfs change that is fixed in 2.4.20-pre6.

                   Num                    Avg       Maximum      Lat%   CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s   Eff
------------------ ---  ---------------------------------------------------
2.4.19-rc5-aa1       1   48.21 30.97%     0.241      104.82   0.00000   156
2.4.20-pre4-ac1      1   33.65 19.27%     0.346      136.95   0.00000   175
2.4.20-pre5          1   35.25 23.00%     0.330      492.30   0.00000   153

2.4.19-rc5-aa1      32   36.27 25.59%     9.946    12613.17   0.00000   142
2.4.20-pre4-ac1     32    7.08  4.73%    51.894     5808.95   0.00000   149
2.4.20-pre5         32    6.74  5.16%    53.395     8148.47   0.00000   131



Sequential Writes reiserfs - max latency is very high for everyone here.

                   Num                    Avg       Maximum      Lat%  CPU
Kernel             Thr   Rate  (CPU%)   Latency     Latency      >10s  Eff
------------------ ---  --------------------------------------------------

2.4.19-rc5-aa1     256   31.90 121.9%    67.227   166079.82   0.28051   26
2.4.20-pre4-ac1    256   23.83 128.1%    84.309   135202.89   0.27039   19
2.4.20-pre5        256   18.23 88.00%    76.265   258230.65   0.26893   21

More details and more kernel tests at:
http://home.earthlink.net/~rwhron/kernel/bigbox.html
-- 
Randy Hron


