* MMTests 0.04
@ 2012-06-20 11:32 ` Mel Gorman
From: Mel Gorman @ 2012-06-20 11:32 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

MMTests 0.04 is a configurable test suite that runs a number of common
workloads of interest to MM developers. Apparently I never sent a release
note for 0.03, so here is the changelog for both versions.

v0.04
o Add benchmarks for tbench, pipetest, lmbench, starve, memcachedtest
o Add basic benchmark to run trinity fuzz testing tool
o Add monitor that runs parallel IO in the background. Measures how much
  IO interferes with a target workload.
o Allow limited run of sysbench to save time
o Add helpers for running oprofile, taken from libhugetlbfs
o Add fsmark configurations suitable for page reclaim and metadata tests
o Add a mailserver simulator (needs work, takes too long to run)
o Tune page fault test configuration for page allocator comparisons 
o Allow greater skew when running STREAM on NUMA machines
o Add a monitor that roughly measures interactive app startup times
o Add a monitor that tracks read() latency (useful for interactivity tests)
o Add script for calculating quartiles (incomplete, not tested properly)
o Add config examples for measuring interactivity during IO (not validated)
o Add background allocator for hugepage allocations (not validated)
o Patch SystemTap installation to work with 3.4 and later kernels
o Allow use of out-of-box THP configuration

v0.03
o Add a page allocator micro-benchmark
o Add monitor for tracking processes stuck in D state
o Add a page fault micro-benchmark
o Add a memory compaction micro-benchmark
o Patch a tiobench divide-by-0 error
o Adapt systemtap for >= 3.3 kernel
o Reporting scripts for kernbench
o Reporting scripts for ddresidency

At LSF/MM, a request was made to identify a series of tests that are of
interest to MM developers and that could be used for testing the Linux
memory management subsystem. There is renewed interest in some sort of
general testing framework in the discussions for Kernel Summit 2012, so
here is what I use.

http://www.csn.ul.ie/~mel/projects/mmtests/
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.04-mmtests-0.01.tar.gz

In this release a number of stock configuration options were added.
For example, config-global-dhp__pagealloc-performance runs a number of tests
that may be able to identify performance regressions or gains in the page
allocator. Similarly, there are network and scheduler configs. There are also
more complex options. config-global-dhp__parallelio-memcachetest will run
memcachetest in the foreground while doing IO of different sizes in the
background to measure how much unrelated IO affects the throughput of an
in-memory database.
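
For reference, a run with one of these configs looks roughly like the
following. Treat it as a sketch rather than documentation: the script name
and --config option match later MMTests releases and the run name is made
up, so check the README in the tarball for the exact syntax:

  # Run the page allocator performance config; results are collected
  # under work/log/<runname> for later reporting
  ./run-mmtests.sh --config config-global-dhp__pagealloc-performance testrun-3.4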

This release is a little half-baked but I decided to release it anyway due
to the current discussions. By my own admission there are areas that need
cleaning up and there is some serious cut&paste-itis going on in parts.  I
wanted to get that all fixed up before releasing but that could take too
long. The biggest warts by far are in how reports are generated, because a
new report can be cranked out in 3 minutes whereas doing it properly would
require a redesign.  What should have happened is that the stats generation
and the reporting be completely separated, but that can still be fixed
because the raw data is captured.  The stats reporting in general needs
more work because, while some tests know how to make a better estimate of
the mean by filtering outliers, it is not being handled consistently and
the methodology needs work. The raw data is there, which I considered to
be the higher priority initially.
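
As an illustration of the sort of outlier filtering involved (this is not
how MMTests computes its estimates, just a sketch of the idea with standard
tools on a hypothetical samples.txt containing one result per line):

  # Crude trimmed mean: sort the samples, drop the minimum and maximum,
  # then average what is left (head -n -1 needs GNU coreutils)
  sort -n samples.txt | head -n -1 | tail -n +2 | \
        awk '{ sum += $1; n++ } END { if (n) print sum / n }'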

I ran a number of tests against kernels since 2.6.32 and there is a lot of
interesting stuff in there. Unfortunately I have not had the chance to dig
through it all and validate that all the tests are working exactly as
expected, so they are not all available. However, this is an example report
for one test configuration on one machine. It's a bit ugly but that was not
a high priority. The other tests work on a similar principle.

http://www.csn.ul.ie/~mel/postings/mmtests-20120620/global-dhp__pagealloc-performance/comparison.html

Just glancing through, it's possible to see interesting things and additional
investigation work that is required.

o Something awful happened in 3.2.9 across the board according to this
  machine
o Kernbench in 3.3 and 3.4 was still not great in comparison to 3.1
o Page allocator performance was ruined for a large number of releases
  but generally improved in 3.3 although it's still a bit all over
  the place [*]
o hackbench-pipes had a fun history but is mostly good in 3.4
o hackbench-sockets had a similarly fun history
o aim9 shows that page-test took a big drop after 2.6.32 and has not
  recovered yet. Some of the other tests are also very alarming
o STREAM is ok at least but that is not heavily dependent on the kernel

[*] It was this report that led to commit cc9a6c877 and the effect is
    visible if you squint hard enough. This needs a graph generator and
    double-checking that true-mean is measuring the right thing.

-- 
Mel Gorman
SUSE Labs

* Re: MMTests 0.04
  2012-06-20 11:32 ` Mel Gorman
@ 2012-06-29 11:19   ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

On Wed, Jun 20, 2012 at 12:32:52PM +0100, Mel Gorman wrote:
> MMTests 0.04 is a configurable test suite that runs a number of common
> workloads of interest to MM developers. Apparently I never sent a release
> note for 0.03 so here is the changelog for both
> 

Using MMTests 0.04 I ran a number of tests between 2.6.32 and 3.4 on three
test machines. None of them are particularly powerful but the results are
still useful because it's worth knowing how we are doing for some ordinary
cases over time.

There were 34 test configurations in all, taking between 3 and 5 days to run
all the tests for a single kernel. I expect that not all the results will be
useful when I look closer but that can be improved. I have not looked at
all the results yet and will only talk about the ones I have had a chance
to read.

I know the presentation is ugly but making it pretty was not a high
priority. The analysis is also superficial as it is time-consuming to do
a full analysis for any of these tests. In general the stats need
improving but this is also something that can be improved over time as
long as the raw data is collected. Right now I tend to look closer at the
data when I am trying to narrow a problem down to a specific area or when
a regression might have been introduced. When this happens I can usually
apply whatever stats I need manually or rerun the specific test with
additional monitoring, which is less than ideal for automation.
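
For what it's worth, pulling a comparison report out of the captured raw
data is along these lines (a sketch: the script and its options are taken
from later MMTests releases and may not match 0.04 exactly, and the run
names are made up):

  # Compare the kernbench results of two runs stored under work/log
  ./compare-mmtests.pl --directory work/log \
        --benchmark kernbench --names testrun-3.3,testrun-3.4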

Due to the superficial nature I suggest you take these summaries with a
grain of salt and draw your own conclusions.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Page allocator
  2012-06-29 11:19   ` Mel Gorman
@ 2012-06-29 11:21     ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:21 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__pagealloc-performance
Benchmarks:	kernbench vmr-aim9 vmr-stream pagealloc pft hackbench-pipes hackbench-sockets

Summary
=======
kernbench and aim9 are looking bad in a lot of areas. The page allocator
itself was in very bad shape for a long time but this has improved in 3.4.
If there are reports of page allocator intensive workloads suffering badly
in recent kernels then it may be worth backporting the barrier fix.

Benchmark notes
===============

kernbench is a simple average of five compiles of vmlinux.

vmr-aim9 is a number of micro-benchmarks. The results of this are very
sensitive to a number of factors but it can be a useful early warning system.

vmr-stream is the STREAM memory benchmark and variations in it can be
indicative of problems with cache usage.

pagealloc is a page allocator micro-benchmark run via SystemTap. The page
allocator is rarely a major component of a workload's time but it can
be a source of slow degradation of overall performance.

pft is a microbenchmark for page fault rates.

hackbench is usually used for scheduler comparisons but it can sometimes
highlight problems in the page allocator as well.
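
For anyone wanting to reproduce something like the hackbench figures by
hand, the same messaging workload is embedded in perf. This is an
equivalent sketch rather than the MMTests driver, and the group and loop
counts here are made up:

  # hackbench-pipes equivalent: 20 groups of sender/receiver tasks
  # communicating over pipes; drop -p for the sockets variant
  perf bench sched messaging -p -g 20 -l 1000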

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagealloc-performance/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Good, except for aim9
===========================================================

kernbench
---------
  2.6.32 looks quite bad, which was surprising.  That aside, there was a
  major degradation of performance between 2.6.34 and 2.6.39 that is only
  being resolved now. System CPU time was steadily getting worse for quite
  some time.

pagealloc
---------
  Page allocator performance was completely screwed for a long time with
  massive additional latencies in the alloc path. This was fixed in 3.4
  by removing barriers introduced for cpusets.

hackbench-pipes
---------------
  Generally looks ok.

hackbench-sockets
-----------------
  Some very poor results although it seems to have recovered recently. 2.6.39
  through to 3.2 were all awful.

vmr-aim9
--------
  page_test, brk_test, exec_test and fork_test all took a major pounding
  between 2.6.34 and 2.6.39. They have been improving since but are still
  far short of 2.6.32 levels in some cases.

vmr-stream
----------
  Generally looks ok.

pft
---
  Indications are we scaled better over time with a greater number of faults
  being handled when spread across CPUs.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Bad, both kernbench and aim9 need care
==========================================================

kernbench
---------
  2.6.32 looked quite bad and great in 2.6.34. Between 2.6.34 and 2.6.39
  it regressed again and got worse after that. System CPU time looks
  generally good but 3.2 and later kernels are in bad shape in terms of
  overall elapsed time.

pagealloc
---------
  As with arnold, page allocator performance was completely screwed for a
  long time but mostly resolved in 3.4.

hackbench-pipes
---------------
  This has varied considerably over time. Currently looking good but there
  was a time when high numbers of clients regressed considerably. Judging
  from when it got fixed this might be a scheduler problem rather than a
  page allocator one.

hackbench-sockets
-----------------
  This is marginal at the moment and has had some serious regressions in
  the past.

vmr-aim9
--------
  As with arnold, a lot of tests took a complete hammering mostly between
  2.6.34 and 2.6.39, with the exception of exec_test which got screwed at
  2.6.34 as well. Like arnold, it has improved in 3.4 but is still far
  short of 2.6.32.

vmr-stream
----------
  Generally looks ok.

pft
---
  Unlike arnold, the figures are worse here. It looks like we were not
  handling as many faults for some time although this is better now. It
  might be related to the page allocator being crap for a long time.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Bad, both kernbench and aim9 need care
==========================================================

kernbench
---------
  As before, 2.6.32 looked bad. 2.6.34 was good but we got worse after that.
  Elapsed time in 3.4 is screwed.

pagealloc
---------
  As with the other two, page allocator performance was bad for a long time
  but not quite as bad as the others. Maybe barriers are cheaper on the i7
  than they are on the other machines. Still, 3.4 is looking good.

hackbench-pipes
---------------
  Looks great. I suspect a lot of scheduler developers must have modern
  Intel CPUs for testing with.

hackbench-sockets
-----------------
  Not so great. Performance dropped for a while but is looking marginally
  better now.

vmr-aim9
--------
  Variation of the same story. In general this is looking worse but was
  not as consistently bad as the other two machines. Performance in 3.4
  is a mixed bag.

vmr-stream
----------
  Generally looks ok.

pft
---
  Generally looking good, tests are completing faster. There were regressions
  in old kernels but it has been looking better recently.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Network performance
  2012-06-29 11:19   ` Mel Gorman
@ 2012-06-29 11:22     ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:22 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, netdev

Configuration:	global-dhp__network-performance
Benchmarks:	netperf-udp, netperf-tcp, tbench4

Summary
=======
Some tests look good but netperf-tcp tests show a number of problems.

Benchmark notes
===============

netperf used the TCP_STREAM or UDP_STREAM tests. Server and client were bound
to CPU 0 and 1 respectively. To improve the chances of getting an accurate
reading, "-i 50,6 -I 99,1" was specified on the command line.  Personally I
tend to find netperf figures a bit unreliable; they can vary depending on the
exact starting conditions. This might be due to the test being run against
localhost or because there is no other machine activity to smooth outliers
related to cache coloring. Suggestions on how to mitigate this are welcome.
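
A manual run along these lines would look something like the following (a
sketch reconstructed from the notes above rather than the MMTests driver;
the -T CPU binding and the confidence interval options are standard netperf
but the exact combination is assumed):

  # Start the server, then run TCP_STREAM against localhost with the
  # client bound to CPU 1 and the server to CPU 0
  netserver
  netperf -t TCP_STREAM -H 127.0.0.1 -T 1,0 -i 50,6 -I 99,1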

tbench was from dbench 4 and ran for 3 minutes.
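
That corresponds roughly to the following (an assumed invocation; tbench
ships with dbench and talks to its own server process):

  # 3 minute tbench run with 4 clients against the local server
  tbench_srv &
  tbench -t 180 4 localhost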

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__network-performance/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Ok, but netperf-tcp has problems
===========================================================

netperf-udp
-----------
For the most part, this looks good. 2.6.34 and 3.2.9 were both bad
kernels for some reason but currently it looks fine. I tend to find
that netperf figures fluctuate easily.

netperf-tcp
-----------
This is less healthy; it looks like there is a fairly consistent
regression of 2-5%.

tbench4
-------
Some of these tests failed to run and the logs are unclear as to why;
it only happened on this machine and I only noticed it now. While
results are looking ok now, there were some regressions in the 3.0
through 3.2 kernels that might be of concern to -stable users.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok, but netperf-tcp has problems
==========================================================

netperf-udp
-----------
This is looking great. There was a high in 3.1 that has been
lost since but it's still better overall in comparison to
2.6.32.

netperf-tcp
-----------
This is less healthy with a lot of regression. 3.4 has mostly
regressed to the tune of 2-13% versus 2.6.32.

tbench4
-------
For the most part, this is looking ok. 2 clients seems to be
particularly problematic for some reason but otherwise it looks
good.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Bad, tbench is ok just otherwise poor
==========================================================

netperf-udp
-----------
This is not a happy story. There was a big drop between 3.2 and 3.3
and the regression is still there in comparison to 2.6.32.

netperf-tcp
-----------
This has consistently regressed since 2.6.34 with the regression very
roughly around the 10% mark.

tbench4
-------
Unlike the other tests, this is looking reasonably good with performance
gains until the number of clients gets really high. It was interesting
to note that 2.6.34 was a particularly good kernel for tbench and,
while current kernels are better than 2.6.32, they are not as good as
2.6.34.

-- 
Mel Gorman
SUSE Labs

* [MMTests] IO metadata on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-06-29 11:23     ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:23 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-metadata-ext3
Benchmarks:	dbench3, fsmark-single, fsmark-threaded

Summary
=======

  While the fsmark figures look ok, fsmark in single threaded mode has
  a small number of large outliers towards the min end of the scale. The
  resulting standard deviation fuzzes results but filtering is not necessarily
  the best answer. A similar effect is visible when running in threaded
  mode except that there is clustering around some values that could be
  presented better. Arithmetic mean is unsuitable for this sort of data.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age it. No
special mkfs or mount options were used.

dbench3 was chosen as it's metadata intensive.
  o Duration was 180 seconds
  o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

  As noted in the MMTests documentation, dbench3 can be a random number
  generator, particularly when run in asynchronous mode. Even with these
  limitations it can be useful as an early warning system, and as it's
  still used by QA teams it's worth keeping an eye on.
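
  Run manually, the setup above corresponds to something like this (a
  sketch with an assumed client count; dbench takes the client count as
  its argument and -t sets the duration in seconds):

    # 180 second asynchronous dbench run with 64 clients
    dbench -t 180 64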

FSMark
  o Parallel directories were used
  o 1 Thread per CPU
  o 0 Filesize
  o 225 directories
  o 22500 files per directory
  o 50000 files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-9227/1  -D  225  -N  22500  -n  50000  -L  15  -S0  -s  0
  Thread: ./fs_mark  -d  /tmp/fsmark-9407/1  -d  /tmp/fsmark-9407/2  -D  225  -N  22500  -n  25000  -L  15  -S0  -s  0
 
  FSMark is a more realistic indicator of metadata intensive workloads.


===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Fine
===========================================================

dbench
------
  For single clients, we're doing reasonably well. There was a big spike
  for large numbers of clients in 2.6.34 and to a lesser extent in 2.6.39.4
  but much of this is due to the operations taking place in memory without
  reaching disk. There were also fairness issues, and the indicated throughput
  figures are far higher than the disk's capabilities, so I do not consider
  this to be a regression.

  There was a mild dip for 3.2.x and 3.3.x that has been recovered somewhat
  in 3.4. As this was when IO-Less Dirty Throttling got merged it is hardly
  a surprise and a dip in dbench is not worth backing that out for.

  Recent kernels appear to do worse with large numbers of clients. However,
  I very strongly suspect this is due to improved fairness in IO. The
  high throughput figures are due to one client making an unfair amount
  of progress while other clients stall.

fsmark-single
-------------
  Again, this is looking good. Files/sec has improved slightly with the
  exception of a dip in 3.2 and 3.3 which again may be due to IO-Less
  dirty throttling.

  I have a slight concern with the overhead measurements. Somewhere
  between 3.0.23 and 3.1.10 the overhead started deviating a lot more.
  Ideally this should be bisected because it is difficult to blame
  IO-Less Throttling with any certainty.

fsmark-threaded
---------------
  Looks better, but due to high deviations it's hard to be 100% sure.
  If this is of interest then the thing to do is a proper measurement
  of whether the results are significant or not, although with 15 samples
  it will still be fuzzy.

  Right now, there is little to be concerned about.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Fine
==========================================================

dbench
------
  Very similar story to the arnold machine. Big spike in 2.6.34 and then
  it drops off.

fsmark-single
-------------
  Nothing really notable here. Deviations are too high to draw reasonable
  conclusions from. Looking at the raw results it's due to a small number
  of low outliers. These could be filtered but it would mask the fact that
  throughput is not consistent so strong justification would be required.

fsmark-threaded
---------------
  Similar to fsmark-single. Figures look ok but large deviations are
  a problem. Unlike the single-threaded case, the raw data shows that we
  cluster around two points that are very far apart from each other. It
  is worth investigating if this can be presented in some sensible
  manner such as k-means clustering because arithmetic mean with this
  sort of data is crap.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Fine
==========================================================

dbench
------
  Same as the other two, complete with spikes.

fsmark-single
-------------
  Other than overhead going crazy in 3.2 there is nothing notable either.
  As with hydra, there are a small number of outliers that result in large
  deviations.

fsmark-threaded
---------------
  Similar to hydra. Figures look basically ok but deviations are high with
  some clustering going on.

-- 
Mel Gorman
SUSE Labs

* [MMTests] IO metadata on ext4
  2012-06-29 11:19   ` Mel Gorman
@ 2012-06-29 11:24     ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:24 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-metadata-ext4
Benchmarks:	dbench3, fsmark-single, fsmark-threaded

Summary
=======
  For the most part the figures look ok currently. However, a number of
  tests show that we have declined since 3.0 in a number of areas. Some
  machines show performance drops in the 3.2 and 3.3 kernels that have
  not been fully recovered.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age it. No
special mkfs or mount options were used.

dbench3 was chosen as it's metadata intensive.
  o Duration was 180 seconds
  o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

  As noted in the MMTests documentation, dbench3 can be a random number
  generator, particularly when run in asynchronous mode. Even with these
  limitations it can be useful as an early warning system, and as it's
  still used by QA teams it's worth keeping an eye on.

FSMark
  o Parallel directories were used
  o 1 Thread per CPU
  o 0 Filesize
  o 225 directories
  o 22500 files per directory
  o 50000 files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-9227/1  -D  225  -N  22500  -n  50000  -L  15  -S0  -s  0
  Thread: ./fs_mark  -d  /tmp/fsmark-9407/1  -d  /tmp/fsmark-9407/2  -D  225  -N  22500  -n  25000  -L  15  -S0  -s  0
 
  FSMark is a more realistic indicator of metadata intensive workloads.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Fine but fsmark has declined since 3.0
===========================================================

dbench
------
  For single clients, we're doing reasonably well and this has been consistent
  with each release.

fsmark-single
-------------
  This is not as happy a story. Variations are quite high but 3.0 was a
  reasonably good kernel and we've been declining ever since with 3.4
  being marginally worse than 2.6.32.

fsmark-threaded
---------------
  The trends are very similar to fsmark-single. 3.0 was reasonably good
  but we have degraded since and are at approximately 2.6.32 levels.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Fine but fsmark has declined since 3.0
==========================================================

dbench
------
  Unlike arnold, this is looking good with solid gains in most kernels
  for the single-threaded case. The exception was 3.2.9, which saw a
  big dip that was recovered in 3.3. For higher numbers of clients the
  figures still look good. It's not clear why there is such a difference
  between arnold and hydra for the single-threaded case.

fsmark-single
-------------
  This is very similar to arnold in that 3.0 performed best and we have
  declined since back to more or less the same level as 2.6.32.

fsmark-threaded
---------------
  Performance here is flat in terms of throughput. 3.4 recorded much higher
  overhead but it is not clear if this is a cause for concern.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Fine but there have been recent declines
==========================================================

dbench
------
  Like hydra, this is looking good with solid gains in most kernels for the
  single-threaded case. The same dip in 3.2.9 is visible but unlike hydra
  it was not recovered until 3.4. Higher numbers of clients generally look
  good as well, although it is interesting to see that the dip in 3.2.9 is
  not consistently visible.

fsmark-single
-------------
  Overhead went crazy in 3.3 and there is a large drop in files/sec in
  3.3 as well. 

fsmark-threaded
---------------
  The trends are similar to the single-threaded case. Looking reasonably
  good, but there is a dip in 3.3 that has not been recovered and overhead
  is higher.

-- 
Mel Gorman
SUSE Labs

* [MMTests] IO metadata on XFS
  2012-06-29 11:19   ` Mel Gorman
@ 2012-06-29 11:25     ` Mel Gorman
From: Mel Gorman @ 2012-06-29 11:25 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel, xfs

Configuration:	global-dhp__io-metadata-xfs
Benchmarks:	dbench3, fsmark-single, fsmark-threaded

Summary
=======
Most of the figures look good and in general there has been consistently
good performance from XFS. However, fsmark-single is showing a severe
performance dip in a few cases somewhere between 3.1 and 3.4; fs-mark
running a single thread took a particularly bad dive in 3.4 on two machines
and is worth examining closer. Unfortunately it is harder to draw easy
conclusions as the gains and losses are not consistent between machines,
which may be related to the available number of CPU threads.

Benchmark notes
===============

mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most part.
	On kernels too old to support it, delaylog was removed. On kernels
	where it was the default, it was specified anyway and the warning ignored.
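
Putting those notes together, the filesystem preparation looks roughly like
the following. This is a sketch only; the device and mount point are
placeholders for whatever partition the test configuration points at.

  # assumed device and mount point, for illustration only
  mkfs.xfs -f -d agcount=8 /dev/sdX1
  mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/sdX1 /mnt/testdisk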

dbench3 was chosen as it is metadata-intensive.
  o Duration was 180 seconds
  o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

  As noted in MMTests, dbench3 can behave like a random number generator,
  particularly when run in asynchronous mode. Even with those limitations,
  it can be useful as an early warning system and, as it is still used by
  QA teams, it is worth keeping an eye on.
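
  For reference, a roughly equivalent manual invocation is below. The
  target directory is a placeholder and the flags are from dbench 3.x,
  so check your build:

  dbench -t 180 -D /tmp/dbench-testdir 1    # 180 second run, single client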

FSMark
  o Parallel directories were used
  o 1 Thread per CPU
  o 0 Filesize
  o 225 directories
  o 22500 files per directory
  o 50000 files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-9227/1  -D  225  -N  22500  -n  50000  -L  15  -S0  -s  0
  Thread: ./fs_mark  -d  /tmp/fsmark-9407/1  -d  /tmp/fsmark-9407/2  -D  225  -N  22500  -n  25000  -L  15  -S0  -s  0
 
  FSMark is a more realistic indicator of metadata intensive workloads.


===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Great
===========================================================

dbench
------
  XFS is showing steady improvements, with a large gain for a single client
  in 2.6.39 that has been more or less retained since then. This is also true
  for higher numbers of clients, although 64 clients was suspiciously poor
  even though 128 clients looked better. I didn't re-examine the raw data to
  see why.

  In general, dbench is looking very good.

fsmark-single
-------------
  Again, this is looking good. Files/sec has improved slightly with the
  exception of a small dip in 3.2 and 3.3 which may be due to IO-Less
  dirty throttling.

  Overhead measurements are a bit all over the place. Not clear if
  this is cause for concern or not.

fsmark-threaded
---------------
  Improved since 2.6.32 and has been steadily good for some time. Overhead
  measurements are all over the place. Again, not clear if this is a cause
  for concern.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

dbench
------
  The results here look very different to the arnold machine. This is curious
  because the disks have similar size and performance characteristics. It is
  doubtful that the difference is down to 32-bit versus 64-bit architectures.
  The discrepancy may be more due to the different number of CPUs and how
  XFS does locking. One possibility is that fewer CPUs have the side-effect
  of better batching of some operations, but this is only a guess.

  The figures show that throughput is worse and highly variable in 3.4
  for single clients. For higher numbers of clients the figures look better
  overall. There was a dip in 3.1-based kernels, though, for an unknown
  reason. This does not exactly correlate with the ext3 figures, although
  they showed a dip in performance at 3.2.

fsmark-single
-------------
  While performance is better than 2.6.32, there was a dip in 3.3 and a
  very large dip in 3.4.

fsmark-threaded
---------------
  The same dip in 3.4 is visible when multiple threads are used but it is
  not as severe.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Fine
==========================================================

dbench
------
  As seen on other filesystems, the data shows that there was a large dip
  in performance around 3.2 for single threads. Unlike on the hydra machine,
  this was recovered in 3.4. As higher numbers of threads are used, the gains
  and losses are inconsistent, making it hard to draw a solid conclusion.

fsmark-single
-------------
  This was doing great until 3.4, where there is a large drop.

fsmark-threaded
---------------
  Unlike the single threaded case, things are looking great here.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-06-29 11:25     ` Mel Gorman
@ 2012-07-01 23:54       ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2012-07-01 23:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel, xfs

On Fri, Jun 29, 2012 at 12:25:06PM +0100, Mel Gorman wrote:
> Configuration:	global-dhp__io-metadata-xfs
> Benchmarks:	dbench3, fsmark-single, fsmark-threaded
> 
> Summary
> =======
> Most of the figures look good and in general XFS has delivered consistently
> good performance. However, fsmark-single is showing a severe performance
> dip in a few cases somewhere between 3.1 and 3.4. fs-mark running a single
> thread took a particularly bad dive in 3.4 on two machines, which is worth
> examining more closely.

That will be caused by the fact we changed all the metadata updates
to be logged, which means a transaction every time .dirty_inode is
called.

This should mostly go away when XFS is converted to use .update_time
rather than .dirty_inode to only issue transactions when the VFS
updates the atime rather than every .dirty_inode call...

> Unfortunately it is hard to draw easy conclusions as
> the gains/losses are not consistent between machines, which may be related
> to the available number of CPU threads.

It increases the CPU overhead (dirty_inode can be called up to 4
times per write(2) call, IIRC), so with limited numbers of
threads/limited CPU power it will result in lower performance. Where
you have lots of CPU power, there will be little difference in
performance...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-01 23:54       ` Dave Chinner
@ 2012-07-02  6:32         ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2012-07-02  6:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Mel Gorman, linux-mm, linux-kernel, linux-fsdevel, xfs

On Mon, Jul 02, 2012 at 09:54:58AM +1000, Dave Chinner wrote:
> That will be caused by the fact we changed all the metadata updates
> to be logged, which means a transaction every time .dirty_inode is
> called.
> 
> This should mostly go away when XFS is converted to use .update_time
> rather than .dirty_inode to only issue transactions when the VFS
> updates the atime rather than every .dirty_inode call...

I think the patch to do that conversion still needs review..

> It increases the CPU overhead (dirty_inode can be called up to 4
> times per write(2) call, IIRC), so with limited numbers of
> threads/limited CPU power it will result in lower performance. Where
> you have lots of CPU power, there will be little difference in
> performance...

When I checked, it could only be called twice, and we already
optimize away the second call.  I'd definitely like to track down where
the performance changes happened, at least to a major version but even
better to a -rc or git commit.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-01 23:54       ` Dave Chinner
@ 2012-07-02 13:30         ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-02 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-kernel, linux-fsdevel, xfs

On Mon, Jul 02, 2012 at 09:54:58AM +1000, Dave Chinner wrote:
> On Fri, Jun 29, 2012 at 12:25:06PM +0100, Mel Gorman wrote:
> > Configuration:	global-dhp__io-metadata-xfs
> > Benchmarks:	dbench3, fsmark-single, fsmark-threaded
> > 
> > Summary
> > =======
> > Most of the figures look good and in general XFS has delivered consistently
> > good performance. However, fsmark-single is showing a severe performance
> > dip in a few cases somewhere between 3.1 and 3.4. fs-mark running a single
> > thread took a particularly bad dive in 3.4 on two machines, which is worth
> > examining more closely.
> 
> That will be caused by the fact we changed all the metadata updates
> to be logged, which means a transaction every time .dirty_inode is
> called.
> 

Ok.

> This should mostly go away when XFS is converted to use .update_time
> rather than .dirty_inode to only issue transactions when the VFS
> updates the atime rather than every .dirty_inode call...
> 

Sound. I'll keep an eye out for it in the future.  If you want to
use the same test configuration then be sure to set the partition
configuration correctly. For example, these are the values I used for
the config-global-dhp__io-metadata configuration file.

export TESTDISK_PARTITION=/dev/sda6
export TESTDISK_FILESYSTEM=xfs
export TESTDISK_MKFS_PARAM="-f -d agcount=8"
export TESTDISK_MOUNT_ARGS=inode64,delaylog,logbsize=262144,nobarrier
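
With those values in place, a run can be kicked off along the lines of the
following. The driver script name and the runname argument reflect MMTests
conventions and are an assumption here, so check the README of the release
you are using:

./run-mmtests.sh --config config-global-dhp__io-metadata 3.4.0-vanilla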

> > Unfortunately it is hard to draw easy conclusions as
> > the gains/losses are not consistent between machines, which may be related
> > to the available number of CPU threads.
> 
> It increases the CPU overhead (dirty_inode can be called up to 4
> times per write(2) call, IIRC), so with limited numbers of
> threads/limited CPU power it will result in lower performance. Where
> you have lots of CPU power, there will be little difference in
> performance...
> 

Thanks for that clarification.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-02  6:32         ` Christoph Hellwig
@ 2012-07-02 14:32           ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-02 14:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs

On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > It increases the CPU overhead (dirty_inode can be called up to 4
> > times per write(2) call, IIRC), so with limited numbers of
> > threads/limited CPU power it will result in lower performance. Where
> > you have lots of CPU power, there will be little difference in
> > performance...
> 
> When I checked, it could only be called twice, and we already
> optimize away the second call.  I'd definitely like to track down where
> the performance changes happened, at least to a major version but even
> better to a -rc or git commit.
> 

By all means feel free to run the test yourself and run the bisection :)

It's rare, but on this occasion the test machine is idle so I started an
automated git bisection. As you know, the mileage with an automated bisect
varies so it may or may not find the right commit. The test machine is sandy so
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
is the report of interest. The script is doing a full search between v3.3 and
v3.4 for a point where average files/sec for fsmark-single drops below 25000.
I did not limit the search to fs/xfs on the off-chance that an apparently
unrelated patch caused the problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-02 14:32           ` Mel Gorman
@ 2012-07-02 19:35             ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-02 19:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

Adding dri-devel and a few others because an i915 patch contributed to
the regression.

On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > times per write(2) call, IIRC), so with limited numbers of
> > > threads/limited CPU power it will result in lower performance. Where
> > > you have lots of CPU power, there will be little difference in
> > > performance...
> > 
> > When I checked, it could only be called twice, and we already
> > optimize away the second call.  I'd definitely like to track down where
> > the performance changes happened, at least to a major version but even
> > better to a -rc or git commit.
> > 
> 
> By all means feel free to run the test yourself and run the bisection :)
> 
> It's rare, but on this occasion the test machine is idle so I started an
> automated git bisection. As you know, the mileage with an automated bisect
> varies so it may or may not find the right commit. The test machine is sandy so
> http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> is the report of interest. The script is doing a full search between v3.3 and
> v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> I did not limit the search to fs/xfs on the off-chance that an apparently
> unrelated patch caused the problem.
> 

It was obvious very quickly that there were two distinct regressions, so I
ran two bisections. One led to an XFS patch and the other to an i915 patch
that enables RC6 to reduce power usage.

[c999a223: xfs: introduce an allocation workqueue]
[aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

gdm was running on the machine so i915 would have been in use.  In case it
is of interest, this is the log of the bisection. Lines beginning with #
are notes I made and all other lines are from the bisection script. The
second-last column is the files/sec recorded by fsmark.
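
The bisection script itself is not included here, but a minimal sketch of
the kind of test predicate such a search needs is below. The threshold is
the one from the first search; the build step, result location and the awk
extraction are assumptions for illustration, and the real harness also has
to boot each freshly built kernel before measuring, which a plain script
cannot do by itself:

  #!/bin/sh
  # Exit 0 (good) if fsmark-single mean files/sec stays at or above the
  # threshold, 1 (bad) if it drops below, 125 to skip unbuildable commits.
  THRESHOLD=28000                    # first search; the second used 20000
  RESULTS=/path/to/fsmark-summary    # assumed location of the parsed result

  make -j"$(nproc)" || exit 125

  # ... install the kernel, boot into it and run the fsmark-single
  # configuration here; the real harness handles this step externally ...

  FILES_SEC=$(awk '/mean/ {print int($2)}' "$RESULTS")  # assumed log format
  if [ "$FILES_SEC" -ge "$THRESHOLD" ]; then
          exit 0
  else
          exit 1
  fi

With a predicate like that in place, the search reduces to git bisect start
v3.4 v3.3 followed by git bisect run ./fsmark-bisect.sh, where
fsmark-bisect.sh is the sketch above.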

# MARK v3.3..v3.4 Search for BAD files/sec -lt 28000
# BAD 16536
# GOOD 34757
Mon Jul 2 15:46:13 IST 2012 sandy xfsbisect 141124c02059eee9dbc5c86ea797b1ca888e77f7 37454 good
Mon Jul 2 15:56:06 IST 2012 sandy xfsbisect 55a320308902f7a0746569ee57eeb3f254e6ed16 25192 bad
Mon Jul 2 16:08:34 IST 2012 sandy xfsbisect 281b05392fc2cb26209b4d85abaf4889ab1991f3 38807 good
Mon Jul 2 16:18:02 IST 2012 sandy xfsbisect a8364d5555b2030d093cde0f07951628e55454e1 37553 good
Mon Jul 2 16:27:22 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 36676 good
Mon Jul 2 16:36:48 IST 2012 sandy xfsbisect 2e7580b0e75d771d93e24e681031a165b1d31071 37756 good
Mon Jul 2 16:46:36 IST 2012 sandy xfsbisect 532bfc851a7475fb6a36c1e953aa395798a7cca7 25416 bad
Mon Jul 2 16:56:10 IST 2012 sandy xfsbisect 0c9aac08261512d70d7d4817bd222abca8b6bdd6 38486 good
Mon Jul 2 17:05:40 IST 2012 sandy xfsbisect 0fc9d1040313047edf6a39fd4d7c7defdca97c62 37970 good
Mon Jul 2 17:16:01 IST 2012 sandy xfsbisect 5a5881cdeec2c019b5c9a307800218ee029f7f61 24493 bad
Mon Jul 2 17:21:15 IST 2012 sandy xfsbisect f616137519feb17b849894fcbe634a021d3fa7db 24405 bad
Mon Jul 2 17:26:16 IST 2012 sandy xfsbisect 5575acc7807595687288b3bbac15103f2a5462e1 37336 good
Mon Jul 2 17:31:25 IST 2012 sandy xfsbisect c999a223c2f0d31c64ef7379814cea1378b2b800 24552 bad
Mon Jul 2 17:36:34 IST 2012 sandy xfsbisect 1a1d772433d42aaff7315b3468fef5951604f5c6 36872 good
# c999a223c2f0d31c64ef7379814cea1378b2b800 is the first bad commit
# [c999a223: xfs: introduce an allocation workqueue]
#
# MARK c999a223c2f0d31c64ef7379814cea1378b2b800..v3.4 Search for BAD files/sec -lt 20000
# BAD  16536
# GOOD 24552
Mon Jul 2 17:48:39 IST 2012 sandy xfsbisect b2094ef840697bc8ca5d17a83b7e30fad5f1e9fa 37435 good
Mon Jul 2 17:58:12 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 38303 good
Mon Jul 2 18:08:18 IST 2012 sandy xfsbisect 5d32c88f0b94061b3af2e3ade92422407282eb12 16718 bad
Mon Jul 2 18:18:02 IST 2012 sandy xfsbisect 2f7fa1be66dce77608330c5eb918d6360b5525f2 24964 good
Mon Jul 2 18:24:14 IST 2012 sandy xfsbisect 923f79743c76583ed4684e2c80c8da51a7268af3 24963 good
Mon Jul 2 18:33:49 IST 2012 sandy xfsbisect b61c37f57988567c84359645f8202a7c84bc798a 24824 good
Mon Jul 2 18:40:20 IST 2012 sandy xfsbisect 20a2a811602b16c42ce88bada3d52712cdfb988b 17155 bad
Mon Jul 2 18:50:12 IST 2012 sandy xfsbisect 78fb72f7936c01d5b426c03a691eca082b03f2b9 38494 good
Mon Jul 2 19:00:24 IST 2012 sandy xfsbisect e1a7eb08ee097e97e928062a242b0de5b2599a11 25033 good
Mon Jul 2 19:10:24 IST 2012 sandy xfsbisect 97effadb65ed08809e1720c8d3ee80b73a93665c 16520 bad
Mon Jul 2 19:16:16 IST 2012 sandy xfsbisect 25e341cfc33d94435472983825163e97fe370a6c 16748 bad
Mon Jul 2 19:21:52 IST 2012 sandy xfsbisect 7dd4906586274f3945f2aeaaa5a33b451c3b4bba 24957 good
Mon Jul 2 19:27:35 IST 2012 sandy xfsbisect aa46419186992e6b8b8010319f0ca7f40a0d13f5 17088 bad
Mon Jul 2 19:32:54 IST 2012 sandy xfsbisect 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d 25667 good
# aa46419186992e6b8b8010319f0ca7f40a0d13f5 is the first bad commit
# [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

I tested plain reverts of the patches individually and together and got
the following results:

FS-Mark Single Threaded
                                        3.4.0                3.4.0                 3.4.0
                 3.4.0-vanilla          revert-aa464191      revert-c999a223       revert-both
Files/s  min       14176.40 ( 0.00%)    17830.60 (25.78%)    24186.70 (70.61%)    25108.00 (77.11%)
Files/s  mean      16783.35 ( 0.00%)    25029.69 (49.13%)    37513.72 (123.52%)   38169.97 (127.43%)
Files/s  stddev     1007.26 ( 0.00%)     2644.87 (162.58%)     5344.99 (430.65%)   5599.65 (455.93%)
Files/s  max       18475.40 ( 0.00%)    27966.10 (51.37%)    45564.60 (146.62%)   47918.10 (159.36%)
Overhead min      593978.00 ( 0.00%)   386173.00 (34.99%)   253812.00 (57.27%)   247396.00 (58.35%)
Overhead mean     637782.80 ( 0.00%)   429229.33 (32.70%)   322868.20 (49.38%)   287141.73 (54.98%)
Overhead stddev    72440.72 ( 0.00%)   100056.96 (-38.12%)   175001.08 (-141.58%)   102018.14 (-40.83%)
Overhead max      855637.00 ( 0.00%)   753541.00 (11.93%)   880531.00 (-2.91%)   637932.00 (25.44%)
MMTests Statistics: duration
Sys Time Running Test (seconds)              44.06     32.25     24.19     23.99
User+Sys Time Running Test (seconds)         50.19     36.35     27.24      26.7
Total Elapsed Time (seconds)                 59.21     44.76     34.95     34.14

Individually reverting either patch makes a difference to both files/sec
and overhead. Reverting both is not as dramatic as the individual reverts
would suggest, but it is still a major improvement.
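
For anyone who wants to reproduce the revert kernels, a minimal sketch is
below, using the commit IDs from the bisection log. It assumes both reverts
apply cleanly on top of v3.4; resolve any conflicts manually if they do not:

  git checkout -b revert-both v3.4
  git revert --no-edit aa46419186992e6b8b8010319f0ca7f40a0d13f5
  git revert --no-edit c999a223c2f0d31c64ef7379814cea1378b2b800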

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
@ 2012-07-02 19:35             ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-02 19:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

Adding dri-devel and a few others because an i915 patch contributed to
the regression.

On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > times per write(2) call, IIRC), so with limited numbers of
> > > threads/limited CPU power it will result in lower performance. Where
> > > you have lots of CPU power, there will be little difference in
> > > performance...
> > 
> > When I checked it it could only be called twice, and we'd already
> > optimize away the second call.  I'd defintively like to track down where
> > the performance changes happend, at least to a major version but even
> > better to a -rc or git commit.
> > 
> 
> By all means feel free to run the test yourself and run the bisection :)
> 
> It's rare but on this occasion the test machine is idle so I started an
> automated git bisection. As you know the milage with an automated bisect
> varies so it may or may not find the right commit. Test machine is sandy so
> http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> is the report of interest. The script is doing a full search between v3.3 and
> v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> I did not limit the search to fs/xfs on the off-chance that it is an
> apparently unrelated patch that caused the problem.
> 

It was obvious very quickly that there were two distinct regression so I
ran two bisections. One led to a XFS and the other led to an i915 patch
that enables RC6 to reduce power usage.

[c999a223: xfs: introduce an allocation workqueue]
[aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

gdm was running on the machine so i915 would have been in use.  In case it
is of interest this is the log of the bisection. Lines beginning with #
are notes I made and all other lines are from the bisection script. The
second-last column is the files/sec recorded by fsmark.

# MARK v3.3..v3.4 Search for BAD files/sec -lt 28000
# BAD 16536
# GOOD 34757
Mon Jul 2 15:46:13 IST 2012 sandy xfsbisect 141124c02059eee9dbc5c86ea797b1ca888e77f7 37454 good
Mon Jul 2 15:56:06 IST 2012 sandy xfsbisect 55a320308902f7a0746569ee57eeb3f254e6ed16 25192 bad
Mon Jul 2 16:08:34 IST 2012 sandy xfsbisect 281b05392fc2cb26209b4d85abaf4889ab1991f3 38807 good
Mon Jul 2 16:18:02 IST 2012 sandy xfsbisect a8364d5555b2030d093cde0f07951628e55454e1 37553 good
Mon Jul 2 16:27:22 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 36676 good
Mon Jul 2 16:36:48 IST 2012 sandy xfsbisect 2e7580b0e75d771d93e24e681031a165b1d31071 37756 good
Mon Jul 2 16:46:36 IST 2012 sandy xfsbisect 532bfc851a7475fb6a36c1e953aa395798a7cca7 25416 bad
Mon Jul 2 16:56:10 IST 2012 sandy xfsbisect 0c9aac08261512d70d7d4817bd222abca8b6bdd6 38486 good
Mon Jul 2 17:05:40 IST 2012 sandy xfsbisect 0fc9d1040313047edf6a39fd4d7c7defdca97c62 37970 good
Mon Jul 2 17:16:01 IST 2012 sandy xfsbisect 5a5881cdeec2c019b5c9a307800218ee029f7f61 24493 bad
Mon Jul 2 17:21:15 IST 2012 sandy xfsbisect f616137519feb17b849894fcbe634a021d3fa7db 24405 bad
Mon Jul 2 17:26:16 IST 2012 sandy xfsbisect 5575acc7807595687288b3bbac15103f2a5462e1 37336 good
Mon Jul 2 17:31:25 IST 2012 sandy xfsbisect c999a223c2f0d31c64ef7379814cea1378b2b800 24552 bad
Mon Jul 2 17:36:34 IST 2012 sandy xfsbisect 1a1d772433d42aaff7315b3468fef5951604f5c6 36872 good
# c999a223c2f0d31c64ef7379814cea1378b2b800 is the first bad commit
# [c999a223: xfs: introduce an allocation workqueue]
#
# MARK c999a223c2f0d31c64ef7379814cea1378b2b800..v3.4 Search for BAD files/sec -lt 20000
# BAD  16536
# GOOD 24552
Mon Jul 2 17:48:39 IST 2012 sandy xfsbisect b2094ef840697bc8ca5d17a83b7e30fad5f1e9fa 37435 good
Mon Jul 2 17:58:12 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 38303 good
Mon Jul 2 18:08:18 IST 2012 sandy xfsbisect 5d32c88f0b94061b3af2e3ade92422407282eb12 16718 bad
Mon Jul 2 18:18:02 IST 2012 sandy xfsbisect 2f7fa1be66dce77608330c5eb918d6360b5525f2 24964 good
Mon Jul 2 18:24:14 IST 2012 sandy xfsbisect 923f79743c76583ed4684e2c80c8da51a7268af3 24963 good
Mon Jul 2 18:33:49 IST 2012 sandy xfsbisect b61c37f57988567c84359645f8202a7c84bc798a 24824 good
Mon Jul 2 18:40:20 IST 2012 sandy xfsbisect 20a2a811602b16c42ce88bada3d52712cdfb988b 17155 bad
Mon Jul 2 18:50:12 IST 2012 sandy xfsbisect 78fb72f7936c01d5b426c03a691eca082b03f2b9 38494 good
Mon Jul 2 19:00:24 IST 2012 sandy xfsbisect e1a7eb08ee097e97e928062a242b0de5b2599a11 25033 good
Mon Jul 2 19:10:24 IST 2012 sandy xfsbisect 97effadb65ed08809e1720c8d3ee80b73a93665c 16520 bad
Mon Jul 2 19:16:16 IST 2012 sandy xfsbisect 25e341cfc33d94435472983825163e97fe370a6c 16748 bad
Mon Jul 2 19:21:52 IST 2012 sandy xfsbisect 7dd4906586274f3945f2aeaaa5a33b451c3b4bba 24957 good
Mon Jul 2 19:27:35 IST 2012 sandy xfsbisect aa46419186992e6b8b8010319f0ca7f40a0d13f5 17088 bad
Mon Jul 2 19:32:54 IST 2012 sandy xfsbisect 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d 25667 good
# aa46419186992e6b8b8010319f0ca7f40a0d13f5 is the first bad commit
# [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

I tested plain reverts of the patches individually and together and got
the following results 

FS-Mark Single Threaded
                                        3.4.0                3.4.0                 3.4.0
                 3.4.0-vanilla          revert-aa464191      revert-c999a223       revert-both
Files/s  min       14176.40 ( 0.00%)    17830.60 (25.78%)    24186.70 (70.61%)    25108.00 (77.11%)
Files/s  mean      16783.35 ( 0.00%)    25029.69 (49.13%)    37513.72 (123.52%)   38169.97 (127.43%)
Files/s  stddev     1007.26 ( 0.00%)     2644.87 (162.58%)     5344.99 (430.65%)   5599.65 (455.93%)
Files/s  max       18475.40 ( 0.00%)    27966.10 (51.37%)    45564.60 (146.62%)   47918.10 (159.36%)
Overhead min      593978.00 ( 0.00%)   386173.00 (34.99%)   253812.00 (57.27%)   247396.00 (58.35%)
Overhead mean     637782.80 ( 0.00%)   429229.33 (32.70%)   322868.20 (49.38%)   287141.73 (54.98%)
Overhead stddev    72440.72 ( 0.00%)   100056.96 (-38.12%)   175001.08 (-141.58%)   102018.14 (-40.83%)
Overhead max      855637.00 ( 0.00%)   753541.00 (11.93%)   880531.00 (-2.91%)   637932.00 (25.44%)
MMTests Statistics: duration
Sys Time Running Test (seconds)              44.06     32.25     24.19     23.99
User+Sys Time Running Test (seconds)         50.19     36.35     27.24      26.7
Total Elapsed Time (seconds)                 59.21     44.76     34.95     34.14

Individually reverting either patch makes a difference to both files/sec
and overhead. Reverting both is not as dramatic as reverting each individual
patch would indicate but it's still a major improvement.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
@ 2012-07-02 19:35             ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-02 19:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Packard, Chris Wilson, Daniel Vetter, linux-kernel,
	dri-devel, xfs, linux-mm, linux-fsdevel, Eugeni Dodonov

Adding dri-devel and a few others because an i915 patch contributed to
the regression.

On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > times per write(2) call, IIRC), so with limited numbers of
> > > threads/limited CPU power it will result in lower performance. Where
> > > you have lots of CPU power, there will be little difference in
> > > performance...
> > 
> > When I checked it it could only be called twice, and we'd already
> > optimize away the second call.  I'd defintively like to track down where
> > the performance changes happend, at least to a major version but even
> > better to a -rc or git commit.
> > 
> 
> By all means feel free to run the test yourself and run the bisection :)
> 
> It's rare but on this occasion the test machine is idle so I started an
> automated git bisection. As you know the milage with an automated bisect
> varies so it may or may not find the right commit. Test machine is sandy so
> http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> is the report of interest. The script is doing a full search between v3.3 and
> v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> I did not limit the search to fs/xfs on the off-chance that it is an
> apparently unrelated patch that caused the problem.
> 

It was obvious very quickly that there were two distinct regression so I
ran two bisections. One led to a XFS and the other led to an i915 patch
that enables RC6 to reduce power usage.

[c999a223: xfs: introduce an allocation workqueue]
[aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

gdm was running on the machine so i915 would have been in use.  In case it
is of interest this is the log of the bisection. Lines beginning with #
are notes I made and all other lines are from the bisection script. The
second-last column is the files/sec recorded by fsmark.

# MARK v3.3..v3.4 Search for BAD files/sec -lt 28000
# BAD 16536
# GOOD 34757
Mon Jul 2 15:46:13 IST 2012 sandy xfsbisect 141124c02059eee9dbc5c86ea797b1ca888e77f7 37454 good
Mon Jul 2 15:56:06 IST 2012 sandy xfsbisect 55a320308902f7a0746569ee57eeb3f254e6ed16 25192 bad
Mon Jul 2 16:08:34 IST 2012 sandy xfsbisect 281b05392fc2cb26209b4d85abaf4889ab1991f3 38807 good
Mon Jul 2 16:18:02 IST 2012 sandy xfsbisect a8364d5555b2030d093cde0f07951628e55454e1 37553 good
Mon Jul 2 16:27:22 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 36676 good
Mon Jul 2 16:36:48 IST 2012 sandy xfsbisect 2e7580b0e75d771d93e24e681031a165b1d31071 37756 good
Mon Jul 2 16:46:36 IST 2012 sandy xfsbisect 532bfc851a7475fb6a36c1e953aa395798a7cca7 25416 bad
Mon Jul 2 16:56:10 IST 2012 sandy xfsbisect 0c9aac08261512d70d7d4817bd222abca8b6bdd6 38486 good
Mon Jul 2 17:05:40 IST 2012 sandy xfsbisect 0fc9d1040313047edf6a39fd4d7c7defdca97c62 37970 good
Mon Jul 2 17:16:01 IST 2012 sandy xfsbisect 5a5881cdeec2c019b5c9a307800218ee029f7f61 24493 bad
Mon Jul 2 17:21:15 IST 2012 sandy xfsbisect f616137519feb17b849894fcbe634a021d3fa7db 24405 bad
Mon Jul 2 17:26:16 IST 2012 sandy xfsbisect 5575acc7807595687288b3bbac15103f2a5462e1 37336 good
Mon Jul 2 17:31:25 IST 2012 sandy xfsbisect c999a223c2f0d31c64ef7379814cea1378b2b800 24552 bad
Mon Jul 2 17:36:34 IST 2012 sandy xfsbisect 1a1d772433d42aaff7315b3468fef5951604f5c6 36872 good
# c999a223c2f0d31c64ef7379814cea1378b2b800 is the first bad commit
# [c999a223: xfs: introduce an allocation workqueue]
#
# MARK c999a223c2f0d31c64ef7379814cea1378b2b800..v3.4 Search for BAD files/sec -lt 20000
# BAD  16536
# GOOD 24552
Mon Jul 2 17:48:39 IST 2012 sandy xfsbisect b2094ef840697bc8ca5d17a83b7e30fad5f1e9fa 37435 good
Mon Jul 2 17:58:12 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 38303 good
Mon Jul 2 18:08:18 IST 2012 sandy xfsbisect 5d32c88f0b94061b3af2e3ade92422407282eb12 16718 bad
Mon Jul 2 18:18:02 IST 2012 sandy xfsbisect 2f7fa1be66dce77608330c5eb918d6360b5525f2 24964 good
Mon Jul 2 18:24:14 IST 2012 sandy xfsbisect 923f79743c76583ed4684e2c80c8da51a7268af3 24963 good
Mon Jul 2 18:33:49 IST 2012 sandy xfsbisect b61c37f57988567c84359645f8202a7c84bc798a 24824 good
Mon Jul 2 18:40:20 IST 2012 sandy xfsbisect 20a2a811602b16c42ce88bada3d52712cdfb988b 17155 bad
Mon Jul 2 18:50:12 IST 2012 sandy xfsbisect 78fb72f7936c01d5b426c03a691eca082b03f2b9 38494 good
Mon Jul 2 19:00:24 IST 2012 sandy xfsbisect e1a7eb08ee097e97e928062a242b0de5b2599a11 25033 good
Mon Jul 2 19:10:24 IST 2012 sandy xfsbisect 97effadb65ed08809e1720c8d3ee80b73a93665c 16520 bad
Mon Jul 2 19:16:16 IST 2012 sandy xfsbisect 25e341cfc33d94435472983825163e97fe370a6c 16748 bad
Mon Jul 2 19:21:52 IST 2012 sandy xfsbisect 7dd4906586274f3945f2aeaaa5a33b451c3b4bba 24957 good
Mon Jul 2 19:27:35 IST 2012 sandy xfsbisect aa46419186992e6b8b8010319f0ca7f40a0d13f5 17088 bad
Mon Jul 2 19:32:54 IST 2012 sandy xfsbisect 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d 25667 good
# aa46419186992e6b8b8010319f0ca7f40a0d13f5 is the first bad commit
# [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
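
For anyone wanting to reproduce this style of search, it maps fairly
directly onto git bisect run. A minimal sketch, assuming a hypothetical
check-fsmark.sh that reads the benchmark result from a made-up
files-sec.result file (the real harness also has to build, install and
boot each candidate kernel between runs, which is glossed over here):

git bisect start v3.4 v3.3	# bad revision first, then good
git bisect run ./check-fsmark.sh 25000

#!/bin/sh
# check-fsmark.sh (hypothetical): exit 0 (good) if mean files/sec is
# at or above the threshold in $1, exit 1 (bad) otherwise.
THRESHOLD="$1"
FILES_SEC=$(cat files-sec.result 2>/dev/null || echo 0)
[ "$FILES_SEC" -ge "$THRESHOLD" ] && exit 0	# good
exit 1						# bad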

I tested plain reverts of the patches individually and together and got
the following results.

FS-Mark Single Threaded
                                        3.4.0                3.4.0                 3.4.0
                 3.4.0-vanilla          revert-aa464191      revert-c999a223       revert-both
Files/s  min       14176.40 ( 0.00%)    17830.60 (25.78%)    24186.70 (70.61%)    25108.00 (77.11%)
Files/s  mean      16783.35 ( 0.00%)    25029.69 (49.13%)    37513.72 (123.52%)   38169.97 (127.43%)
Files/s  stddev     1007.26 ( 0.00%)     2644.87 (162.58%)     5344.99 (430.65%)   5599.65 (455.93%)
Files/s  max       18475.40 ( 0.00%)    27966.10 (51.37%)    45564.60 (146.62%)   47918.10 (159.36%)
Overhead min      593978.00 ( 0.00%)   386173.00 (34.99%)   253812.00 (57.27%)   247396.00 (58.35%)
Overhead mean     637782.80 ( 0.00%)   429229.33 (32.70%)   322868.20 (49.38%)   287141.73 (54.98%)
Overhead stddev    72440.72 ( 0.00%)   100056.96 (-38.12%)   175001.08 (-141.58%)   102018.14 (-40.83%)
Overhead max      855637.00 ( 0.00%)   753541.00 (11.93%)   880531.00 (-2.91%)   637932.00 (25.44%)
MMTests Statistics: duration
Sys Time Running Test (seconds)              44.06     32.25     24.19     23.99
User+Sys Time Running Test (seconds)         50.19     36.35     27.24      26.7
Total Elapsed Time (seconds)                 59.21     44.76     34.95     34.14

Individually reverting either patch makes a difference to both files/sec
and overhead. Reverting both does not add up to the gains the individual
reverts would suggest, but it's still a major improvement.
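
(The revert kernels were nothing special; on top of v3.4 they amount to
something like the following, with each commit also reverted on its own
branch for the individual kernels:)

git checkout -b revert-both v3.4
git revert --no-edit aa464191 c999a223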

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-02 19:35             ` Mel Gorman
@ 2012-07-03  0:19               ` Dave Chinner
  0 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2012-07-03  0:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> Adding dri-devel and a few others because an i915 patch contributed to
> the regression.
> 
> On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > times per write(2) call, IIRC), so with limited numbers of
> > > > threads/limited CPU power it will result in lower performance. Where
> > > > you have lots of CPU power, there will be little difference in
> > > > performance...
> > > 
> > > When I checked it, it could only be called twice, and we'd already
> > > optimize away the second call.  I'd definitely like to track down where
> > > the performance changes happened, at least to a major version but even
> > > better to a -rc or git commit.
> > > 
> > 
> > By all means feel free to run the test yourself and run the bisection :)
> > 
> > It's rare, but on this occasion the test machine is idle so I started an
> > automated git bisection. As you know, the mileage with an automated bisect
> > varies, so it may or may not find the right commit. The test machine is sandy, so
> > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > is the report of interest. The script is doing a full search between v3.3 and
> > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > I did not limit the search to fs/xfs on the off-chance that it is an
> > apparently unrelated patch that caused the problem.
> > 
> 
> It was obvious very quickly that there were two distinct regressions, so I
> ran two bisections. One led to an XFS patch and the other to an i915 patch
> that enables RC6 to reduce power usage.
> 
> [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

Doesn't seem to be the major cause of the regression. By itself, it
has impact, but the majority comes from the XFS change...

> [c999a223: xfs: introduce an allocation workqueue]

Which indicates that there are workqueue scheduling issues, I think.
The same amount of work is being done, but half of it is being
pushed off into a workqueue to avoid stack overflow issues (*).  I
tested the above patch in anger on an 8p machine, similar to the
machine you saw no regressions on, but the workload didn't drive it
to being completely CPU bound (only about 90%) so the allocation
work was probably always scheduled quickly.

How many worker threads have been spawned on these machines
that are showing the regression? What is the context switch rate on
the machines when the test is running? Can you run latencytop to see
if there are excessive starvation/wait times for allocation
completion? A perf top profile comparison might be informative,
too...

(*) The stack usage below submit_bio() can be more than 5k (DM, MD,
SCSI, driver, memory allocation), so it's really not safe to do
allocation once more than about 3k of kernel stack is already in use. e.g.
on a relatively trivial storage setup without the above commit:

[142296.384921] flush-253:4 used greatest stack depth: 360 bytes left

Fundamentally, 8k stacks on x86-64 are too small for our
increasingly complex storage layers and the 100+ function deep call
chains that occur.
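
For anyone wanting to quantify this on their own setup: the "used
greatest stack depth" lines come from CONFIG_DEBUG_STACK_USAGE, and the
ftrace stack tracer (CONFIG_STACK_TRACER) records the worst case and the
call chain that produced it, roughly:

echo 1 > /proc/sys/kernel/stack_tracer_enabled
# ... run the workload ...
cat /sys/kernel/debug/tracing/stack_max_size	# deepest stack seen (bytes)
cat /sys/kernel/debug/tracing/stack_trace	# call chain at that depth
echo 0 > /proc/sys/kernel/stack_tracer_enabled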

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03  0:19               ` Dave Chinner
@ 2012-07-03 10:59                 ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 10:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > Adding dri-devel and a few others because an i915 patch contributed to
> > the regression.
> > 
> > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > you have lots of CPU power, there will be little difference in
> > > > > performance...
> > > > 
> > > > When I checked it, it could only be called twice, and we'd already
> > > > optimize away the second call.  I'd definitely like to track down where
> > > > the performance changes happened, at least to a major version but even
> > > > better to a -rc or git commit.
> > > > 
> > > 
> > > By all means feel free to run the test yourself and run the bisection :)
> > > 
> > > It's rare, but on this occasion the test machine is idle so I started an
> > > automated git bisection. As you know, the mileage with an automated bisect
> > > varies, so it may or may not find the right commit. The test machine is sandy, so
> > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > is the report of interest. The script is doing a full search between v3.3 and
> > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > apparently unrelated patch that caused the problem.
> > > 
> > 
> > It was obvious very quickly that there were two distinct regressions, so I
> > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > that enables RC6 to reduce power usage.
> > 
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> 
> Doesn't seem to be the major cause of the regression. By itself, it
> has impact, but the majority comes from the XFS change...
> 

The fact it has an impact at all is weird, but let's see what the DRI
folks think about it.

> > [c999a223: xfs: introduce an allocation workqueue]
> 
> Which indicates that there are workqueue scheduling issues, I think.
> The same amount of work is being done, but half of it is being
> pushed off into a workqueue to avoid stack overflow issues (*).  I
> tested the above patch in anger on an 8p machine, similar to the
> machine you saw no regressions on, but the workload didn't drive it
> to being completely CPU bound (only about 90%) so the allocation
> work was probably always scheduled quickly.
> 

What test were you using?

> How many worker threads have been spawned on these machines
> that are showing the regression?

20 or 21 generally. An example list as spotted by top looks like:

kworker/0:0        
kworker/0:1        
kworker/0:2        
kworker/1:0        
kworker/1:1        
kworker/1:2        
kworker/2:0        
kworker/2:1        
kworker/2:2        
kworker/3:0        
kworker/3:1        
kworker/3:2        
kworker/4:0        
kworker/4:1        
kworker/5:0        
kworker/5:1        
kworker/6:0        
kworker/6:1        
kworker/6:2        
kworker/7:0        
kworker/7:1

There were 8 unbound workers.
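
(Counted along these lines; in 3.x kernels the unbound workers show up
with a kworker/u prefix:)

ps -eo comm= | grep -c '^kworker'	# all worker threads
ps -eo comm= | grep -c '^kworker/u'	# unbound workers only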

> What is the context switch rate on the machines when the test is running?

This is vmstat from a vanilla kernel. The actual vmstat output is after
the --. The information before that is recorded by mmtests to try to
detect whether there was jitter in the vmstat output; it shows there is
little or no jitter in this test.

VANILLA
 1341306582.6713   1.8109     1.8109 --  0  0      0 16050784  11448 104056    0    0   376     0  209  526  0  0 99  1  0
 1341306584.6715   3.8112     2.0003 --  1  0      0 16050628  11448 104064    0    0     0     0  121  608  0  0 100  0  0
 1341306586.6718   5.8114     2.0003 --  0  0      0 16047432  11460 104288    0    0   102    45  227  999  0  0 99  1  0
 1341306588.6721   7.8117     2.0003 --  1  0      0 16046944  11460 104292    0    0     0     0  120  663  0  0 100  0  0
 1341306590.6723   9.8119     2.0002 --  0  2      0 16045788  11476 104296    0    0    12    40  190  754  0  0 99  0  0
 1341306592.6725  11.8121     2.0002 --  0  1      0 15990236  12600 141724    0    0 19054    30 1400 2937  2  1 88  9  0
 1341306594.6727  13.8124     2.0002 --  1  0      0 15907628  12600 186360    0    0  1653     0 3117 6406  2  9 88  1  0
 1341306596.6730  15.8127     2.0003 --  0  0      0 15825964  12608 226636    0    0    15 11024 3073 6350  2  9 89  0  0
 1341306598.6733  17.8130     2.0003 --  1  0      0 15730420  12608 271632    0    0     0  3072 3461 7179  2 10 88  0  0
 1341306600.6736  19.8132     2.0003 --  1  0      0 15686200  12608 310816    0    0     0 12416 3093 6198  2  9 89  0  0
 1341306602.6738  21.8135     2.0003 --  2  0      0 15593588  12616 354928    0    0     0    32 3482 7146  2 11 87  0  0
 1341306604.6741  23.8138     2.0003 --  2  0      0 15562032  12616 393772    0    0     0 12288 3129 6330  2 10 89  0  0
 1341306606.6744  25.8140     2.0002 --  1  0      0 15458316  12624 438004    0    0     0    26 3471 7107  2 11 87  0  0
 1341306608.6746  27.8142     2.0002 --  1  0      0 15432024  12624 474244    0    0     0 12416 3011 6017  1 10 89  0  0
 1341306610.6749  29.8145     2.0003 --  2  0      0 15343280  12624 517696    0    0     0    24 3393 6826  2 11 87  0  0
 1341306612.6751  31.8148     2.0002 --  1  0      0 15311136  12632 551816    0    0     0 16502 2818 5653  2  9 88  1  0
 1341306614.6754  33.8151     2.0003 --  1  0      0 15220648  12632 594936    0    0     0  3584 3451 6779  2 11 87  0  0
 1341306616.6755  35.8152     2.0001 --  4  0      0 15221252  12632 649296    0    0     0 38559 4846 8709  2 15 78  6  0
 1341306618.6758  37.8155     2.0003 --  1  0      0 15177724  12640 668476    0    0    20 40679 2204 4067  1  5 89  5  0
 1341306620.6761  39.8158     2.0003 --  1  0      0 15090204  12640 711752    0    0     0     0 3316 6788  2 11 88  0  0
 1341306622.6764  41.8160     2.0003 --  1  0      0 15005356  12640 748532    0    0     0 12288 3073 6132  2 10 89  0  0
 1341306624.6766  43.8163     2.0002 --  2  0      0 14913088  12648 791952    0    0     0    28 3408 6806  2 11 87  0  0
 1341306626.6769  45.8166     2.0003 --  1  0      0 14891512  12648 826328    0    0     0 12420 2906 5710  1  9 90  0  0
 1341306628.6772  47.8168     2.0003 --  1  0      0 14794316  12656 868936    0    0     0    26 3367 6798  2 11 87  0  0
 1341306630.6774  49.8171     2.0003 --  1  0      0 14769188  12656 905016    0    0    30 12324 3029 5876  2 10 89  0  0
 1341306632.6777  51.8173     2.0002 --  1  0      0 14679544  12656 947712    0    0     0     0 3399 6868  2 11 87  0  0
 1341306634.6780  53.8176     2.0003 --  1  0      0 14646156  12664 982032    0    0     0 14658 2987 5761  1 10 89  0  0
 1341306636.6782  55.8179     2.0003 --  1  0      0 14560504  12664 1023816    0    0     0  4404 3454 6876  2 11 87  0  0
 1341306638.6783  57.8180     2.0001 --  2  0      0 14533384  12664 1056812    0    0     0 15810 3002 5581  1 10 89  0  0
 1341306640.6785  59.8182     2.0002 --  1  0      0 14593332  12672 1027392    0    0     0 31790 3504 1811  1 13 78  8  0
 1341306642.6787  61.8183     2.0001 --  1  0      0 14686968  12672 1007604    0    0     0 14621 2434 1248  1 10 89  0  0
 1341306644.6789  63.8185     2.0002 --  1  1      0 15042476  12680 788104    0    0     0 36564 2809 1484  1 12 86  1  0
 1341306646.6790  65.8187     2.0002 --  1  0      0 15128292  12680 757948    0    0     0 26395 3050 1313  1 13 86  1  0
 1341306648.6792  67.8189     2.0002 --  1  0      0 15160036  12680 727964    0    0     0  5463 2752  910  1 12 87  0  0
 1341306650.6795  69.8192     2.0003 --  0  0      0 15633256  12688 332572    0    0  1156 12308 2117 2346  1  7 91  1  0
 1341306652.6797  71.8194     2.0002 --  0  0      0 15633892  12688 332652    0    0     0     0  224  758  0  0 100  0  0
 1341306654.6800  73.8197     2.0003 --  0  0      0 15633900  12688 332524    0    0     0     0  231 1009  0  0 100  0  0
 1341306656.6803  75.8199     2.0003 --  0  0      0 15637436  12696 332504    0    0     0    38  266  713  0  0 99  0  0
 1341306658.6805  77.8202     2.0003 --  0  0      0 15654180  12696 332352    0    0     0     0  270  821  0  0 100  0  0

REVERT-XFS
 1341307733.8702   1.7941     1.7941 --  0  0      0 16050640  12036 103996    0    0   372     0  216  752  0  0 99  1  0
 1341307735.8704   3.7944     2.0002 --  0  0      0 16050864  12036 104028    0    0     0     0  132  857  0  0 100  0  0
 1341307737.8707   5.7946     2.0002 --  0  0      0 16047492  12048 104252    0    0   102    37  255  938  0  0 99  1  0
 1341307739.8709   7.7949     2.0003 --  0  0      0 16047600  12072 104324    0    0    32     2  129  658  0  0 100  0  0
 1341307741.8712   9.7951     2.0002 --  1  1      0 16046676  12080 104328    0    0     0    32  165  729  0  0 100  0  0
 1341307743.8714  11.7954     2.0003 --  0  1      0 15990840  13216 142612    0    0 19422    30 1467 3015  2  1 89  8  0
 1341307745.8717  13.7956     2.0002 --  0  0      0 15825496  13216 226396    0    0  1310 11214 2217 1348  2  8 89  1  0
 1341307747.8717  15.7957     2.0001 --  1  0      0 15677816  13224 314672    0    0     4 15294 2307 1173  2  9 89  0  0
 1341307749.8719  17.7959     2.0002 --  1  0      0 15524372  13224 409728    0    0     0 12288 2466  888  1 10 89  0  0
 1341307751.8721  19.7960     2.0002 --  1  0      0 15368424  13224 502552    0    0     0 12416 2312  878  1 10 89  0  0
 1341307753.8722  21.7962     2.0002 --  1  0      0 15225216  13232 593092    0    0     0 12448 2539 1380  1 10 88  0  0
 1341307755.8724  23.7963     2.0002 --  2  0      0 15163712  13232 664768    0    0     0 32160 2184 1177  1  8 90  0  0
 1341307757.8727  25.7967     2.0003 --  1  0      0 14973888  13240 755080    0    0     0 12316 2482 1219  1 10 89  0  0
 1341307759.8728  27.7968     2.0001 --  1  0      0 14883580  13240 840036    0    0     0 44471 2711 1234  2 10 88  0  0
 1341307761.8730  29.7970     2.0002 --  1  0      0 14800304  13240 920504    0    0     0 42554 2571 1050  1 10 89  0  0
 1341307763.8734  31.7973     2.0003 --  0  0      0 14642504  13248 995004    0    0     0  3232 2276 1081  1  8 90  0  0
 1341307765.8737  33.7976     2.0003 --  1  0      0 14545072  13248 1052536    0    0     0 18688 2628 1114  1  9 89  0  0
 1341307767.8739  35.7979     2.0003 --  1  0      0 14783848  13248 926824    0    0     0 59559 2409 1308  0 10 89  1  0
 1341307769.8740  37.7980     2.0001 --  2  0      0 14854800  13256 896832    0    0     0  9172 2419 1004  1 10 89  1  0
 1341307771.8742  39.7981     2.0002 --  2  0      0 14835084  13256 875612    0    0     0 12288 2524  812  0 11 89  0  0
 1341307773.8743  41.7983     2.0002 --  2  0      0 15126252  13256 745844    0    0     0 10297 2714 1163  1 12 88  0  0
 1341307775.8745  43.7985     2.0002 --  1  0      0 15108800  13264 724544    0    0     0 12316 2499  931  1 11 88  0  0
 1341307777.8746  45.7986     2.0001 --  2  0      0 15226236  13264 694580    0    0     0 12416 2700 1194  1 12 88  0  0
 1341307779.8750  47.7989     2.0003 --  1  0      0 15697632  13264 300716    0    0  1156     0  934 1701  0  2 96  1  0
 1341307781.8752  49.7992     2.0003 --  0  0      0 15697508  13272 300720    0    0     0    66  166  641  0  0 100  0  0
 1341307783.8755  51.7995     2.0003 --  0  0      0 15699008  13272 300524    0    0     0     0  248  865  0  0 100  0  0
 1341307785.8758  53.7997     2.0003 --  0  0      0 15702452  13272 300520    0    0     0     0  285  960  0  0 99  0  0
 1341307787.8760  55.7999     2.0002 --  0  0      0 15719404  13280 300436    0    0     0    26  136  590  0  0 99  0  0

Vanilla average context switch rate	4278.53
Revert average context switch rate	1095
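
(The averages are just the mean of the cs column, i.e. the 12th vmstat
field after the -- marker; mechanically something like the awk below,
with vmstat.log as a stand-in filename and give or take exactly which
samples are included:)

awk -F'-- ' 'NF > 1 { split($2, f, " "); sum += f[12]; cnt++ }
             END { if (cnt) printf "%.2f\n", sum / cnt }' vmstat.log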

> Can you run latencytop to see
> if there are excessive starvation/wait times for allocation
> completion?

I'm not sure what format you are looking for.  latencytop is shit for
capturing information throughout a test and it does not easily allow you
to record a snapshot of a run. You can record all the console output of
course, but that's a complete mess. I tried capturing /proc/latency_stats
over time instead because that can be trivially sorted on a system-wide
basis, but as I write this I find that latency_stats was bust. It was
just spitting out

Latency Top version : v0.1

and nothing else.  Either latency_stats is broken or my config is. Not sure
which it is right now and won't get enough time on this today to pinpoint it.
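
One thing worth checking: /proc/latency_stats only accumulates entries
while collection is enabled via sysctl (the latencytop tool normally
switches this on itself while it runs), so assuming CONFIG_LATENCYTOP=y,
something like this should produce data:

echo 1 > /proc/sys/kernel/latencytop
# ... run the workload ...
cat /proc/latency_stats		# should now show entries, not just the header
echo 0 > /proc/sys/kernel/latencytop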

> A perf top profile comparison might be informative,
> too...
> 

I'm not sure if this is what you really wanted. I thought an oprofile or
perf report would have made more sense, but I recorded perf top over time
anyway and it's at the end of the mail.  The timestamp information is poor
because the perf top output was buffered, so a bunch of updates would
arrive at once. Each sample should be roughly 2 seconds apart. The
buffering can be dealt with; I just failed to do it in advance and I do
not think it's necessary to rerun the tests for it.
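
(If the capture ever needs redoing, line-buffering it along these lines
should fix the timestamps; stdbuf from coreutils is assumed to be
available:)

stdbuf -oL perf top --stdio -d 2 2>&1 | while IFS= read -r line; do
	case "$line" in
	*PerfTop:*) echo "time: $(date +%s)" ;;	# stamp each refresh
	esac
	printf '%s\n' "$line"
done > perf-top.log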

> (*) The stack usage below submit_bio() can be more than 5k (DM, MD,
> SCSI, driver, memory allocation), so it's really not safe to do
> allocation once more than about 3k of kernel stack is already in use. e.g.
> on a relatively trivial storage setup without the above commit:
> 
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
> 
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.
> 

I understand the patch's motivation. For these tests I'm being deliberately
a bit of a dummy and just capturing information. This might allow me to
actually get through all the results, identify some of the problems and
spread them around a bit. Either that or I need to clone myself a few
times to tackle each of the problems in a reasonable timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to
http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results with no monitor can be found somewhere like this:

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are in run-monitor. You
can read the iostat logs, for example, from:

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.

This is perf top over time for the vanilla kernel

time: 1341306570

time: 1341306579
   PerfTop:       1 irqs/sec  kernel: 0.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    61.85%  [kernel]        [k] __rmqueue  
    38.15%  libc-2.11.3.so  [.] _IO_vfscanf

time: 1341306579
   PerfTop:       3 irqs/sec  kernel:66.7%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    19.88%  [kernel]        [k] _raw_spin_lock_irqsave  
    17.14%  [kernel]        [k] __rmqueue               
    16.96%  [kernel]        [k] format_decode           
    15.37%  libc-2.11.3.so  [.] __tzfile_compute        
    13.55%  [kernel]        [k] copy_user_generic_string
    10.57%  libc-2.11.3.so  [.] _IO_vfscanf             
     6.53%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:       0 irqs/sec  kernel:-nan%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    17.51%  [kernel]        [k] _raw_spin_lock_irqsave  
    15.10%  [kernel]        [k] __rmqueue               
    14.94%  [kernel]        [k] format_decode           
    13.54%  libc-2.11.3.so  [.] __tzfile_compute        
    11.94%  [kernel]        [k] copy_user_generic_string
    11.90%  [kernel]        [k] _raw_spin_lock          
     9.31%  libc-2.11.3.so  [.] _IO_vfscanf             
     5.75%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:      41 irqs/sec  kernel:58.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    13.62%  [kernel]          [k] _raw_spin_lock_irqsave   
    11.02%  [kernel]          [k] __rmqueue                
    10.91%  [kernel]          [k] format_decode            
     9.89%  libc-2.11.3.so    [.] __tzfile_compute         
     8.72%  [kernel]          [k] copy_user_generic_string 
     8.69%  [kernel]          [k] _raw_spin_lock           
     7.15%  libc-2.11.3.so    [.] _IO_vfscanf              
     4.20%  [kernel]          [k] find_first_bit           
     1.47%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.37%  libc-2.11.3.so    [.] __strchr_sse42           
     1.19%  sed               [.] 0x0000000000009f7d       
     0.90%  libc-2.11.3.so    [.] vfprintf                 
     0.84%  [kernel]          [k] hrtimer_interrupt        
     0.84%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.76%  [kernel]          [k] enqueue_entity           
     0.66%  [kernel]          [k] __switch_to              
     0.65%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.62%  [kernel]          [k] do_vfs_ioctl             
     0.59%  [kernel]          [k] perf_event_mmap_event    
     0.56%  gzip              [.] 0x0000000000007b96       
     0.55%  libc-2.11.3.so    [.] bsearch                  

time: 1341306579
   PerfTop:      35 irqs/sec  kernel:62.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.50%  [kernel]          [k] _raw_spin_lock_irqsave   
     9.22%  [kernel]          [k] __rmqueue                
     9.13%  [kernel]          [k] format_decode            
     8.27%  libc-2.11.3.so    [.] __tzfile_compute         
     7.92%  [kernel]          [k] copy_user_generic_string 
     7.74%  [kernel]          [k] _raw_spin_lock           
     6.21%  libc-2.11.3.so    [.] _IO_vfscanf              
     3.51%  [kernel]          [k] find_first_bit           
     1.44%  gzip              [.] 0x0000000000007b96       
     1.23%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.15%  libc-2.11.3.so    [.] __strchr_sse42           
     1.06%  libc-2.11.3.so    [.] vfprintf                 
     0.99%  sed               [.] 0x0000000000009f7d       
     0.92%  [unknown]         [.] 0x00007f84a7766b99       
     0.70%  [kernel]          [k] hrtimer_interrupt        
     0.70%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.64%  [kernel]          [k] enqueue_entity           
     0.58%  libtcl8.5.so      [.] 0x000000000006fe86       
     0.55%  [kernel]          [k] __switch_to              
     0.54%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.53%  [kernel]          [k] __d_lookup_rcu           

time: 1341306585
   PerfTop:     100 irqs/sec  kernel:59.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     8.61%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.92%  [kernel]          [k] __rmqueue                      
     5.86%  [kernel]          [k] format_decode                  
     5.31%  libc-2.11.3.so    [.] __tzfile_compute               
     5.30%  [kernel]          [k] copy_user_generic_string       
     5.27%  [kernel]          [k] _raw_spin_lock                 
     3.99%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.45%  [unknown]         [.] 0x00007f84a7766b99             
     2.26%  [kernel]          [k] find_first_bit                 
     1.68%  [kernel]          [k] page_fault                     
     1.45%  libc-2.11.3.so    [.] _int_malloc                    
     1.28%  gzip              [.] 0x0000000000007b96             
     1.13%  libc-2.11.3.so    [.] vfprintf                       
     1.06%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.02%  perl              [.] 0x0000000000044505             
     0.79%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.79%  [kernel]          [k] do_task_stat                   
     0.77%  [kernel]          [k] zap_pte_range                  
     0.72%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.70%  libc-2.11.3.so    [.] malloc                         
     0.70%  libc-2.11.3.so    [.] __mbrtowc                      

time: 1341306585
   PerfTop:      19 irqs/sec  kernel:78.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.97%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.48%  [kernel]          [k] __rmqueue                      
     5.43%  [kernel]          [k] format_decode                  
     5.24%  [kernel]          [k] copy_user_generic_string       
     5.18%  [kernel]          [k] _raw_spin_lock                 
     4.92%  libc-2.11.3.so    [.] __tzfile_compute               
     4.25%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.33%  [unknown]         [.] 0x00007f84a7766b99             
     2.12%  [kernel]          [k] page_fault                     
     2.09%  [kernel]          [k] find_first_bit                 
     1.34%  libc-2.11.3.so    [.] _int_malloc                    
     1.19%  gzip              [.] 0x0000000000007b96             
     1.05%  libc-2.11.3.so    [.] vfprintf                       
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.94%  perl              [.] 0x0000000000044505             
     0.94%  libc-2.11.3.so    [.] _dl_addr                       
     0.91%  [kernel]          [k] zap_pte_range                  
     0.74%  [kernel]          [k] s_show                         
     0.73%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.73%  [kernel]          [k] do_task_stat                   
     0.67%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal

time: 1341306585
   PerfTop:      38 irqs/sec  kernel:68.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     4.89%  [kernel]          [k] _raw_spin_lock             
     4.77%  [kernel]          [k] __rmqueue                  
     4.72%  [kernel]          [k] format_decode              
     4.56%  [kernel]          [k] copy_user_generic_string   
     4.53%  libc-2.11.3.so    [.] _IO_vfscanf                
     4.28%  libc-2.11.3.so    [.] __tzfile_compute           
     2.52%  [unknown]         [.] 0x00007f84a7766b99         
no symbols found in /bin/sort, maybe install a debug package?
     2.10%  [kernel]          [k] page_fault                 
     1.82%  [kernel]          [k] find_first_bit             
     1.31%  libc-2.11.3.so    [.] _int_malloc                
     1.14%  libc-2.11.3.so    [.] vfprintf                   
     1.08%  libc-2.11.3.so    [.] _dl_addr                   
     1.07%  [kernel]          [k] s_show                     
     1.05%  libc-2.11.3.so    [.] __strchr_sse42             
     1.03%  gzip              [.] 0x0000000000007b96         
     0.82%  [kernel]          [k] do_task_stat               
     0.82%  perl              [.] 0x0000000000044505         
     0.79%  [kernel]          [k] zap_pte_range              
     0.70%  [kernel]          [k] seq_put_decimal_ull        
     0.69%  [kernel]          [k] find_busiest_group         

time: 1341306591
   PerfTop:      66 irqs/sec  kernel:59.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.52%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.11%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.91%  [kernel]          [k] _raw_spin_lock                 
     3.50%  [kernel]          [k] copy_user_generic_string       
     3.41%  [kernel]          [k] __rmqueue                      
     3.38%  [kernel]          [k] format_decode                  
     3.06%  libc-2.11.3.so    [.] __tzfile_compute               
     2.90%  [unknown]         [.] 0x00007f84a7766b99             
     2.30%  [kernel]          [k] page_fault                     
     2.20%  perl              [.] 0x0000000000044505             
     1.83%  libc-2.11.3.so    [.] vfprintf                       
     1.61%  libc-2.11.3.so    [.] _int_malloc                    
     1.30%  [kernel]          [k] find_first_bit                 
     1.22%  libc-2.11.3.so    [.] _dl_addr                       
     1.19%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.10%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.01%  [kernel]          [k] zap_pte_range                  
     0.99%  [kernel]          [k] s_show                         
     0.98%  [kernel]          [k] __percpu_counter_add           
     0.86%  [kernel]          [k] __strnlen_user                 
     0.75%  ld-2.11.3.so      [.] do_lookup_x                    

time: 1341306591
   PerfTop:      39 irqs/sec  kernel:69.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.26%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.05%  [kernel]          [k] _raw_spin_lock                 
     3.86%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.21%  [kernel]          [k] copy_user_generic_string       
     3.03%  [kernel]          [k] __rmqueue                      
     3.00%  [kernel]          [k] format_decode                  
     2.93%  [unknown]         [.] 0x00007f84a7766b99             
     2.72%  libc-2.11.3.so    [.] __tzfile_compute               
     2.20%  [kernel]          [k] page_fault                     
     1.96%  perl              [.] 0x0000000000044505             
     1.77%  libc-2.11.3.so    [.] vfprintf                       
     1.43%  libc-2.11.3.so    [.] _int_malloc                    
     1.16%  [kernel]          [k] find_first_bit                 
     1.09%  libc-2.11.3.so    [.] _dl_addr                       
     1.06%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.02%  [kernel]          [k] s_show                         
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.93%  gzip              [.] 0x0000000000007b96             
     0.90%  [kernel]          [k] zap_pte_range                  
     0.87%  [kernel]          [k] __percpu_counter_add           
     0.76%  [kernel]          [k] __strnlen_user                 

time: 1341306591
   PerfTop:     185 irqs/sec  kernel:70.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     4.81%  [kernel]          [k] _raw_spin_lock_irqsave         
     3.60%  [unknown]         [.] 0x00007f84a7766b99             
     3.10%  [kernel]          [k] _raw_spin_lock                 
     3.04%  [kernel]          [k] page_fault                     
     2.66%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.14%  [kernel]          [k] copy_user_generic_string       
     2.11%  [kernel]          [k] format_decode                  
     1.96%  [kernel]          [k] __rmqueue                      
     1.86%  libc-2.11.3.so    [.] _dl_addr                       
     1.76%  libc-2.11.3.so    [.] __tzfile_compute               
     1.26%  perl              [.] 0x0000000000044505             
     1.19%  libc-2.11.3.so    [.] __mbrtowc                      
     1.14%  libc-2.11.3.so    [.] vfprintf                       
     1.12%  libc-2.11.3.so    [.] _int_malloc                    
     1.09%  gzip              [.] 0x0000000000007b96             
     0.95%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.88%  [kernel]          [k] _raw_spin_unlock_irqrestore    
     0.87%  [kernel]          [k] __strnlen_user                 
     0.82%  [kernel]          [k] clear_page_c                   
     0.77%  [kernel]          [k] __schedule                     
     0.76%  [kernel]          [k] find_get_page                  

time: 1341306595
   PerfTop:     385 irqs/sec  kernel:48.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    27.20%  cc1               [.] 0x0000000000210978         
     3.01%  [unknown]         [.] 0x00007f84a7766b99         
     2.18%  [kernel]          [k] page_fault                 
     1.96%  libbfd-2.21.so    [.] 0x00000000000b9cdd         
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.91%  ld.bfd            [.] 0x000000000000e3b9         
     1.85%  [kernel]          [k] _raw_spin_lock             
     1.31%  [kernel]          [k] copy_user_generic_string   
     1.20%  libbfd-2.21.so    [.] bfd_hash_lookup            
     1.10%  libc-2.11.3.so    [.] __strcmp_sse42             
     0.93%  libc-2.11.3.so    [.] _IO_vfscanf                
     0.85%  [kernel]          [k] _raw_spin_unlock_irqrestore
     0.82%  libc-2.11.3.so    [.] _int_malloc                
     0.80%  [kernel]          [k] __rmqueue                  
     0.79%  [kernel]          [k] kmem_cache_alloc           
     0.74%  [kernel]          [k] format_decode              
     0.71%  libc-2.11.3.so    [.] _dl_addr                   
     0.62%  libbfd-2.21.so    [.] _bfd_final_link_relocate   
     0.61%  libc-2.11.3.so    [.] __tzfile_compute           
     0.61%  libc-2.11.3.so    [.] vfprintf                   
     0.59%  [kernel]          [k] find_busiest_group         

time: 1341306595
   PerfTop:    1451 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     9.75%  cc1               [.] 0x0000000000210978         
     8.81%  [unknown]         [.] 0x00007f84a7766b99         
     4.62%  [kernel]          [k] page_fault                 
     3.61%  [kernel]          [k] _raw_spin_lock             
     2.67%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     2.00%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [xfs]             [k] _xfs_buf_find              
     1.31%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.16%  [kernel]          [k] kmem_cache_free            
     1.15%  [xfs]             [k] xfs_next_bit               
     0.98%  [kernel]          [k] __d_lookup                 
     0.89%  [xfs]             [k] xfs_da_do_buf              
     0.83%  [kernel]          [k] memset                     
     0.80%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.79%  [kernel]          [k] link_path_walk             
     0.76%  [xfs]             [k] xfs_buf_item_size          
no symbols found in /usr/bin/tee, maybe install a debug package?
no symbols found in /bin/date, maybe install a debug package?
     0.73%  [xfs]             [k] xfs_buf_offset             
     0.71%  [kernel]          [k] __kmalloc                  
     0.70%  [kernel]          [k] kfree                      
     0.70%  libbfd-2.21.so    [.] 0x00000000000b9cdd         

time: 1341306601
   PerfTop:    1267 irqs/sec  kernel:85.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.81%  [unknown]         [.] 0x00007f84a7766b99         
     5.98%  cc1               [.] 0x0000000000210978         
     5.20%  [kernel]          [k] page_fault                 
     3.54%  [kernel]          [k] _raw_spin_lock             
     3.37%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.75%  [xfs]             [k] _xfs_buf_find              
     1.35%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.28%  [xfs]             [k] xfs_next_bit               
     1.14%  [kernel]          [k] kmem_cache_free            
     1.13%  [kernel]          [k] __kmalloc                  
     1.12%  [kernel]          [k] __d_lookup                 
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.96%  [xfs]             [k] xfs_buf_offset             
     0.95%  [kernel]          [k] memset                     
     0.91%  [kernel]          [k] link_path_walk             
     0.88%  [xfs]             [k] xfs_da_do_buf              
     0.85%  [kernel]          [k] kfree                      
     0.84%  [xfs]             [k] xfs_buf_item_size          
     0.74%  [xfs]             [k] xfs_btree_lookup           

time: 1341306601
   PerfTop:    1487 irqs/sec  kernel:85.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.84%  [unknown]         [.] 0x00007f84a7766b99         
     5.15%  [kernel]          [k] page_fault                 
     3.93%  cc1               [.] 0x0000000000210978         
     3.76%  [kernel]          [k] _raw_spin_lock             
     3.50%  [kernel]          [k] memcpy                     
     2.13%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [xfs]             [k] _xfs_buf_find              
     1.79%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.52%  [kernel]          [k] __kmalloc                  
     1.33%  [kernel]          [k] kmem_cache_free            
     1.32%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.29%  [kernel]          [k] __d_lookup                 
     1.27%  [xfs]             [k] xfs_next_bit               
     1.11%  [kernel]          [k] link_path_walk             
     1.01%  [xfs]             [k] xfs_buf_offset             
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_da_do_buf              
     0.97%  [kernel]          [k] kfree                      
     0.96%  [kernel]          [k] memset                     
     0.84%  [xfs]             [k] xfs_btree_lookup           
     0.82%  [xfs]             [k] xfs_buf_item_format        

time: 1341306601
   PerfTop:    1291 irqs/sec  kernel:85.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.21%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] page_fault                 
     3.83%  [kernel]          [k] _raw_spin_lock             
     3.67%  [kernel]          [k] memcpy                     
     2.92%  cc1               [.] 0x0000000000210978         
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] _xfs_buf_find              
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.43%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_next_bit               
     1.13%  [xfs]             [k] xfs_buf_offset             
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_da_do_buf              
     1.01%  [kernel]          [k] memset                     
     1.01%  [kernel]          [k] kfree                      
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.87%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1435 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.06%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     3.88%  [kernel]          [k] _raw_spin_lock             
     3.83%  [kernel]          [k] memcpy                     
     2.41%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     2.19%  cc1               [.] 0x0000000000210978         
     1.68%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.55%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] kmem_cache_free            
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.37%  [xfs]             [k] xfs_next_bit               
     1.27%  [xfs]             [k] xfs_buf_offset             
     1.12%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] kfree                      
     1.08%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_da_do_buf              
     0.99%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_buf_item_size          
     0.89%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1281 irqs/sec  kernel:87.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.44%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] _raw_spin_lock             
     3.94%  [kernel]          [k] memcpy                     
     2.51%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.75%  cc1               [.] 0x0000000000210978         
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [kernel]          [k] __d_lookup                 
     1.56%  [kernel]          [k] __kmalloc                  
     1.46%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] kmem_cache_free            
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.20%  [kernel]          [k] link_path_walk             
     1.16%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.89%  [xfs]             [k] xfs_buf_item_size          

time: 1341306607
   PerfTop:    1455 irqs/sec  kernel:86.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.14%  [unknown]         [.] 0x00007f84a7766b99         
     5.36%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] _raw_spin_lock             
     4.02%  [kernel]          [k] memcpy                     
     2.54%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.69%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.47%  [kernel]          [k] __d_lookup                 
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.39%  cc1               [.] 0x0000000000210978         
     1.37%  [kernel]          [k] kmem_cache_free            
     1.24%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] memset                     
     1.16%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_btree_lookup           

time: 1341306613
   PerfTop:    1245 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     4.10%  [kernel]          [k] _raw_spin_lock             
     4.06%  [kernel]          [k] memcpy                     
     2.74%  [xfs]             [k] _xfs_buf_find              
     2.40%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_next_bit               
     1.54%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] __d_lookup                 
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.41%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.22%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.15%  cc1               [.] 0x0000000000210978         
     1.02%  [xfs]             [k] xfs_buf_item_size          
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1433 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.04%  [unknown]         [.] 0x00007f84a7766b99         
     5.30%  [kernel]          [k] page_fault                 
     4.08%  [kernel]          [k] memcpy                     
     4.07%  [kernel]          [k] _raw_spin_lock             
     2.88%  [xfs]             [k] _xfs_buf_find              
     2.50%  [kernel]          [k] kmem_cache_alloc           
     1.72%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [xfs]             [k] xfs_next_bit               
     1.56%  [kernel]          [k] __d_lookup                 
     1.54%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.46%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.21%  [kernel]          [k] memset                     
     1.18%  [kernel]          [k] kfree                      
     1.04%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  cc1               [.] 0x0000000000210978         
     0.90%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1118 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

no symbols found in /usr/bin/vmstat, maybe install a debug package?
    12.03%  [unknown]         [.] 0x00007f84a7766b99         
     5.48%  [kernel]          [k] page_fault                 
     4.21%  [kernel]          [k] memcpy                     
     4.11%  [kernel]          [k] _raw_spin_lock             
     2.98%  [xfs]             [k] _xfs_buf_find              
     2.47%  [kernel]          [k] kmem_cache_alloc           
     1.81%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] kmem_cache_free            
     1.48%  [kernel]          [k] __d_lookup                 
     1.47%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.23%  [kernel]          [k] link_path_walk             
     1.19%  [kernel]          [k] memset                     
     1.19%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_buf_item_format        
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306617
   PerfTop:    1454 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.42%  [kernel]          [k] page_fault                 
     4.28%  [kernel]          [k] memcpy                     
     4.20%  [kernel]          [k] _raw_spin_lock             
     3.15%  [xfs]             [k] _xfs_buf_find              
     2.52%  [kernel]          [k] kmem_cache_alloc           
     1.76%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.51%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.48%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.29%  [kernel]          [k] memset                     
     1.20%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] kfree                      
     1.09%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_buf_item_format        

time: 1341306617
   PerfTop:    1758 irqs/sec  kernel:90.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.99%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] memcpy                     
     3.86%  [xfs]             [k] _xfs_buf_find              
     2.31%  [kernel]          [k] kmem_cache_alloc           
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [kernel]          [k] __d_lookup                 
     1.60%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.35%  [kernel]          [k] kmem_cache_free            
     1.17%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_perag_put              
     0.90%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1022 irqs/sec  kernel:88.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.34%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.01%  [kernel]          [k] memcpy                     
     4.01%  [xfs]             [k] _xfs_buf_find              
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.00%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.41%  [kernel]          [k] __kmalloc                  
     1.39%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] kfree                      
     1.15%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_perag_put              
     0.88%  [xfs]             [k] xfs_buf_item_size          
     0.86%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1430 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.24%  [kernel]          [k] _raw_spin_lock             
     4.89%  [kernel]          [k] page_fault                 
     4.13%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.54%  [xfs]             [k] xfs_next_bit               
     1.42%  [kernel]          [k] __kmalloc                  
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.10%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_buf_item_size          
     0.87%  [xfs]             [k] xfs_da_do_buf              
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306623
   PerfTop:    1267 irqs/sec  kernel:87.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.08%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.61%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.31%  [xfs]             [k] xfs_buf_offset             
no symbols found in /usr/bin/iostat, maybe install a debug package?
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_btree_lookup           
     0.95%  [xfs]             [k] xfs_buf_item_size          
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306629
   PerfTop:    1399 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.12%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.23%  [kernel]          [k] memcpy                     
     4.03%  [xfs]             [k] _xfs_buf_find              
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.96%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.69%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.45%  [kernel]          [k] __kmalloc                  
     1.35%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.17%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_btree_lookup           
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_buf_item_format        

time: 1341306629
   PerfTop:    1225 irqs/sec  kernel:87.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.15%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.22%  [kernel]          [k] memcpy                     
     4.19%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.71%  [kernel]          [k] __d_lookup                 
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.54%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.36%  [kernel]          [k] kmem_cache_free            
     1.28%  [xfs]             [k] xfs_buf_offset             
     1.14%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     

time: 1341306629
   PerfTop:    1400 irqs/sec  kernel:87.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.23%  [unknown]         [.] 0x00007f84a7766b99         
     5.07%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.27%  [xfs]             [k] _xfs_buf_find              
     4.18%  [kernel]          [k] memcpy                     
     2.31%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.73%  [kernel]          [k] __d_lookup                 
     1.66%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.49%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] kfree                      
     1.07%  [kernel]          [k] memset                     
     1.07%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     

time: 1341306635
   PerfTop:    1251 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.29%  [xfs]             [k] _xfs_buf_find              
     4.19%  [kernel]          [k] memcpy                     
     2.26%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.64%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.10%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.96%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1429 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.18%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.28%  [xfs]             [k] _xfs_buf_find              
     4.21%  [kernel]          [k] memcpy                     
     2.23%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.52%  [kernel]          [k] __kmalloc                  
     1.52%  [xfs]             [k] xfs_next_bit               
     1.36%  [kernel]          [k] kmem_cache_free            
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] link_path_walk             
     1.11%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [kernel]          [k] s_show                     
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1232 irqs/sec  kernel:88.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.11%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.33%  [xfs]             [k] _xfs_buf_find              
     4.16%  [kernel]          [k] memcpy                     
     2.24%  [kernel]          [k] kmem_cache_alloc           
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.82%  [kernel]          [k] __d_lookup                 
     1.65%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.34%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.13%  [kernel]          [k] link_path_walk             
     1.13%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.06%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1444 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.19%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.95%  [kernel]          [k] page_fault                 
     4.40%  [xfs]             [k] _xfs_buf_find              
     4.10%  [kernel]          [k] memcpy                     
     2.20%  [kernel]          [k] kmem_cache_alloc           
     1.93%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.81%  [kernel]          [k] __d_lookup                 
     1.59%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.48%  [kernel]          [k] __kmalloc                  
     1.37%  [kernel]          [k] kmem_cache_free            
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] memset                     
     1.12%  [kernel]          [k] s_show                     
     1.12%  [kernel]          [k] link_path_walk             
     1.10%  [kernel]          [k] kfree                      
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1195 irqs/sec  kernel:90.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.17%  [kernel]          [k] _raw_spin_lock             
     4.44%  [xfs]             [k] _xfs_buf_find              
     4.37%  [kernel]          [k] page_fault                 
     4.37%  [kernel]          [k] memcpy                     
     2.30%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.63%  [kernel]          [k] __d_lookup                 
     1.62%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.59%  [xfs]             [k] xfs_buf_offset             
     1.50%  [kernel]          [k] kmem_cache_free            
     1.50%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.33%  [kernel]          [k] memset                     
     1.28%  [kernel]          [k] kfree                      
     1.11%  [xfs]             [k] xfs_buf_item_size          
     1.07%  [kernel]          [k] s_show                     
     1.07%  [kernel]          [k] link_path_walk             
     0.93%  [xfs]             [k] xfs_btree_lookup           
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_dir2_node_addname_int  

time: 1341306645
   PerfTop:    1097 irqs/sec  kernel:95.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.51%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.63%  [kernel]          [k] memcpy                     
     4.51%  [xfs]             [k] _xfs_buf_find              
     3.32%  [kernel]          [k] page_fault                 
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [xfs]             [k] xfs_buf_offset             
     1.75%  [kernel]          [k] __kmalloc                  
     1.73%  [xfs]             [k] xfs_next_bit               
     1.65%  [kernel]          [k] memset                     
     1.60%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] kfree                      
     1.52%  [kernel]          [k] kmem_cache_free            
     1.44%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.26%  [kernel]          [k] __d_lookup                 
     1.26%  [xfs]             [k] xfs_buf_item_size          
     1.05%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_format        
     0.92%  [kernel]          [k] __d_lookup_rcu             
     0.87%  [kernel]          [k] link_path_walk             

time: 1341306645
   PerfTop:    1038 irqs/sec  kernel:95.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.89%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] memcpy                     
     4.60%  [xfs]             [k] _xfs_buf_find              
     4.52%  [kernel]          [k] _raw_spin_lock             
     2.60%  [kernel]          [k] page_fault                 
     2.42%  [kernel]          [k] kmem_cache_alloc           
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [xfs]             [k] xfs_next_bit               
     1.93%  [xfs]             [k] xfs_buf_offset             
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] memset                     
     1.80%  [kernel]          [k] kmem_cache_free            
     1.68%  [kernel]          [k] kfree                      
     1.47%  [xfs]             [k] xfs_buf_item_size          
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.18%  [kernel]          [k] __d_lookup_rcu             
     1.13%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.12%  [xfs]             [k] xfs_buf_item_format        
     1.04%  [kernel]          [k] s_show                     
     1.01%  [kernel]          [k] __d_lookup                 
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306645
   PerfTop:    1087 irqs/sec  kernel:96.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.27%  [kernel]          [k] memcpy                     
     4.77%  [unknown]         [.] 0x00007f84a7766b99         
     4.69%  [xfs]             [k] _xfs_buf_find              
     4.56%  [kernel]          [k] _raw_spin_lock             
     2.47%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] xfs_next_bit               
     2.11%  [kernel]          [k] page_fault                 
     2.00%  [xfs]             [k] xfs_buf_offset             
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [kernel]          [k] kmem_cache_free            
     1.85%  [kernel]          [k] kfree                      
     1.82%  [kernel]          [k] memset                     
     1.75%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_buf_item_size          
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.23%  [kernel]          [k] __d_lookup_rcu             
     1.21%  [xfs]             [k] xfs_buf_item_format        
     0.99%  [kernel]          [k] s_show                     
     0.97%  [xfs]             [k] xfs_perag_put              
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_trans_ail_cursor_first 

time: 1341306651
   PerfTop:    1157 irqs/sec  kernel:96.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.53%  [kernel]          [k] memcpy                     
     4.61%  [xfs]             [k] _xfs_buf_find              
     4.40%  [kernel]          [k] _raw_spin_lock             
     3.83%  [unknown]         [.] 0x00007f84a7766b99         
     2.67%  [kernel]          [k] kmem_cache_alloc           
     2.32%  [xfs]             [k] xfs_next_bit               
     2.21%  [kernel]          [k] __kmalloc                  
     2.21%  [xfs]             [k] xfs_buf_offset             
     2.19%  [kernel]          [k] kmem_cache_free            
     1.92%  [kernel]          [k] memset                     
     1.89%  [kernel]          [k] kfree                      
     1.80%  [xfs]             [k] xfs_buf_item_size          
     1.70%  [kernel]          [k] page_fault                 
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.50%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.30%  [xfs]             [k] xfs_buf_item_format        
     1.27%  [kernel]          [k] __d_lookup_rcu             
     0.97%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.93%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:    1073 irqs/sec  kernel:95.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.76%  [kernel]          [k] memcpy                     
     4.54%  [xfs]             [k] _xfs_buf_find              
     4.32%  [kernel]          [k] _raw_spin_lock             
     3.15%  [unknown]         [.] 0x00007f84a7766b99         
     2.77%  [kernel]          [k] kmem_cache_alloc           
     2.49%  [xfs]             [k] xfs_next_bit               
     2.36%  [kernel]          [k] __kmalloc                  
     2.27%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.88%  [kernel]          [k] memset                     
     1.88%  [kernel]          [k] kfree                      
     1.77%  [xfs]             [k] xfs_buf_item_size          
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.42%  [xfs]             [k] xfs_buf_item_format        
     1.40%  [kernel]          [k] page_fault                 
     1.39%  [kernel]          [k] __d_lookup_rcu             
     0.99%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.88%  [kernel]          [k] s_show                     
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:     492 irqs/sec  kernel:85.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.74%  [kernel]          [k] memcpy                     
     4.48%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     3.00%  [unknown]         [.] 0x00007f84a7766b99         
     2.76%  [kernel]          [k] kmem_cache_alloc           
     2.54%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
no symbols found in /bin/ps, maybe install a debug package?
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.44%  [xfs]             [k] xfs_buf_item_format        
     1.39%  [kernel]          [k] __d_lookup_rcu             
     1.36%  [kernel]          [k] page_fault                 
     0.99%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      70 irqs/sec  kernel:72.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.73%  [kernel]          [k] memcpy                     
     4.47%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.75%  [kernel]          [k] kmem_cache_alloc           
     2.53%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] __d_lookup_rcu             
     1.37%  [kernel]          [k] page_fault                 
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      87 irqs/sec  kernel:71.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.72%  [kernel]          [k] memcpy                     
     4.45%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.74%  [kernel]          [k] kmem_cache_alloc           
     2.52%  [xfs]             [k] xfs_next_bit               
     2.38%  [kernel]          [k] __kmalloc                  
     2.29%  [kernel]          [k] kmem_cache_free            
     2.19%  [xfs]             [k] xfs_buf_offset             
     1.95%  [kernel]          [k] kfree                      
     1.91%  [kernel]          [k] memset                     
     1.74%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] page_fault                 
     1.38%  [kernel]          [k] __d_lookup_rcu             
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.93%  [kernel]          [k] s_show                     
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      88 irqs/sec  kernel:68.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.69%  [kernel]          [k] memcpy                     
     4.42%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.98%  [unknown]         [.] 0x00007f84a7766b99         

-- 
Mel Gorman
SUSE Labs

* Re: [MMTests] IO metadata on XFS
@ 2012-07-03 10:59                 ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 10:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > Adding dri-devel and a few others because an i915 patch contributed to
> > the regression.
> > 
> > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > you have lots of CPU power, there will be little difference in
> > > > > performance...
> > > > 
> > > > When I checked it, it could only be called twice, and we'd already
> > > > optimize away the second call.  I'd definitely like to track down where
> > > > the performance changes happened, at least to a major version but even
> > > > better to a -rc or git commit.
> > > > 
> > > 
> > > By all means feel free to run the test yourself and run the bisection :)
> > > 
> > > It's rare but on this occasion the test machine is idle so I started an
> > > automated git bisection. As you know, the mileage with an automated bisect
> > > varies so it may or may not find the right commit. Test machine is sandy so
> > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > is the report of interest. The script is doing a full search between v3.3 and
> > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > apparently unrelated patch that caused the problem.
> > > 
> > 
> > It was obvious very quickly that there were two distinct regressions so I
> > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > that enables RC6 to reduce power usage.
> > 
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> 
> Doesn't seem to be the major cause of the regression. By itself, it
> has an impact, but the majority comes from the XFS change...
> 

The fact that it has an impact at all is weird, but let's see what the DRI
folks think about it.

> > [c999a223: xfs: introduce an allocation workqueue]
> 
> Which indicates that there are workqueue scheduling issues, I think.
> The same amount of work is being done, but half of it is being
> pushed off into a workqueue to avoid stack overflow issues (*).  I
> tested the above patch in anger on an 8p machine, similar to the
> machine you saw no regressions on, but the workload didn't drive it
> to being completely CPU bound (only about 90%) so the allocation
> work was probably always scheduled quickly.
> 

What test were you using?

> How many worker threads have been spawned on these machines
> that are showing the regression?

20 or 21 generally. An example list as spotted by top looks like

kworker/0:0        
kworker/0:1        
kworker/0:2        
kworker/1:0        
kworker/1:1        
kworker/1:2        
kworker/2:0        
kworker/2:1        
kworker/2:2        
kworker/3:0        
kworker/3:1        
kworker/3:2        
kworker/4:0        
kworker/4:1        
kworker/5:0        
kworker/5:1        
kworker/6:0        
kworker/6:1        
kworker/6:2        
kworker/7:0        
kworker/7:1

There were 8 unbound workers.
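
As an aside, a rough way of counting them at an instant is something like
this (a sketch; the naming convention for unbound workers varies a bit
between kernel versions):

  # Count kworker threads; kworker/N:M are per-CPU, kworker/u* are unbound.
  ps -eo comm= | awk '/^kworker\//  { total++ }
                      /^kworker\/u/ { unbound++ }
                      END { printf "total %d, unbound %d\n", total, unbound }'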

> What is the context switch rate on the machines when the test is running?

This is vmstat output from a vanilla kernel. The actual vmstat fields are
after the --. The information before that is recorded by mmtests to try
to detect jitter in the vmstat output. It shows that there is little or
no jitter in this test.
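
For reference, a timestamped log like this can be produced with something
along the following lines (a sketch of the idea, not the mmtests monitor
itself; it needs GNU date for %N and bc):

  # Prefix each vmstat sample with wall clock time, total elapsed time
  # and the inter-sample delta so interval jitter is visible directly.
  vmstat 2 | while IFS= read -r line; do
          case "$line" in
          procs*|*swpd*) continue ;;    # skip vmstat's two header lines
          esac
          now=$(date +%s.%4N)
          start=${start:-$now}
          delta=$(echo "$now - ${prev:-$now}" | bc)
          elapsed=$(echo "$now - $start" | bc)
          prev=$now
          echo "$now  $elapsed  $delta -- $line"
  done

With no jitter, the delta column stays at roughly 2.0 as it does below.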

VANILLA
 1341306582.6713   1.8109     1.8109 --  0  0      0 16050784  11448 104056    0    0   376     0  209  526  0  0 99  1  0
 1341306584.6715   3.8112     2.0003 --  1  0      0 16050628  11448 104064    0    0     0     0  121  608  0  0 100  0  0
 1341306586.6718   5.8114     2.0003 --  0  0      0 16047432  11460 104288    0    0   102    45  227  999  0  0 99  1  0
 1341306588.6721   7.8117     2.0003 --  1  0      0 16046944  11460 104292    0    0     0     0  120  663  0  0 100  0  0
 1341306590.6723   9.8119     2.0002 --  0  2      0 16045788  11476 104296    0    0    12    40  190  754  0  0 99  0  0
 1341306592.6725  11.8121     2.0002 --  0  1      0 15990236  12600 141724    0    0 19054    30 1400 2937  2  1 88  9  0
 1341306594.6727  13.8124     2.0002 --  1  0      0 15907628  12600 186360    0    0  1653     0 3117 6406  2  9 88  1  0
 1341306596.6730  15.8127     2.0003 --  0  0      0 15825964  12608 226636    0    0    15 11024 3073 6350  2  9 89  0  0
 1341306598.6733  17.8130     2.0003 --  1  0      0 15730420  12608 271632    0    0     0  3072 3461 7179  2 10 88  0  0
 1341306600.6736  19.8132     2.0003 --  1  0      0 15686200  12608 310816    0    0     0 12416 3093 6198  2  9 89  0  0
 1341306602.6738  21.8135     2.0003 --  2  0      0 15593588  12616 354928    0    0     0    32 3482 7146  2 11 87  0  0
 1341306604.6741  23.8138     2.0003 --  2  0      0 15562032  12616 393772    0    0     0 12288 3129 6330  2 10 89  0  0
 1341306606.6744  25.8140     2.0002 --  1  0      0 15458316  12624 438004    0    0     0    26 3471 7107  2 11 87  0  0
 1341306608.6746  27.8142     2.0002 --  1  0      0 15432024  12624 474244    0    0     0 12416 3011 6017  1 10 89  0  0
 1341306610.6749  29.8145     2.0003 --  2  0      0 15343280  12624 517696    0    0     0    24 3393 6826  2 11 87  0  0
 1341306612.6751  31.8148     2.0002 --  1  0      0 15311136  12632 551816    0    0     0 16502 2818 5653  2  9 88  1  0
 1341306614.6754  33.8151     2.0003 --  1  0      0 15220648  12632 594936    0    0     0  3584 3451 6779  2 11 87  0  0
 1341306616.6755  35.8152     2.0001 --  4  0      0 15221252  12632 649296    0    0     0 38559 4846 8709  2 15 78  6  0
 1341306618.6758  37.8155     2.0003 --  1  0      0 15177724  12640 668476    0    0    20 40679 2204 4067  1  5 89  5  0
 1341306620.6761  39.8158     2.0003 --  1  0      0 15090204  12640 711752    0    0     0     0 3316 6788  2 11 88  0  0
 1341306622.6764  41.8160     2.0003 --  1  0      0 15005356  12640 748532    0    0     0 12288 3073 6132  2 10 89  0  0
 1341306624.6766  43.8163     2.0002 --  2  0      0 14913088  12648 791952    0    0     0    28 3408 6806  2 11 87  0  0
 1341306626.6769  45.8166     2.0003 --  1  0      0 14891512  12648 826328    0    0     0 12420 2906 5710  1  9 90  0  0
 1341306628.6772  47.8168     2.0003 --  1  0      0 14794316  12656 868936    0    0     0    26 3367 6798  2 11 87  0  0
 1341306630.6774  49.8171     2.0003 --  1  0      0 14769188  12656 905016    0    0    30 12324 3029 5876  2 10 89  0  0
 1341306632.6777  51.8173     2.0002 --  1  0      0 14679544  12656 947712    0    0     0     0 3399 6868  2 11 87  0  0
 1341306634.6780  53.8176     2.0003 --  1  0      0 14646156  12664 982032    0    0     0 14658 2987 5761  1 10 89  0  0
 1341306636.6782  55.8179     2.0003 --  1  0      0 14560504  12664 1023816    0    0     0  4404 3454 6876  2 11 87  0  0
 1341306638.6783  57.8180     2.0001 --  2  0      0 14533384  12664 1056812    0    0     0 15810 3002 5581  1 10 89  0  0
 1341306640.6785  59.8182     2.0002 --  1  0      0 14593332  12672 1027392    0    0     0 31790 3504 1811  1 13 78  8  0
 1341306642.6787  61.8183     2.0001 --  1  0      0 14686968  12672 1007604    0    0     0 14621 2434 1248  1 10 89  0  0
 1341306644.6789  63.8185     2.0002 --  1  1      0 15042476  12680 788104    0    0     0 36564 2809 1484  1 12 86  1  0
 1341306646.6790  65.8187     2.0002 --  1  0      0 15128292  12680 757948    0    0     0 26395 3050 1313  1 13 86  1  0
 1341306648.6792  67.8189     2.0002 --  1  0      0 15160036  12680 727964    0    0     0  5463 2752  910  1 12 87  0  0
 1341306650.6795  69.8192     2.0003 --  0  0      0 15633256  12688 332572    0    0  1156 12308 2117 2346  1  7 91  1  0
 1341306652.6797  71.8194     2.0002 --  0  0      0 15633892  12688 332652    0    0     0     0  224  758  0  0 100  0  0
 1341306654.6800  73.8197     2.0003 --  0  0      0 15633900  12688 332524    0    0     0     0  231 1009  0  0 100  0  0
 1341306656.6803  75.8199     2.0003 --  0  0      0 15637436  12696 332504    0    0     0    38  266  713  0  0 99  0  0
 1341306658.6805  77.8202     2.0003 --  0  0      0 15654180  12696 332352    0    0     0     0  270  821  0  0 100  0  0

REVERT-XFS
 1341307733.8702   1.7941     1.7941 --  0  0      0 16050640  12036 103996    0    0   372     0  216  752  0  0 99  1  0
 1341307735.8704   3.7944     2.0002 --  0  0      0 16050864  12036 104028    0    0     0     0  132  857  0  0 100  0  0
 1341307737.8707   5.7946     2.0002 --  0  0      0 16047492  12048 104252    0    0   102    37  255  938  0  0 99  1  0
 1341307739.8709   7.7949     2.0003 --  0  0      0 16047600  12072 104324    0    0    32     2  129  658  0  0 100  0  0
 1341307741.8712   9.7951     2.0002 --  1  1      0 16046676  12080 104328    0    0     0    32  165  729  0  0 100  0  0
 1341307743.8714  11.7954     2.0003 --  0  1      0 15990840  13216 142612    0    0 19422    30 1467 3015  2  1 89  8  0
 1341307745.8717  13.7956     2.0002 --  0  0      0 15825496  13216 226396    0    0  1310 11214 2217 1348  2  8 89  1  0
 1341307747.8717  15.7957     2.0001 --  1  0      0 15677816  13224 314672    0    0     4 15294 2307 1173  2  9 89  0  0
 1341307749.8719  17.7959     2.0002 --  1  0      0 15524372  13224 409728    0    0     0 12288 2466  888  1 10 89  0  0
 1341307751.8721  19.7960     2.0002 --  1  0      0 15368424  13224 502552    0    0     0 12416 2312  878  1 10 89  0  0
 1341307753.8722  21.7962     2.0002 --  1  0      0 15225216  13232 593092    0    0     0 12448 2539 1380  1 10 88  0  0
 1341307755.8724  23.7963     2.0002 --  2  0      0 15163712  13232 664768    0    0     0 32160 2184 1177  1  8 90  0  0
 1341307757.8727  25.7967     2.0003 --  1  0      0 14973888  13240 755080    0    0     0 12316 2482 1219  1 10 89  0  0
 1341307759.8728  27.7968     2.0001 --  1  0      0 14883580  13240 840036    0    0     0 44471 2711 1234  2 10 88  0  0
 1341307761.8730  29.7970     2.0002 --  1  0      0 14800304  13240 920504    0    0     0 42554 2571 1050  1 10 89  0  0
 1341307763.8734  31.7973     2.0003 --  0  0      0 14642504  13248 995004    0    0     0  3232 2276 1081  1  8 90  0  0
 1341307765.8737  33.7976     2.0003 --  1  0      0 14545072  13248 1052536    0    0     0 18688 2628 1114  1  9 89  0  0
 1341307767.8739  35.7979     2.0003 --  1  0      0 14783848  13248 926824    0    0     0 59559 2409 1308  0 10 89  1  0
 1341307769.8740  37.7980     2.0001 --  2  0      0 14854800  13256 896832    0    0     0  9172 2419 1004  1 10 89  1  0
 1341307771.8742  39.7981     2.0002 --  2  0      0 14835084  13256 875612    0    0     0 12288 2524  812  0 11 89  0  0
 1341307773.8743  41.7983     2.0002 --  2  0      0 15126252  13256 745844    0    0     0 10297 2714 1163  1 12 88  0  0
 1341307775.8745  43.7985     2.0002 --  1  0      0 15108800  13264 724544    0    0     0 12316 2499  931  1 11 88  0  0
 1341307777.8746  45.7986     2.0001 --  2  0      0 15226236  13264 694580    0    0     0 12416 2700 1194  1 12 88  0  0
 1341307779.8750  47.7989     2.0003 --  1  0      0 15697632  13264 300716    0    0  1156     0  934 1701  0  2 96  1  0
 1341307781.8752  49.7992     2.0003 --  0  0      0 15697508  13272 300720    0    0     0    66  166  641  0  0 100  0  0
 1341307783.8755  51.7995     2.0003 --  0  0      0 15699008  13272 300524    0    0     0     0  248  865  0  0 100  0  0
 1341307785.8758  53.7997     2.0003 --  0  0      0 15702452  13272 300520    0    0     0     0  285  960  0  0 99  0  0
 1341307787.8760  55.7999     2.0002 --  0  0      0 15719404  13280 300436    0    0     0    26  136  590  0  0 99  0  0

Vanilla average context switch rate	4278.53
Revert average context switch rate	1095
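
Those figures can be reproduced with something like the following over the
logs above; with the timestamp/elapsed/delta prefix, the vmstat cs column
lands in field 16. The file name is only for illustration.

  # Data lines are the ones with "--" as the fourth field.
  awk '$4 == "--" { sum += $16; n++ }
       END { if (n) printf "%.2f\n", sum / n }' vmstat-vanilla.log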

> Can you run latencytop to see
> if there are excessive starvation/wait times for allocation
> completion?

I'm not sure what format you are looking for.  latencytop is shit for
capturing information throughout a test and does not easily allow you to
record a snapshot of a test run. You can record all the console output of
course, but that's a complete mess. I tried capturing /proc/latency_stats
over time instead because that can be trivially sorted on a system-wide
basis, but as I write this I find that latency_stats was bust. It was just
spitting out

Latency Top version : v0.1

and nothing else.  Either latency_stats is broken or my config is. I'm not
sure which it is right now and I won't get enough time on this today to
pinpoint it.
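
For what it's worth, the snapshot approach would be something like the
following (a sketch, assuming CONFIG_LATENCYTOP=y; as far as I know a
write to latency_stats resets the counters):

  echo 1 > /proc/sys/kernel/latencytop    # enable collection
  echo clear > /proc/latency_stats        # a write resets the counters
  while true; do
          echo "time: $(date +%s)"
          cat /proc/latency_stats
          sleep 10
  done > latency_stats.log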

> A perf top profile comparison might be informative,
> too...
> 

I'm not sure if this is what you really wanted. I thought an oprofile or
perf report would have made more sense, but I recorded perf top over time
anyway and it is at the end of the mail.  The timestamp information is poor
because the perf top output was buffered, so a bunch of updates would be
received at once. Each sample should be roughly 2 seconds apart. This
buffering can be dealt with; I just failed to do it in advance and I do
not think it's necessary to rerun the tests for it.
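
Capturing something like this is roughly the following (a sketch, not the
exact monitor mmtests uses; stdbuf is one way the buffering could be dealt
with):

  # Prefix each perf top refresh with a timestamp, line-buffered so the
  # timestamps stay aligned with the samples they belong to.
  stdbuf -oL perf top --stdio -d 2 2>&1 | while IFS= read -r line; do
          case "$line" in
          *PerfTop:*) echo "time: $(date +%s)" ;;
          esac
          printf '%s\n' "$line"
  done > perf-top.log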

> (*) The stack usage below submit_bio() can be more than 5k (DM, MD,
> SCSI, driver, memory allocation), so it's really not safe to do
> allocation anywhere below about 3k of kernel stack being used. e.g.
> on a relatively trivial storage setup without the above commit:
> 
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
> 
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.
> 

I understand the patch's motivation. For these tests I'm being deliberately
a bit of a dummy and just capturing information. This might allow me to
actually get through all the results and identify some of the problems
and spread them around a bit. Either that or I need to clone myself a few
times to tackle each of the problems in a reasonable timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to
http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results with no monitors attached can be found at paths like

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are in run-monitor. You
can read the iostat logs for example from

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.

This is perf top over time for the vanilla kernel

time: 1341306570

time: 1341306579
   PerfTop:       1 irqs/sec  kernel: 0.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    61.85%  [kernel]        [k] __rmqueue  
    38.15%  libc-2.11.3.so  [.] _IO_vfscanf

time: 1341306579
   PerfTop:       3 irqs/sec  kernel:66.7%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    19.88%  [kernel]        [k] _raw_spin_lock_irqsave  
    17.14%  [kernel]        [k] __rmqueue               
    16.96%  [kernel]        [k] format_decode           
    15.37%  libc-2.11.3.so  [.] __tzfile_compute        
    13.55%  [kernel]        [k] copy_user_generic_string
    10.57%  libc-2.11.3.so  [.] _IO_vfscanf             
     6.53%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:       0 irqs/sec  kernel:-nan%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    17.51%  [kernel]        [k] _raw_spin_lock_irqsave  
    15.10%  [kernel]        [k] __rmqueue               
    14.94%  [kernel]        [k] format_decode           
    13.54%  libc-2.11.3.so  [.] __tzfile_compute        
    11.94%  [kernel]        [k] copy_user_generic_string
    11.90%  [kernel]        [k] _raw_spin_lock          
     9.31%  libc-2.11.3.so  [.] _IO_vfscanf             
     5.75%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:      41 irqs/sec  kernel:58.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    13.62%  [kernel]          [k] _raw_spin_lock_irqsave   
    11.02%  [kernel]          [k] __rmqueue                
    10.91%  [kernel]          [k] format_decode            
     9.89%  libc-2.11.3.so    [.] __tzfile_compute         
     8.72%  [kernel]          [k] copy_user_generic_string 
     8.69%  [kernel]          [k] _raw_spin_lock           
     7.15%  libc-2.11.3.so    [.] _IO_vfscanf              
     4.20%  [kernel]          [k] find_first_bit           
     1.47%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.37%  libc-2.11.3.so    [.] __strchr_sse42           
     1.19%  sed               [.] 0x0000000000009f7d       
     0.90%  libc-2.11.3.so    [.] vfprintf                 
     0.84%  [kernel]          [k] hrtimer_interrupt        
     0.84%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.76%  [kernel]          [k] enqueue_entity           
     0.66%  [kernel]          [k] __switch_to              
     0.65%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.62%  [kernel]          [k] do_vfs_ioctl             
     0.59%  [kernel]          [k] perf_event_mmap_event    
     0.56%  gzip              [.] 0x0000000000007b96       
     0.55%  libc-2.11.3.so    [.] bsearch                  

time: 1341306579
   PerfTop:      35 irqs/sec  kernel:62.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.50%  [kernel]          [k] _raw_spin_lock_irqsave   
     9.22%  [kernel]          [k] __rmqueue                
     9.13%  [kernel]          [k] format_decode            
     8.27%  libc-2.11.3.so    [.] __tzfile_compute         
     7.92%  [kernel]          [k] copy_user_generic_string 
     7.74%  [kernel]          [k] _raw_spin_lock           
     6.21%  libc-2.11.3.so    [.] _IO_vfscanf              
     3.51%  [kernel]          [k] find_first_bit           
     1.44%  gzip              [.] 0x0000000000007b96       
     1.23%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.15%  libc-2.11.3.so    [.] __strchr_sse42           
     1.06%  libc-2.11.3.so    [.] vfprintf                 
     0.99%  sed               [.] 0x0000000000009f7d       
     0.92%  [unknown]         [.] 0x00007f84a7766b99       
     0.70%  [kernel]          [k] hrtimer_interrupt        
     0.70%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.64%  [kernel]          [k] enqueue_entity           
     0.58%  libtcl8.5.so      [.] 0x000000000006fe86       
     0.55%  [kernel]          [k] __switch_to              
     0.54%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.53%  [kernel]          [k] __d_lookup_rcu           

time: 1341306585
   PerfTop:     100 irqs/sec  kernel:59.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     8.61%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.92%  [kernel]          [k] __rmqueue                      
     5.86%  [kernel]          [k] format_decode                  
     5.31%  libc-2.11.3.so    [.] __tzfile_compute               
     5.30%  [kernel]          [k] copy_user_generic_string       
     5.27%  [kernel]          [k] _raw_spin_lock                 
     3.99%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.45%  [unknown]         [.] 0x00007f84a7766b99             
     2.26%  [kernel]          [k] find_first_bit                 
     1.68%  [kernel]          [k] page_fault                     
     1.45%  libc-2.11.3.so    [.] _int_malloc                    
     1.28%  gzip              [.] 0x0000000000007b96             
     1.13%  libc-2.11.3.so    [.] vfprintf                       
     1.06%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.02%  perl              [.] 0x0000000000044505             
     0.79%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.79%  [kernel]          [k] do_task_stat                   
     0.77%  [kernel]          [k] zap_pte_range                  
     0.72%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.70%  libc-2.11.3.so    [.] malloc                         
     0.70%  libc-2.11.3.so    [.] __mbrtowc                      

time: 1341306585
   PerfTop:      19 irqs/sec  kernel:78.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.97%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.48%  [kernel]          [k] __rmqueue                      
     5.43%  [kernel]          [k] format_decode                  
     5.24%  [kernel]          [k] copy_user_generic_string       
     5.18%  [kernel]          [k] _raw_spin_lock                 
     4.92%  libc-2.11.3.so    [.] __tzfile_compute               
     4.25%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.33%  [unknown]         [.] 0x00007f84a7766b99             
     2.12%  [kernel]          [k] page_fault                     
     2.09%  [kernel]          [k] find_first_bit                 
     1.34%  libc-2.11.3.so    [.] _int_malloc                    
     1.19%  gzip              [.] 0x0000000000007b96             
     1.05%  libc-2.11.3.so    [.] vfprintf                       
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.94%  perl              [.] 0x0000000000044505             
     0.94%  libc-2.11.3.so    [.] _dl_addr                       
     0.91%  [kernel]          [k] zap_pte_range                  
     0.74%  [kernel]          [k] s_show                         
     0.73%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.73%  [kernel]          [k] do_task_stat                   
     0.67%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal

time: 1341306585
   PerfTop:      38 irqs/sec  kernel:68.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     4.89%  [kernel]          [k] _raw_spin_lock             
     4.77%  [kernel]          [k] __rmqueue                  
     4.72%  [kernel]          [k] format_decode              
     4.56%  [kernel]          [k] copy_user_generic_string   
     4.53%  libc-2.11.3.so    [.] _IO_vfscanf                
     4.28%  libc-2.11.3.so    [.] __tzfile_compute           
     2.52%  [unknown]         [.] 0x00007f84a7766b99         
no symbols found in /bin/sort, maybe install a debug package?
     2.10%  [kernel]          [k] page_fault                 
     1.82%  [kernel]          [k] find_first_bit             
     1.31%  libc-2.11.3.so    [.] _int_malloc                
     1.14%  libc-2.11.3.so    [.] vfprintf                   
     1.08%  libc-2.11.3.so    [.] _dl_addr                   
     1.07%  [kernel]          [k] s_show                     
     1.05%  libc-2.11.3.so    [.] __strchr_sse42             
     1.03%  gzip              [.] 0x0000000000007b96         
     0.82%  [kernel]          [k] do_task_stat               
     0.82%  perl              [.] 0x0000000000044505         
     0.79%  [kernel]          [k] zap_pte_range              
     0.70%  [kernel]          [k] seq_put_decimal_ull        
     0.69%  [kernel]          [k] find_busiest_group         

time: 1341306591
   PerfTop:      66 irqs/sec  kernel:59.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.52%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.11%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.91%  [kernel]          [k] _raw_spin_lock                 
     3.50%  [kernel]          [k] copy_user_generic_string       
     3.41%  [kernel]          [k] __rmqueue                      
     3.38%  [kernel]          [k] format_decode                  
     3.06%  libc-2.11.3.so    [.] __tzfile_compute               
     2.90%  [unknown]         [.] 0x00007f84a7766b99             
     2.30%  [kernel]          [k] page_fault                     
     2.20%  perl              [.] 0x0000000000044505             
     1.83%  libc-2.11.3.so    [.] vfprintf                       
     1.61%  libc-2.11.3.so    [.] _int_malloc                    
     1.30%  [kernel]          [k] find_first_bit                 
     1.22%  libc-2.11.3.so    [.] _dl_addr                       
     1.19%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.10%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.01%  [kernel]          [k] zap_pte_range                  
     0.99%  [kernel]          [k] s_show                         
     0.98%  [kernel]          [k] __percpu_counter_add           
     0.86%  [kernel]          [k] __strnlen_user                 
     0.75%  ld-2.11.3.so      [.] do_lookup_x                    

time: 1341306591
   PerfTop:      39 irqs/sec  kernel:69.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.26%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.05%  [kernel]          [k] _raw_spin_lock                 
     3.86%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.21%  [kernel]          [k] copy_user_generic_string       
     3.03%  [kernel]          [k] __rmqueue                      
     3.00%  [kernel]          [k] format_decode                  
     2.93%  [unknown]         [.] 0x00007f84a7766b99             
     2.72%  libc-2.11.3.so    [.] __tzfile_compute               
     2.20%  [kernel]          [k] page_fault                     
     1.96%  perl              [.] 0x0000000000044505             
     1.77%  libc-2.11.3.so    [.] vfprintf                       
     1.43%  libc-2.11.3.so    [.] _int_malloc                    
     1.16%  [kernel]          [k] find_first_bit                 
     1.09%  libc-2.11.3.so    [.] _dl_addr                       
     1.06%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.02%  [kernel]          [k] s_show                         
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.93%  gzip              [.] 0x0000000000007b96             
     0.90%  [kernel]          [k] zap_pte_range                  
     0.87%  [kernel]          [k] __percpu_counter_add           
     0.76%  [kernel]          [k] __strnlen_user                 

time: 1341306591
   PerfTop:     185 irqs/sec  kernel:70.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     4.81%  [kernel]          [k] _raw_spin_lock_irqsave         
     3.60%  [unknown]         [.] 0x00007f84a7766b99             
     3.10%  [kernel]          [k] _raw_spin_lock                 
     3.04%  [kernel]          [k] page_fault                     
     2.66%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.14%  [kernel]          [k] copy_user_generic_string       
     2.11%  [kernel]          [k] format_decode                  
     1.96%  [kernel]          [k] __rmqueue                      
     1.86%  libc-2.11.3.so    [.] _dl_addr                       
     1.76%  libc-2.11.3.so    [.] __tzfile_compute               
     1.26%  perl              [.] 0x0000000000044505             
     1.19%  libc-2.11.3.so    [.] __mbrtowc                      
     1.14%  libc-2.11.3.so    [.] vfprintf                       
     1.12%  libc-2.11.3.so    [.] _int_malloc                    
     1.09%  gzip              [.] 0x0000000000007b96             
     0.95%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.88%  [kernel]          [k] _raw_spin_unlock_irqrestore    
     0.87%  [kernel]          [k] __strnlen_user                 
     0.82%  [kernel]          [k] clear_page_c                   
     0.77%  [kernel]          [k] __schedule                     
     0.76%  [kernel]          [k] find_get_page                  

time: 1341306595
   PerfTop:     385 irqs/sec  kernel:48.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    27.20%  cc1               [.] 0x0000000000210978         
     3.01%  [unknown]         [.] 0x00007f84a7766b99         
     2.18%  [kernel]          [k] page_fault                 
     1.96%  libbfd-2.21.so    [.] 0x00000000000b9cdd         
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.91%  ld.bfd            [.] 0x000000000000e3b9         
     1.85%  [kernel]          [k] _raw_spin_lock             
     1.31%  [kernel]          [k] copy_user_generic_string   
     1.20%  libbfd-2.21.so    [.] bfd_hash_lookup            
     1.10%  libc-2.11.3.so    [.] __strcmp_sse42             
     0.93%  libc-2.11.3.so    [.] _IO_vfscanf                
     0.85%  [kernel]          [k] _raw_spin_unlock_irqrestore
     0.82%  libc-2.11.3.so    [.] _int_malloc                
     0.80%  [kernel]          [k] __rmqueue                  
     0.79%  [kernel]          [k] kmem_cache_alloc           
     0.74%  [kernel]          [k] format_decode              
     0.71%  libc-2.11.3.so    [.] _dl_addr                   
     0.62%  libbfd-2.21.so    [.] _bfd_final_link_relocate   
     0.61%  libc-2.11.3.so    [.] __tzfile_compute           
     0.61%  libc-2.11.3.so    [.] vfprintf                   
     0.59%  [kernel]          [k] find_busiest_group         

time: 1341306595
   PerfTop:    1451 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     9.75%  cc1               [.] 0x0000000000210978         
     8.81%  [unknown]         [.] 0x00007f84a7766b99         
     4.62%  [kernel]          [k] page_fault                 
     3.61%  [kernel]          [k] _raw_spin_lock             
     2.67%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     2.00%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [xfs]             [k] _xfs_buf_find              
     1.31%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.16%  [kernel]          [k] kmem_cache_free            
     1.15%  [xfs]             [k] xfs_next_bit               
     0.98%  [kernel]          [k] __d_lookup                 
     0.89%  [xfs]             [k] xfs_da_do_buf              
     0.83%  [kernel]          [k] memset                     
     0.80%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.79%  [kernel]          [k] link_path_walk             
     0.76%  [xfs]             [k] xfs_buf_item_size          
no symbols found in /usr/bin/tee, maybe install a debug package?
no symbols found in /bin/date, maybe install a debug package?
     0.73%  [xfs]             [k] xfs_buf_offset             
     0.71%  [kernel]          [k] __kmalloc                  
     0.70%  [kernel]          [k] kfree                      
     0.70%  libbfd-2.21.so    [.] 0x00000000000b9cdd         

time: 1341306601
   PerfTop:    1267 irqs/sec  kernel:85.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.81%  [unknown]         [.] 0x00007f84a7766b99         
     5.98%  cc1               [.] 0x0000000000210978         
     5.20%  [kernel]          [k] page_fault                 
     3.54%  [kernel]          [k] _raw_spin_lock             
     3.37%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.75%  [xfs]             [k] _xfs_buf_find              
     1.35%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.28%  [xfs]             [k] xfs_next_bit               
     1.14%  [kernel]          [k] kmem_cache_free            
     1.13%  [kernel]          [k] __kmalloc                  
     1.12%  [kernel]          [k] __d_lookup                 
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.96%  [xfs]             [k] xfs_buf_offset             
     0.95%  [kernel]          [k] memset                     
     0.91%  [kernel]          [k] link_path_walk             
     0.88%  [xfs]             [k] xfs_da_do_buf              
     0.85%  [kernel]          [k] kfree                      
     0.84%  [xfs]             [k] xfs_buf_item_size          
     0.74%  [xfs]             [k] xfs_btree_lookup           

time: 1341306601
   PerfTop:    1487 irqs/sec  kernel:85.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.84%  [unknown]         [.] 0x00007f84a7766b99         
     5.15%  [kernel]          [k] page_fault                 
     3.93%  cc1               [.] 0x0000000000210978         
     3.76%  [kernel]          [k] _raw_spin_lock             
     3.50%  [kernel]          [k] memcpy                     
     2.13%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [xfs]             [k] _xfs_buf_find              
     1.79%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.52%  [kernel]          [k] __kmalloc                  
     1.33%  [kernel]          [k] kmem_cache_free            
     1.32%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.29%  [kernel]          [k] __d_lookup                 
     1.27%  [xfs]             [k] xfs_next_bit               
     1.11%  [kernel]          [k] link_path_walk             
     1.01%  [xfs]             [k] xfs_buf_offset             
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_da_do_buf              
     0.97%  [kernel]          [k] kfree                      
     0.96%  [kernel]          [k] memset                     
     0.84%  [xfs]             [k] xfs_btree_lookup           
     0.82%  [xfs]             [k] xfs_buf_item_format        

time: 1341306601
   PerfTop:    1291 irqs/sec  kernel:85.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.21%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] page_fault                 
     3.83%  [kernel]          [k] _raw_spin_lock             
     3.67%  [kernel]          [k] memcpy                     
     2.92%  cc1               [.] 0x0000000000210978         
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] _xfs_buf_find              
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.43%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_next_bit               
     1.13%  [xfs]             [k] xfs_buf_offset             
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_da_do_buf              
     1.01%  [kernel]          [k] memset                     
     1.01%  [kernel]          [k] kfree                      
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.87%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1435 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.06%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     3.88%  [kernel]          [k] _raw_spin_lock             
     3.83%  [kernel]          [k] memcpy                     
     2.41%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     2.19%  cc1               [.] 0x0000000000210978         
     1.68%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.55%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] kmem_cache_free            
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.37%  [xfs]             [k] xfs_next_bit               
     1.27%  [xfs]             [k] xfs_buf_offset             
     1.12%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] kfree                      
     1.08%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_da_do_buf              
     0.99%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_buf_item_size          
     0.89%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1281 irqs/sec  kernel:87.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.44%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] _raw_spin_lock             
     3.94%  [kernel]          [k] memcpy                     
     2.51%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.75%  cc1               [.] 0x0000000000210978         
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [kernel]          [k] __d_lookup                 
     1.56%  [kernel]          [k] __kmalloc                  
     1.46%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] kmem_cache_free            
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.20%  [kernel]          [k] link_path_walk             
     1.16%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.89%  [xfs]             [k] xfs_buf_item_size          

time: 1341306607
   PerfTop:    1455 irqs/sec  kernel:86.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.14%  [unknown]         [.] 0x00007f84a7766b99         
     5.36%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] _raw_spin_lock             
     4.02%  [kernel]          [k] memcpy                     
     2.54%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.69%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.47%  [kernel]          [k] __d_lookup                 
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.39%  cc1               [.] 0x0000000000210978         
     1.37%  [kernel]          [k] kmem_cache_free            
     1.24%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] memset                     
     1.16%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_btree_lookup           

time: 1341306613
   PerfTop:    1245 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     4.10%  [kernel]          [k] _raw_spin_lock             
     4.06%  [kernel]          [k] memcpy                     
     2.74%  [xfs]             [k] _xfs_buf_find              
     2.40%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_next_bit               
     1.54%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] __d_lookup                 
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.41%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.22%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.15%  cc1               [.] 0x0000000000210978         
     1.02%  [xfs]             [k] xfs_buf_item_size          
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1433 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.04%  [unknown]         [.] 0x00007f84a7766b99         
     5.30%  [kernel]          [k] page_fault                 
     4.08%  [kernel]          [k] memcpy                     
     4.07%  [kernel]          [k] _raw_spin_lock             
     2.88%  [xfs]             [k] _xfs_buf_find              
     2.50%  [kernel]          [k] kmem_cache_alloc           
     1.72%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [xfs]             [k] xfs_next_bit               
     1.56%  [kernel]          [k] __d_lookup                 
     1.54%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.46%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.21%  [kernel]          [k] memset                     
     1.18%  [kernel]          [k] kfree                      
     1.04%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  cc1               [.] 0x0000000000210978         
     0.90%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1118 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

no symbols found in /usr/bin/vmstat, maybe install a debug package?
    12.03%  [unknown]         [.] 0x00007f84a7766b99         
     5.48%  [kernel]          [k] page_fault                 
     4.21%  [kernel]          [k] memcpy                     
     4.11%  [kernel]          [k] _raw_spin_lock             
     2.98%  [xfs]             [k] _xfs_buf_find              
     2.47%  [kernel]          [k] kmem_cache_alloc           
     1.81%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] kmem_cache_free            
     1.48%  [kernel]          [k] __d_lookup                 
     1.47%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.23%  [kernel]          [k] link_path_walk             
     1.19%  [kernel]          [k] memset                     
     1.19%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_buf_item_format        
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306617
   PerfTop:    1454 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.42%  [kernel]          [k] page_fault                 
     4.28%  [kernel]          [k] memcpy                     
     4.20%  [kernel]          [k] _raw_spin_lock             
     3.15%  [xfs]             [k] _xfs_buf_find              
     2.52%  [kernel]          [k] kmem_cache_alloc           
     1.76%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.51%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.48%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.29%  [kernel]          [k] memset                     
     1.20%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] kfree                      
     1.09%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_buf_item_format        

time: 1341306617
   PerfTop:    1758 irqs/sec  kernel:90.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.99%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] memcpy                     
     3.86%  [xfs]             [k] _xfs_buf_find              
     2.31%  [kernel]          [k] kmem_cache_alloc           
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [kernel]          [k] __d_lookup                 
     1.60%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.35%  [kernel]          [k] kmem_cache_free            
     1.17%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_perag_put              
     0.90%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1022 irqs/sec  kernel:88.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.34%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.01%  [kernel]          [k] memcpy                     
     4.01%  [xfs]             [k] _xfs_buf_find              
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.00%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.41%  [kernel]          [k] __kmalloc                  
     1.39%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] kfree                      
     1.15%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_perag_put              
     0.88%  [xfs]             [k] xfs_buf_item_size          
     0.86%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1430 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.24%  [kernel]          [k] _raw_spin_lock             
     4.89%  [kernel]          [k] page_fault                 
     4.13%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.54%  [xfs]             [k] xfs_next_bit               
     1.42%  [kernel]          [k] __kmalloc                  
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.10%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_buf_item_size          
     0.87%  [xfs]             [k] xfs_da_do_buf              
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306623
   PerfTop:    1267 irqs/sec  kernel:87.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.08%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.61%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.31%  [xfs]             [k] xfs_buf_offset             
no symbols found in /usr/bin/iostat, maybe install a debug package?
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_btree_lookup           
     0.95%  [xfs]             [k] xfs_buf_item_size          
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306629
   PerfTop:    1399 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.12%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.23%  [kernel]          [k] memcpy                     
     4.03%  [xfs]             [k] _xfs_buf_find              
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.96%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.69%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.45%  [kernel]          [k] __kmalloc                  
     1.35%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.17%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_btree_lookup           
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_buf_item_format        

time: 1341306629
   PerfTop:    1225 irqs/sec  kernel:87.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.15%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.22%  [kernel]          [k] memcpy                     
     4.19%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.71%  [kernel]          [k] __d_lookup                 
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.54%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.36%  [kernel]          [k] kmem_cache_free            
     1.28%  [xfs]             [k] xfs_buf_offset             
     1.14%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     

time: 1341306629
   PerfTop:    1400 irqs/sec  kernel:87.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.23%  [unknown]         [.] 0x00007f84a7766b99         
     5.07%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.27%  [xfs]             [k] _xfs_buf_find              
     4.18%  [kernel]          [k] memcpy                     
     2.31%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.73%  [kernel]          [k] __d_lookup                 
     1.66%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.49%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] kfree                      
     1.07%  [kernel]          [k] memset                     
     1.07%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     

time: 1341306635
   PerfTop:    1251 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.29%  [xfs]             [k] _xfs_buf_find              
     4.19%  [kernel]          [k] memcpy                     
     2.26%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.64%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.10%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.96%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1429 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.18%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.28%  [xfs]             [k] _xfs_buf_find              
     4.21%  [kernel]          [k] memcpy                     
     2.23%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.52%  [kernel]          [k] __kmalloc                  
     1.52%  [xfs]             [k] xfs_next_bit               
     1.36%  [kernel]          [k] kmem_cache_free            
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] link_path_walk             
     1.11%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [kernel]          [k] s_show                     
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1232 irqs/sec  kernel:88.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.11%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.33%  [xfs]             [k] _xfs_buf_find              
     4.16%  [kernel]          [k] memcpy                     
     2.24%  [kernel]          [k] kmem_cache_alloc           
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.82%  [kernel]          [k] __d_lookup                 
     1.65%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.34%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.13%  [kernel]          [k] link_path_walk             
     1.13%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.06%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1444 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.19%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.95%  [kernel]          [k] page_fault                 
     4.40%  [xfs]             [k] _xfs_buf_find              
     4.10%  [kernel]          [k] memcpy                     
     2.20%  [kernel]          [k] kmem_cache_alloc           
     1.93%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.81%  [kernel]          [k] __d_lookup                 
     1.59%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.48%  [kernel]          [k] __kmalloc                  
     1.37%  [kernel]          [k] kmem_cache_free            
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] memset                     
     1.12%  [kernel]          [k] s_show                     
     1.12%  [kernel]          [k] link_path_walk             
     1.10%  [kernel]          [k] kfree                      
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1195 irqs/sec  kernel:90.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.17%  [kernel]          [k] _raw_spin_lock             
     4.44%  [xfs]             [k] _xfs_buf_find              
     4.37%  [kernel]          [k] page_fault                 
     4.37%  [kernel]          [k] memcpy                     
     2.30%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.63%  [kernel]          [k] __d_lookup                 
     1.62%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.59%  [xfs]             [k] xfs_buf_offset             
     1.50%  [kernel]          [k] kmem_cache_free            
     1.50%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.33%  [kernel]          [k] memset                     
     1.28%  [kernel]          [k] kfree                      
     1.11%  [xfs]             [k] xfs_buf_item_size          
     1.07%  [kernel]          [k] s_show                     
     1.07%  [kernel]          [k] link_path_walk             
     0.93%  [xfs]             [k] xfs_btree_lookup           
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_dir2_node_addname_int  

time: 1341306645
   PerfTop:    1097 irqs/sec  kernel:95.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.51%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.63%  [kernel]          [k] memcpy                     
     4.51%  [xfs]             [k] _xfs_buf_find              
     3.32%  [kernel]          [k] page_fault                 
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [xfs]             [k] xfs_buf_offset             
     1.75%  [kernel]          [k] __kmalloc                  
     1.73%  [xfs]             [k] xfs_next_bit               
     1.65%  [kernel]          [k] memset                     
     1.60%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] kfree                      
     1.52%  [kernel]          [k] kmem_cache_free            
     1.44%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.26%  [kernel]          [k] __d_lookup                 
     1.26%  [xfs]             [k] xfs_buf_item_size          
     1.05%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_format        
     0.92%  [kernel]          [k] __d_lookup_rcu             
     0.87%  [kernel]          [k] link_path_walk             

time: 1341306645
   PerfTop:    1038 irqs/sec  kernel:95.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.89%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] memcpy                     
     4.60%  [xfs]             [k] _xfs_buf_find              
     4.52%  [kernel]          [k] _raw_spin_lock             
     2.60%  [kernel]          [k] page_fault                 
     2.42%  [kernel]          [k] kmem_cache_alloc           
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [xfs]             [k] xfs_next_bit               
     1.93%  [xfs]             [k] xfs_buf_offset             
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] memset                     
     1.80%  [kernel]          [k] kmem_cache_free            
     1.68%  [kernel]          [k] kfree                      
     1.47%  [xfs]             [k] xfs_buf_item_size          
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.18%  [kernel]          [k] __d_lookup_rcu             
     1.13%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.12%  [xfs]             [k] xfs_buf_item_format        
     1.04%  [kernel]          [k] s_show                     
     1.01%  [kernel]          [k] __d_lookup                 
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306645
   PerfTop:    1087 irqs/sec  kernel:96.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.27%  [kernel]          [k] memcpy                     
     4.77%  [unknown]         [.] 0x00007f84a7766b99         
     4.69%  [xfs]             [k] _xfs_buf_find              
     4.56%  [kernel]          [k] _raw_spin_lock             
     2.47%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] xfs_next_bit               
     2.11%  [kernel]          [k] page_fault                 
     2.00%  [xfs]             [k] xfs_buf_offset             
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [kernel]          [k] kmem_cache_free            
     1.85%  [kernel]          [k] kfree                      
     1.82%  [kernel]          [k] memset                     
     1.75%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_buf_item_size          
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.23%  [kernel]          [k] __d_lookup_rcu             
     1.21%  [xfs]             [k] xfs_buf_item_format        
     0.99%  [kernel]          [k] s_show                     
     0.97%  [xfs]             [k] xfs_perag_put              
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_trans_ail_cursor_first 

time: 1341306651
   PerfTop:    1157 irqs/sec  kernel:96.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.53%  [kernel]          [k] memcpy                     
     4.61%  [xfs]             [k] _xfs_buf_find              
     4.40%  [kernel]          [k] _raw_spin_lock             
     3.83%  [unknown]         [.] 0x00007f84a7766b99         
     2.67%  [kernel]          [k] kmem_cache_alloc           
     2.32%  [xfs]             [k] xfs_next_bit               
     2.21%  [kernel]          [k] __kmalloc                  
     2.21%  [xfs]             [k] xfs_buf_offset             
     2.19%  [kernel]          [k] kmem_cache_free            
     1.92%  [kernel]          [k] memset                     
     1.89%  [kernel]          [k] kfree                      
     1.80%  [xfs]             [k] xfs_buf_item_size          
     1.70%  [kernel]          [k] page_fault                 
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.50%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.30%  [xfs]             [k] xfs_buf_item_format        
     1.27%  [kernel]          [k] __d_lookup_rcu             
     0.97%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.93%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:    1073 irqs/sec  kernel:95.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.76%  [kernel]          [k] memcpy                     
     4.54%  [xfs]             [k] _xfs_buf_find              
     4.32%  [kernel]          [k] _raw_spin_lock             
     3.15%  [unknown]         [.] 0x00007f84a7766b99         
     2.77%  [kernel]          [k] kmem_cache_alloc           
     2.49%  [xfs]             [k] xfs_next_bit               
     2.36%  [kernel]          [k] __kmalloc                  
     2.27%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.88%  [kernel]          [k] memset                     
     1.88%  [kernel]          [k] kfree                      
     1.77%  [xfs]             [k] xfs_buf_item_size          
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.42%  [xfs]             [k] xfs_buf_item_format        
     1.40%  [kernel]          [k] page_fault                 
     1.39%  [kernel]          [k] __d_lookup_rcu             
     0.99%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.88%  [kernel]          [k] s_show                     
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:     492 irqs/sec  kernel:85.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.74%  [kernel]          [k] memcpy                     
     4.48%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     3.00%  [unknown]         [.] 0x00007f84a7766b99         
     2.76%  [kernel]          [k] kmem_cache_alloc           
     2.54%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
no symbols found in /bin/ps, maybe install a debug package?
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.44%  [xfs]             [k] xfs_buf_item_format        
     1.39%  [kernel]          [k] __d_lookup_rcu             
     1.36%  [kernel]          [k] page_fault                 
     0.99%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      70 irqs/sec  kernel:72.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.73%  [kernel]          [k] memcpy                     
     4.47%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.75%  [kernel]          [k] kmem_cache_alloc           
     2.53%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] __d_lookup_rcu             
     1.37%  [kernel]          [k] page_fault                 
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      87 irqs/sec  kernel:71.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.72%  [kernel]          [k] memcpy                     
     4.45%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.74%  [kernel]          [k] kmem_cache_alloc           
     2.52%  [xfs]             [k] xfs_next_bit               
     2.38%  [kernel]          [k] __kmalloc                  
     2.29%  [kernel]          [k] kmem_cache_free            
     2.19%  [xfs]             [k] xfs_buf_offset             
     1.95%  [kernel]          [k] kfree                      
     1.91%  [kernel]          [k] memset                     
     1.74%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] page_fault                 
     1.38%  [kernel]          [k] __d_lookup_rcu             
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.93%  [kernel]          [k] s_show                     
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      88 irqs/sec  kernel:68.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.69%  [kernel]          [k] memcpy                     
     4.42%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.98%  [unknown]         [.] 0x00007f84a7766b99         

-- 
Mel Gorman
SUSE Labs


* Re: [MMTests] IO metadata on XFS
@ 2012-07-03 10:59                 ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 10:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Keith Packard, Chris Wilson, Daniel Vetter, linux-kernel,
	dri-devel, xfs, Christoph Hellwig, linux-mm, linux-fsdevel,
	Eugeni Dodonov

On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > Adding dri-devel and a few others because an i915 patch contributed to
> > the regression.
> > 
> > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > you have lots of CPU power, there will be little difference in
> > > > > performance...
> > > > 
> > > > When I checked, it could only be called twice, and we'd already
> > > > optimized away the second call.  I'd definitely like to track down where
> > > > the performance changes happened, at least to a major version but even
> > > > better to a -rc or git commit.
> > > > 
> > > 
> > > By all means feel free to run the test yourself and run the bisection :)
> > > 
> > > It's rare but on this occasion the test machine is idle so I started an
> > > automated git bisection. As you know, the mileage with an automated bisect
> > > varies so it may or may not find the right commit. Test machine is sandy so
> > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > is the report of interest. The script is doing a full search between v3.3 and
> > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > apparently unrelated patch that caused the problem.
> > > 
> > 
> > It was obvious very quickly that there were two distinct regressions so I
> > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > that enables RC6 to reduce power usage.
> > 
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> 
> Doesn't seem to be the major cause of the regression. By itself, it
> has impact, but the majority comes from the XFS change...
> 

The fact that it has an impact at all is weird, but let's see what the DRI
folks think about it.
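
For reference, a minimal sketch of the kind of automated bisection
described above (the check-fsmark.sh helper, the build step and the
log-scraping pattern are illustrative guesses, not the actual mmtests
driver):

    git bisect start v3.4 v3.3      # v3.4 is bad, v3.3 is good
    git bisect run ./check-fsmark.sh

where check-fsmark.sh is along the lines of:

    #!/bin/sh
    # exit 0 = good, 1 = bad, 125 = skip this commit
    make -j8 || exit 125            # unbuildable commit, skip it
    # ... install the kernel, reboot into it, run fsmark-single (elided) ...
    # the awk pattern is only a guess at the fsmark log format
    fps=$(awk '/files\/sec/ { print $NF; exit }' fsmark.log)
    awk -v f="$fps" 'BEGIN { exit !(f >= 25000) }'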

> > [c999a223: xfs: introduce an allocation workqueue]
> 
> Which indicates that there are workqueue scheduling issues, I think.
> The same amount of work is being done, but half of it is being
> pushed off into a workqueue to avoid stack overflow issues (*).  I
> tested the above patch in anger on an 8p machine, similar to the
> machine you saw no regressions on, but the workload didn't drive it
> to being completely CPU bound (only about 90%) so the allocation
> work was probably always scheduled quickly.
> 

What test were you using?

> How many worker threads have been spawned on these machines
> that are showing the regression?

20 or 21 generally. An example list, as spotted by top, looks like:

kworker/0:0        
kworker/0:1        
kworker/0:2        
kworker/1:0        
kworker/1:1        
kworker/1:2        
kworker/2:0        
kworker/2:1        
kworker/2:2        
kworker/3:0        
kworker/3:1        
kworker/3:2        
kworker/4:0        
kworker/4:1        
kworker/5:0        
kworker/5:1        
kworker/6:0        
kworker/6:1        
kworker/6:2        
kworker/7:0        
kworker/7:1

There were 8 unbound workers.
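
(For anyone counting along, a rough sketch; the unbound-worker naming
assumed here, kworker/u:N on kernels of this vintage, is an assumption:)

    ps -eo comm | grep -c '^kworker/[0-9]'    # per-CPU workers
    ps -eo comm | grep -c '^kworker/u'        # unbound workers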

> What is the context switch rate on the machines when the test is running?

This is vmstat from a vanilla kernel. The actual vmstat output is after
the --. The fields before that are recorded by mmtests to detect whether
there was jitter in the vmstat sampling. They show that there is little
or no jitter in this test.
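
(As a sketch of that jitter check: the third field is the measured
interval between samples, so it is enough to flag any sample that drifts
from the requested 2 seconds; vmstat.log is an assumed filename:)

    awk '$3 < 1.9 || $3 > 2.1 { print "jitter:", $0 }' vmstat.log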

VANILLA
 1341306582.6713   1.8109     1.8109 --  0  0      0 16050784  11448 104056    0    0   376     0  209  526  0  0 99  1  0
 1341306584.6715   3.8112     2.0003 --  1  0      0 16050628  11448 104064    0    0     0     0  121  608  0  0 100  0  0
 1341306586.6718   5.8114     2.0003 --  0  0      0 16047432  11460 104288    0    0   102    45  227  999  0  0 99  1  0
 1341306588.6721   7.8117     2.0003 --  1  0      0 16046944  11460 104292    0    0     0     0  120  663  0  0 100  0  0
 1341306590.6723   9.8119     2.0002 --  0  2      0 16045788  11476 104296    0    0    12    40  190  754  0  0 99  0  0
 1341306592.6725  11.8121     2.0002 --  0  1      0 15990236  12600 141724    0    0 19054    30 1400 2937  2  1 88  9  0
 1341306594.6727  13.8124     2.0002 --  1  0      0 15907628  12600 186360    0    0  1653     0 3117 6406  2  9 88  1  0
 1341306596.6730  15.8127     2.0003 --  0  0      0 15825964  12608 226636    0    0    15 11024 3073 6350  2  9 89  0  0
 1341306598.6733  17.8130     2.0003 --  1  0      0 15730420  12608 271632    0    0     0  3072 3461 7179  2 10 88  0  0
 1341306600.6736  19.8132     2.0003 --  1  0      0 15686200  12608 310816    0    0     0 12416 3093 6198  2  9 89  0  0
 1341306602.6738  21.8135     2.0003 --  2  0      0 15593588  12616 354928    0    0     0    32 3482 7146  2 11 87  0  0
 1341306604.6741  23.8138     2.0003 --  2  0      0 15562032  12616 393772    0    0     0 12288 3129 6330  2 10 89  0  0
 1341306606.6744  25.8140     2.0002 --  1  0      0 15458316  12624 438004    0    0     0    26 3471 7107  2 11 87  0  0
 1341306608.6746  27.8142     2.0002 --  1  0      0 15432024  12624 474244    0    0     0 12416 3011 6017  1 10 89  0  0
 1341306610.6749  29.8145     2.0003 --  2  0      0 15343280  12624 517696    0    0     0    24 3393 6826  2 11 87  0  0
 1341306612.6751  31.8148     2.0002 --  1  0      0 15311136  12632 551816    0    0     0 16502 2818 5653  2  9 88  1  0
 1341306614.6754  33.8151     2.0003 --  1  0      0 15220648  12632 594936    0    0     0  3584 3451 6779  2 11 87  0  0
 1341306616.6755  35.8152     2.0001 --  4  0      0 15221252  12632 649296    0    0     0 38559 4846 8709  2 15 78  6  0
 1341306618.6758  37.8155     2.0003 --  1  0      0 15177724  12640 668476    0    0    20 40679 2204 4067  1  5 89  5  0
 1341306620.6761  39.8158     2.0003 --  1  0      0 15090204  12640 711752    0    0     0     0 3316 6788  2 11 88  0  0
 1341306622.6764  41.8160     2.0003 --  1  0      0 15005356  12640 748532    0    0     0 12288 3073 6132  2 10 89  0  0
 1341306624.6766  43.8163     2.0002 --  2  0      0 14913088  12648 791952    0    0     0    28 3408 6806  2 11 87  0  0
 1341306626.6769  45.8166     2.0003 --  1  0      0 14891512  12648 826328    0    0     0 12420 2906 5710  1  9 90  0  0
 1341306628.6772  47.8168     2.0003 --  1  0      0 14794316  12656 868936    0    0     0    26 3367 6798  2 11 87  0  0
 1341306630.6774  49.8171     2.0003 --  1  0      0 14769188  12656 905016    0    0    30 12324 3029 5876  2 10 89  0  0
 1341306632.6777  51.8173     2.0002 --  1  0      0 14679544  12656 947712    0    0     0     0 3399 6868  2 11 87  0  0
 1341306634.6780  53.8176     2.0003 --  1  0      0 14646156  12664 982032    0    0     0 14658 2987 5761  1 10 89  0  0
 1341306636.6782  55.8179     2.0003 --  1  0      0 14560504  12664 1023816    0    0     0  4404 3454 6876  2 11 87  0  0
 1341306638.6783  57.8180     2.0001 --  2  0      0 14533384  12664 1056812    0    0     0 15810 3002 5581  1 10 89  0  0
 1341306640.6785  59.8182     2.0002 --  1  0      0 14593332  12672 1027392    0    0     0 31790 3504 1811  1 13 78  8  0
 1341306642.6787  61.8183     2.0001 --  1  0      0 14686968  12672 1007604    0    0     0 14621 2434 1248  1 10 89  0  0
 1341306644.6789  63.8185     2.0002 --  1  1      0 15042476  12680 788104    0    0     0 36564 2809 1484  1 12 86  1  0
 1341306646.6790  65.8187     2.0002 --  1  0      0 15128292  12680 757948    0    0     0 26395 3050 1313  1 13 86  1  0
 1341306648.6792  67.8189     2.0002 --  1  0      0 15160036  12680 727964    0    0     0  5463 2752  910  1 12 87  0  0
 1341306650.6795  69.8192     2.0003 --  0  0      0 15633256  12688 332572    0    0  1156 12308 2117 2346  1  7 91  1  0
 1341306652.6797  71.8194     2.0002 --  0  0      0 15633892  12688 332652    0    0     0     0  224  758  0  0 100  0  0
 1341306654.6800  73.8197     2.0003 --  0  0      0 15633900  12688 332524    0    0     0     0  231 1009  0  0 100  0  0
 1341306656.6803  75.8199     2.0003 --  0  0      0 15637436  12696 332504    0    0     0    38  266  713  0  0 99  0  0
 1341306658.6805  77.8202     2.0003 --  0  0      0 15654180  12696 332352    0    0     0     0  270  821  0  0 100  0  0

REVERT-XFS
 1341307733.8702   1.7941     1.7941 --  0  0      0 16050640  12036 103996    0    0   372     0  216  752  0  0 99  1  0
 1341307735.8704   3.7944     2.0002 --  0  0      0 16050864  12036 104028    0    0     0     0  132  857  0  0 100  0  0
 1341307737.8707   5.7946     2.0002 --  0  0      0 16047492  12048 104252    0    0   102    37  255  938  0  0 99  1  0
 1341307739.8709   7.7949     2.0003 --  0  0      0 16047600  12072 104324    0    0    32     2  129  658  0  0 100  0  0
 1341307741.8712   9.7951     2.0002 --  1  1      0 16046676  12080 104328    0    0     0    32  165  729  0  0 100  0  0
 1341307743.8714  11.7954     2.0003 --  0  1      0 15990840  13216 142612    0    0 19422    30 1467 3015  2  1 89  8  0
 1341307745.8717  13.7956     2.0002 --  0  0      0 15825496  13216 226396    0    0  1310 11214 2217 1348  2  8 89  1  0
 1341307747.8717  15.7957     2.0001 --  1  0      0 15677816  13224 314672    0    0     4 15294 2307 1173  2  9 89  0  0
 1341307749.8719  17.7959     2.0002 --  1  0      0 15524372  13224 409728    0    0     0 12288 2466  888  1 10 89  0  0
 1341307751.8721  19.7960     2.0002 --  1  0      0 15368424  13224 502552    0    0     0 12416 2312  878  1 10 89  0  0
 1341307753.8722  21.7962     2.0002 --  1  0      0 15225216  13232 593092    0    0     0 12448 2539 1380  1 10 88  0  0
 1341307755.8724  23.7963     2.0002 --  2  0      0 15163712  13232 664768    0    0     0 32160 2184 1177  1  8 90  0  0
 1341307757.8727  25.7967     2.0003 --  1  0      0 14973888  13240 755080    0    0     0 12316 2482 1219  1 10 89  0  0
 1341307759.8728  27.7968     2.0001 --  1  0      0 14883580  13240 840036    0    0     0 44471 2711 1234  2 10 88  0  0
 1341307761.8730  29.7970     2.0002 --  1  0      0 14800304  13240 920504    0    0     0 42554 2571 1050  1 10 89  0  0
 1341307763.8734  31.7973     2.0003 --  0  0      0 14642504  13248 995004    0    0     0  3232 2276 1081  1  8 90  0  0
 1341307765.8737  33.7976     2.0003 --  1  0      0 14545072  13248 1052536    0    0     0 18688 2628 1114  1  9 89  0  0
 1341307767.8739  35.7979     2.0003 --  1  0      0 14783848  13248 926824    0    0     0 59559 2409 1308  0 10 89  1  0
 1341307769.8740  37.7980     2.0001 --  2  0      0 14854800  13256 896832    0    0     0  9172 2419 1004  1 10 89  1  0
 1341307771.8742  39.7981     2.0002 --  2  0      0 14835084  13256 875612    0    0     0 12288 2524  812  0 11 89  0  0
 1341307773.8743  41.7983     2.0002 --  2  0      0 15126252  13256 745844    0    0     0 10297 2714 1163  1 12 88  0  0
 1341307775.8745  43.7985     2.0002 --  1  0      0 15108800  13264 724544    0    0     0 12316 2499  931  1 11 88  0  0
 1341307777.8746  45.7986     2.0001 --  2  0      0 15226236  13264 694580    0    0     0 12416 2700 1194  1 12 88  0  0
 1341307779.8750  47.7989     2.0003 --  1  0      0 15697632  13264 300716    0    0  1156     0  934 1701  0  2 96  1  0
 1341307781.8752  49.7992     2.0003 --  0  0      0 15697508  13272 300720    0    0     0    66  166  641  0  0 100  0  0
 1341307783.8755  51.7995     2.0003 --  0  0      0 15699008  13272 300524    0    0     0     0  248  865  0  0 100  0  0
 1341307785.8758  53.7997     2.0003 --  0  0      0 15702452  13272 300520    0    0     0     0  285  960  0  0 99  0  0
 1341307787.8760  55.7999     2.0002 --  0  0      0 15719404  13280 300436    0    0     0    26  136  590  0  0 99  0  0

Vanilla average context switch rate	4278.53
Revert average context switch rate	1095
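
(Those averages are just the mean of the cs column; with the three
mmtests prefix fields and the -- separator ahead of the standard 17
vmstat fields, cs lands in field 16. A sketch, with vmstat.log as an
assumed filename:)

    awk '{ sum += $16; n++ } END { printf "%.2f\n", sum / n }' vmstat.log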

> Can you run latencytop to see
> if there is excessive starvation/wait times for allocation
> completion?

I'm not sure what format you are looking for.  latencytop is shit for
capturing information throughout a test, and it does not easily allow you
to record a snapshot at a given point. You can record all the console
output of course, but that's a complete mess. I tried capturing
/proc/latency_stats over time instead, because that can be trivially
sorted on a system-wide basis, but as I write this I find that
latency_stats was bust. It was just spitting out

Latency Top version : v0.1

and nothing else.  Either latency_stats is broken or my config is; I'm
not sure which right now and won't get enough time on this today to
pinpoint it.
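
One guess worth ruling out: /proc/latency_stats only accumulates data
while the latencytop sysctl is enabled (and CONFIG_LATENCYTOP is set),
so a collector along these lines may be all that is missing:

    sysctl -w kernel.latencytop=1
    while true; do
        date +%s >> latency_stats.log
        cat /proc/latency_stats >> latency_stats.log
        sleep 10
    done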

> A perf top profile comparison might be informative,
> too...
> 

I'm not sure if this is what you really wanted. I thought an oprofile or
perf report would have made more sense, but I recorded perf top over time
anyway and it's at the end of the mail.  The timestamp information is poor
because the perf top output was buffered, so the capture would receive a
bunch of updates at once. Each sample should be roughly 2 seconds apart.
The buffering can be dealt with; I just failed to do it in advance and I
do not think it's necessary to rerun the tests for it.
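
For what it's worth, the buffering can be avoided by line-buffering the
pipe and stamping each line as it arrives; a sketch, assuming stdbuf
from coreutils is available:

    stdbuf -oL perf top --stdio 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done > perf-top.log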

> (*) The stack usage below submit_bio() can be more than 5k (DM, MD,
> SCSI, driver, memory allocation), so it's really not safe to do
> allocation anywhere below about 3k of kernel stack being used. e.g.
> on a relatively trivial storage setup without the above commit:
> 
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
> 
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.
> 

I understand the patch's motivation. For these tests I'm deliberately
being a bit of a dummy and just capturing information. This might allow
me to actually get through all the results, identify some of the
problems and spread them around a bit. Either that or I need to clone
myself a few times to tackle each of the problems in a reasonable
timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to
http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results gathered with no monitors can be found at paths like

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are in run-monitor. The iostat logs, for
example, can be read from

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.
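
If you want to pull the headline figures without the mmtests reporting
scripts, something like this works against that layout (the files/sec
pattern is a guess at the fsmark log format):

    tar xzf xfsbisect-logs.tar.gz
    grep -iH 'files/sec' \
        default/no-monitor/sandy/fsmark-single-*/noprofile/fsmark.log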

This is perf top over time for the vanilla kernel

time: 1341306570

time: 1341306579
   PerfTop:       1 irqs/sec  kernel: 0.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    61.85%  [kernel]        [k] __rmqueue  
    38.15%  libc-2.11.3.so  [.] _IO_vfscanf

time: 1341306579
   PerfTop:       3 irqs/sec  kernel:66.7%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    19.88%  [kernel]        [k] _raw_spin_lock_irqsave  
    17.14%  [kernel]        [k] __rmqueue               
    16.96%  [kernel]        [k] format_decode           
    15.37%  libc-2.11.3.so  [.] __tzfile_compute        
    13.55%  [kernel]        [k] copy_user_generic_string
    10.57%  libc-2.11.3.so  [.] _IO_vfscanf             
     6.53%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:       0 irqs/sec  kernel:-nan%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    17.51%  [kernel]        [k] _raw_spin_lock_irqsave  
    15.10%  [kernel]        [k] __rmqueue               
    14.94%  [kernel]        [k] format_decode           
    13.54%  libc-2.11.3.so  [.] __tzfile_compute        
    11.94%  [kernel]        [k] copy_user_generic_string
    11.90%  [kernel]        [k] _raw_spin_lock          
     9.31%  libc-2.11.3.so  [.] _IO_vfscanf             
     5.75%  [kernel]        [k] find_first_bit          

time: 1341306579
   PerfTop:      41 irqs/sec  kernel:58.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    13.62%  [kernel]          [k] _raw_spin_lock_irqsave   
    11.02%  [kernel]          [k] __rmqueue                
    10.91%  [kernel]          [k] format_decode            
     9.89%  libc-2.11.3.so    [.] __tzfile_compute         
     8.72%  [kernel]          [k] copy_user_generic_string 
     8.69%  [kernel]          [k] _raw_spin_lock           
     7.15%  libc-2.11.3.so    [.] _IO_vfscanf              
     4.20%  [kernel]          [k] find_first_bit           
     1.47%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.37%  libc-2.11.3.so    [.] __strchr_sse42           
     1.19%  sed               [.] 0x0000000000009f7d       
     0.90%  libc-2.11.3.so    [.] vfprintf                 
     0.84%  [kernel]          [k] hrtimer_interrupt        
     0.84%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.76%  [kernel]          [k] enqueue_entity           
     0.66%  [kernel]          [k] __switch_to              
     0.65%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.62%  [kernel]          [k] do_vfs_ioctl             
     0.59%  [kernel]          [k] perf_event_mmap_event    
     0.56%  gzip              [.] 0x0000000000007b96       
     0.55%  libc-2.11.3.so    [.] bsearch                  

time: 1341306579
   PerfTop:      35 irqs/sec  kernel:62.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.50%  [kernel]          [k] _raw_spin_lock_irqsave   
     9.22%  [kernel]          [k] __rmqueue                
     9.13%  [kernel]          [k] format_decode            
     8.27%  libc-2.11.3.so    [.] __tzfile_compute         
     7.92%  [kernel]          [k] copy_user_generic_string 
     7.74%  [kernel]          [k] _raw_spin_lock           
     6.21%  libc-2.11.3.so    [.] _IO_vfscanf              
     3.51%  [kernel]          [k] find_first_bit           
     1.44%  gzip              [.] 0x0000000000007b96       
     1.23%  libc-2.11.3.so    [.] __strcmp_sse42           
     1.15%  libc-2.11.3.so    [.] __strchr_sse42           
     1.06%  libc-2.11.3.so    [.] vfprintf                 
     0.99%  sed               [.] 0x0000000000009f7d       
     0.92%  [unknown]         [.] 0x00007f84a7766b99       
     0.70%  [kernel]          [k] hrtimer_interrupt        
     0.70%  libc-2.11.3.so    [.] re_string_realloc_buffers
     0.64%  [kernel]          [k] enqueue_entity           
     0.58%  libtcl8.5.so      [.] 0x000000000006fe86       
     0.55%  [kernel]          [k] __switch_to              
     0.54%  libc-2.11.3.so    [.] _IO_default_xsputn       
     0.53%  [kernel]          [k] __d_lookup_rcu           

time: 1341306585
   PerfTop:     100 irqs/sec  kernel:59.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     8.61%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.92%  [kernel]          [k] __rmqueue                      
     5.86%  [kernel]          [k] format_decode                  
     5.31%  libc-2.11.3.so    [.] __tzfile_compute               
     5.30%  [kernel]          [k] copy_user_generic_string       
     5.27%  [kernel]          [k] _raw_spin_lock                 
     3.99%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.45%  [unknown]         [.] 0x00007f84a7766b99             
     2.26%  [kernel]          [k] find_first_bit                 
     1.68%  [kernel]          [k] page_fault                     
     1.45%  libc-2.11.3.so    [.] _int_malloc                    
     1.28%  gzip              [.] 0x0000000000007b96             
     1.13%  libc-2.11.3.so    [.] vfprintf                       
     1.06%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.02%  perl              [.] 0x0000000000044505             
     0.79%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.79%  [kernel]          [k] do_task_stat                   
     0.77%  [kernel]          [k] zap_pte_range                  
     0.72%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.70%  libc-2.11.3.so    [.] malloc                         
     0.70%  libc-2.11.3.so    [.] __mbrtowc                      

time: 1341306585
   PerfTop:      19 irqs/sec  kernel:78.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.97%  [kernel]          [k] _raw_spin_lock_irqsave         
     5.48%  [kernel]          [k] __rmqueue                      
     5.43%  [kernel]          [k] format_decode                  
     5.24%  [kernel]          [k] copy_user_generic_string       
     5.18%  [kernel]          [k] _raw_spin_lock                 
     4.92%  libc-2.11.3.so    [.] __tzfile_compute               
     4.25%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.33%  [unknown]         [.] 0x00007f84a7766b99             
     2.12%  [kernel]          [k] page_fault                     
     2.09%  [kernel]          [k] find_first_bit                 
     1.34%  libc-2.11.3.so    [.] _int_malloc                    
     1.19%  gzip              [.] 0x0000000000007b96             
     1.05%  libc-2.11.3.so    [.] vfprintf                       
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.94%  perl              [.] 0x0000000000044505             
     0.94%  libc-2.11.3.so    [.] _dl_addr                       
     0.91%  [kernel]          [k] zap_pte_range                  
     0.74%  [kernel]          [k] s_show                         
     0.73%  libc-2.11.3.so    [.] __strcmp_sse42                 
     0.73%  [kernel]          [k] do_task_stat                   
     0.67%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal

time: 1341306585
   PerfTop:      38 irqs/sec  kernel:68.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     4.89%  [kernel]          [k] _raw_spin_lock             
     4.77%  [kernel]          [k] __rmqueue                  
     4.72%  [kernel]          [k] format_decode              
     4.56%  [kernel]          [k] copy_user_generic_string   
     4.53%  libc-2.11.3.so    [.] _IO_vfscanf                
     4.28%  libc-2.11.3.so    [.] __tzfile_compute           
     2.52%  [unknown]         [.] 0x00007f84a7766b99         
no symbols found in /bin/sort, maybe install a debug package?
     2.10%  [kernel]          [k] page_fault                 
     1.82%  [kernel]          [k] find_first_bit             
     1.31%  libc-2.11.3.so    [.] _int_malloc                
     1.14%  libc-2.11.3.so    [.] vfprintf                   
     1.08%  libc-2.11.3.so    [.] _dl_addr                   
     1.07%  [kernel]          [k] s_show                     
     1.05%  libc-2.11.3.so    [.] __strchr_sse42             
     1.03%  gzip              [.] 0x0000000000007b96         
     0.82%  [kernel]          [k] do_task_stat               
     0.82%  perl              [.] 0x0000000000044505         
     0.79%  [kernel]          [k] zap_pte_range              
     0.70%  [kernel]          [k] seq_put_decimal_ull        
     0.69%  [kernel]          [k] find_busiest_group         

time: 1341306591
   PerfTop:      66 irqs/sec  kernel:59.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.52%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.11%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.91%  [kernel]          [k] _raw_spin_lock                 
     3.50%  [kernel]          [k] copy_user_generic_string       
     3.41%  [kernel]          [k] __rmqueue                      
     3.38%  [kernel]          [k] format_decode                  
     3.06%  libc-2.11.3.so    [.] __tzfile_compute               
     2.90%  [unknown]         [.] 0x00007f84a7766b99             
     2.30%  [kernel]          [k] page_fault                     
     2.20%  perl              [.] 0x0000000000044505             
     1.83%  libc-2.11.3.so    [.] vfprintf                       
     1.61%  libc-2.11.3.so    [.] _int_malloc                    
     1.30%  [kernel]          [k] find_first_bit                 
     1.22%  libc-2.11.3.so    [.] _dl_addr                       
     1.19%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.10%  libc-2.11.3.so    [.] __strchr_sse42                 
     1.01%  [kernel]          [k] zap_pte_range                  
     0.99%  [kernel]          [k] s_show                         
     0.98%  [kernel]          [k] __percpu_counter_add           
     0.86%  [kernel]          [k] __strnlen_user                 
     0.75%  ld-2.11.3.so      [.] do_lookup_x                    

time: 1341306591
   PerfTop:      39 irqs/sec  kernel:69.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     6.26%  [kernel]          [k] _raw_spin_lock_irqsave         
     4.05%  [kernel]          [k] _raw_spin_lock                 
     3.86%  libc-2.11.3.so    [.] _IO_vfscanf                    
     3.21%  [kernel]          [k] copy_user_generic_string       
     3.03%  [kernel]          [k] __rmqueue                      
     3.00%  [kernel]          [k] format_decode                  
     2.93%  [unknown]         [.] 0x00007f84a7766b99             
     2.72%  libc-2.11.3.so    [.] __tzfile_compute               
     2.20%  [kernel]          [k] page_fault                     
     1.96%  perl              [.] 0x0000000000044505             
     1.77%  libc-2.11.3.so    [.] vfprintf                       
     1.43%  libc-2.11.3.so    [.] _int_malloc                    
     1.16%  [kernel]          [k] find_first_bit                 
     1.09%  libc-2.11.3.so    [.] _dl_addr                       
     1.06%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     1.02%  [kernel]          [k] s_show                         
     0.98%  libc-2.11.3.so    [.] __strchr_sse42                 
     0.93%  gzip              [.] 0x0000000000007b96             
     0.90%  [kernel]          [k] zap_pte_range                  
     0.87%  [kernel]          [k] __percpu_counter_add           
     0.76%  [kernel]          [k] __strnlen_user                 

time: 1341306591
   PerfTop:     185 irqs/sec  kernel:70.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     4.81%  [kernel]          [k] _raw_spin_lock_irqsave         
     3.60%  [unknown]         [.] 0x00007f84a7766b99             
     3.10%  [kernel]          [k] _raw_spin_lock                 
     3.04%  [kernel]          [k] page_fault                     
     2.66%  libc-2.11.3.so    [.] _IO_vfscanf                    
     2.14%  [kernel]          [k] copy_user_generic_string       
     2.11%  [kernel]          [k] format_decode                  
     1.96%  [kernel]          [k] __rmqueue                      
     1.86%  libc-2.11.3.so    [.] _dl_addr                       
     1.76%  libc-2.11.3.so    [.] __tzfile_compute               
     1.26%  perl              [.] 0x0000000000044505             
     1.19%  libc-2.11.3.so    [.] __mbrtowc                      
     1.14%  libc-2.11.3.so    [.] vfprintf                       
     1.12%  libc-2.11.3.so    [.] _int_malloc                    
     1.09%  gzip              [.] 0x0000000000007b96             
     0.95%  libc-2.11.3.so    [.] __gconv_transform_utf8_internal
     0.88%  [kernel]          [k] _raw_spin_unlock_irqrestore    
     0.87%  [kernel]          [k] __strnlen_user                 
     0.82%  [kernel]          [k] clear_page_c                   
     0.77%  [kernel]          [k] __schedule                     
     0.76%  [kernel]          [k] find_get_page                  

time: 1341306595
   PerfTop:     385 irqs/sec  kernel:48.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    27.20%  cc1               [.] 0x0000000000210978         
     3.01%  [unknown]         [.] 0x00007f84a7766b99         
     2.18%  [kernel]          [k] page_fault                 
     1.96%  libbfd-2.21.so    [.] 0x00000000000b9cdd         
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.91%  ld.bfd            [.] 0x000000000000e3b9         
     1.85%  [kernel]          [k] _raw_spin_lock             
     1.31%  [kernel]          [k] copy_user_generic_string   
     1.20%  libbfd-2.21.so    [.] bfd_hash_lookup            
     1.10%  libc-2.11.3.so    [.] __strcmp_sse42             
     0.93%  libc-2.11.3.so    [.] _IO_vfscanf                
     0.85%  [kernel]          [k] _raw_spin_unlock_irqrestore
     0.82%  libc-2.11.3.so    [.] _int_malloc                
     0.80%  [kernel]          [k] __rmqueue                  
     0.79%  [kernel]          [k] kmem_cache_alloc           
     0.74%  [kernel]          [k] format_decode              
     0.71%  libc-2.11.3.so    [.] _dl_addr                   
     0.62%  libbfd-2.21.so    [.] _bfd_final_link_relocate   
     0.61%  libc-2.11.3.so    [.] __tzfile_compute           
     0.61%  libc-2.11.3.so    [.] vfprintf                   
     0.59%  [kernel]          [k] find_busiest_group         

time: 1341306595
   PerfTop:    1451 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     9.75%  cc1               [.] 0x0000000000210978         
     8.81%  [unknown]         [.] 0x00007f84a7766b99         
     4.62%  [kernel]          [k] page_fault                 
     3.61%  [kernel]          [k] _raw_spin_lock             
     2.67%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     2.00%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [xfs]             [k] _xfs_buf_find              
     1.31%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.16%  [kernel]          [k] kmem_cache_free            
     1.15%  [xfs]             [k] xfs_next_bit               
     0.98%  [kernel]          [k] __d_lookup                 
     0.89%  [xfs]             [k] xfs_da_do_buf              
     0.83%  [kernel]          [k] memset                     
     0.80%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.79%  [kernel]          [k] link_path_walk             
     0.76%  [xfs]             [k] xfs_buf_item_size          
no symbols found in /usr/bin/tee, maybe install a debug package?
no symbols found in /bin/date, maybe install a debug package?
     0.73%  [xfs]             [k] xfs_buf_offset             
     0.71%  [kernel]          [k] __kmalloc                  
     0.70%  [kernel]          [k] kfree                      
     0.70%  libbfd-2.21.so    [.] 0x00000000000b9cdd         

time: 1341306601
   PerfTop:    1267 irqs/sec  kernel:85.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.81%  [unknown]         [.] 0x00007f84a7766b99         
     5.98%  cc1               [.] 0x0000000000210978         
     5.20%  [kernel]          [k] page_fault                 
     3.54%  [kernel]          [k] _raw_spin_lock             
     3.37%  [kernel]          [k] memcpy                     
     2.03%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.75%  [xfs]             [k] _xfs_buf_find              
     1.35%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.28%  [xfs]             [k] xfs_next_bit               
     1.14%  [kernel]          [k] kmem_cache_free            
     1.13%  [kernel]          [k] __kmalloc                  
     1.12%  [kernel]          [k] __d_lookup                 
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.96%  [xfs]             [k] xfs_buf_offset             
     0.95%  [kernel]          [k] memset                     
     0.91%  [kernel]          [k] link_path_walk             
     0.88%  [xfs]             [k] xfs_da_do_buf              
     0.85%  [kernel]          [k] kfree                      
     0.84%  [xfs]             [k] xfs_buf_item_size          
     0.74%  [xfs]             [k] xfs_btree_lookup           

time: 1341306601
   PerfTop:    1487 irqs/sec  kernel:85.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.84%  [unknown]         [.] 0x00007f84a7766b99         
     5.15%  [kernel]          [k] page_fault                 
     3.93%  cc1               [.] 0x0000000000210978         
     3.76%  [kernel]          [k] _raw_spin_lock             
     3.50%  [kernel]          [k] memcpy                     
     2.13%  [kernel]          [k] kmem_cache_alloc           
     1.91%  [xfs]             [k] _xfs_buf_find              
     1.79%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.52%  [kernel]          [k] __kmalloc                  
     1.33%  [kernel]          [k] kmem_cache_free            
     1.32%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.29%  [kernel]          [k] __d_lookup                 
     1.27%  [xfs]             [k] xfs_next_bit               
     1.11%  [kernel]          [k] link_path_walk             
     1.01%  [xfs]             [k] xfs_buf_offset             
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_da_do_buf              
     0.97%  [kernel]          [k] kfree                      
     0.96%  [kernel]          [k] memset                     
     0.84%  [xfs]             [k] xfs_btree_lookup           
     0.82%  [xfs]             [k] xfs_buf_item_format        

time: 1341306601
   PerfTop:    1291 irqs/sec  kernel:85.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.21%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] page_fault                 
     3.83%  [kernel]          [k] _raw_spin_lock             
     3.67%  [kernel]          [k] memcpy                     
     2.92%  cc1               [.] 0x0000000000210978         
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] _xfs_buf_find              
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.43%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_next_bit               
     1.13%  [xfs]             [k] xfs_buf_offset             
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_da_do_buf              
     1.01%  [kernel]          [k] memset                     
     1.01%  [kernel]          [k] kfree                      
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.87%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1435 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.06%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     3.88%  [kernel]          [k] _raw_spin_lock             
     3.83%  [kernel]          [k] memcpy                     
     2.41%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     2.19%  cc1               [.] 0x0000000000210978         
     1.68%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.55%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] __d_lookup                 
     1.43%  [kernel]          [k] kmem_cache_free            
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.37%  [xfs]             [k] xfs_next_bit               
     1.27%  [xfs]             [k] xfs_buf_offset             
     1.12%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] kfree                      
     1.08%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_da_do_buf              
     0.99%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_buf_item_size          
     0.89%  [xfs]             [k] xfs_btree_lookup           

time: 1341306607
   PerfTop:    1281 irqs/sec  kernel:87.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.44%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] _raw_spin_lock             
     3.94%  [kernel]          [k] memcpy                     
     2.51%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.75%  cc1               [.] 0x0000000000210978         
     1.66%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [kernel]          [k] __d_lookup                 
     1.56%  [kernel]          [k] __kmalloc                  
     1.46%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] kmem_cache_free            
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.20%  [kernel]          [k] link_path_walk             
     1.16%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.89%  [xfs]             [k] xfs_buf_item_size          

time: 1341306607
   PerfTop:    1455 irqs/sec  kernel:86.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.14%  [unknown]         [.] 0x00007f84a7766b99         
     5.36%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] _raw_spin_lock             
     4.02%  [kernel]          [k] memcpy                     
     2.54%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.69%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.56%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.47%  [kernel]          [k] __d_lookup                 
     1.42%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.39%  cc1               [.] 0x0000000000210978         
     1.37%  [kernel]          [k] kmem_cache_free            
     1.24%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] memset                     
     1.16%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_btree_lookup           

time: 1341306613
   PerfTop:    1245 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] page_fault                 
     4.10%  [kernel]          [k] _raw_spin_lock             
     4.06%  [kernel]          [k] memcpy                     
     2.74%  [xfs]             [k] _xfs_buf_find              
     2.40%  [kernel]          [k] kmem_cache_alloc           
     1.64%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_next_bit               
     1.54%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] __d_lookup                 
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.41%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.22%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.15%  cc1               [.] 0x0000000000210978         
     1.02%  [xfs]             [k] xfs_buf_item_size          
     1.00%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.92%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1433 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    12.04%  [unknown]         [.] 0x00007f84a7766b99         
     5.30%  [kernel]          [k] page_fault                 
     4.08%  [kernel]          [k] memcpy                     
     4.07%  [kernel]          [k] _raw_spin_lock             
     2.88%  [xfs]             [k] _xfs_buf_find              
     2.50%  [kernel]          [k] kmem_cache_alloc           
     1.72%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [xfs]             [k] xfs_next_bit               
     1.56%  [kernel]          [k] __d_lookup                 
     1.54%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.46%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.25%  [kernel]          [k] link_path_walk             
     1.21%  [kernel]          [k] memset                     
     1.18%  [kernel]          [k] kfree                      
     1.04%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  cc1               [.] 0x0000000000210978         
     0.90%  [xfs]             [k] xfs_da_do_buf              

time: 1341306613
   PerfTop:    1118 irqs/sec  kernel:87.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

no symbols found in /usr/bin/vmstat, maybe install a debug package?
    12.03%  [unknown]         [.] 0x00007f84a7766b99         
     5.48%  [kernel]          [k] page_fault                 
     4.21%  [kernel]          [k] memcpy                     
     4.11%  [kernel]          [k] _raw_spin_lock             
     2.98%  [xfs]             [k] _xfs_buf_find              
     2.47%  [kernel]          [k] kmem_cache_alloc           
     1.81%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.48%  [kernel]          [k] kmem_cache_free            
     1.48%  [kernel]          [k] __d_lookup                 
     1.47%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.39%  [xfs]             [k] xfs_buf_offset             
     1.23%  [kernel]          [k] link_path_walk             
     1.19%  [kernel]          [k] memset                     
     1.19%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_buf_item_format        
     0.91%  [xfs]             [k] xfs_da_do_buf              

time: 1341306617
   PerfTop:    1454 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.42%  [kernel]          [k] page_fault                 
     4.28%  [kernel]          [k] memcpy                     
     4.20%  [kernel]          [k] _raw_spin_lock             
     3.15%  [xfs]             [k] _xfs_buf_find              
     2.52%  [kernel]          [k] kmem_cache_alloc           
     1.76%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.72%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.51%  [kernel]          [k] __kmalloc                  
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.48%  [kernel]          [k] kmem_cache_free            
     1.40%  [xfs]             [k] xfs_buf_offset             
     1.29%  [kernel]          [k] memset                     
     1.20%  [kernel]          [k] link_path_walk             
     1.17%  [kernel]          [k] kfree                      
     1.09%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_buf_item_size          
     0.95%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              
     0.91%  [xfs]             [k] xfs_buf_item_format        

time: 1341306617
   PerfTop:    1758 irqs/sec  kernel:90.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.99%  [unknown]         [.] 0x00007f84a7766b99         
     5.40%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.04%  [kernel]          [k] memcpy                     
     3.86%  [xfs]             [k] _xfs_buf_find              
     2.31%  [kernel]          [k] kmem_cache_alloc           
     2.03%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [kernel]          [k] __d_lookup                 
     1.60%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.35%  [kernel]          [k] kmem_cache_free            
     1.17%  [kernel]          [k] kfree                      
     1.16%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_perag_put              
     0.90%  [xfs]             [k] xfs_buf_item_size          
     0.84%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1022 irqs/sec  kernel:88.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.93%  [unknown]         [.] 0x00007f84a7766b99         
     5.34%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.01%  [kernel]          [k] memcpy                     
     4.01%  [xfs]             [k] _xfs_buf_find              
     2.28%  [kernel]          [k] kmem_cache_alloc           
     2.00%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.60%  [xfs]             [k] xfs_next_bit               
     1.59%  [kernel]          [k] __d_lookup                 
     1.41%  [kernel]          [k] __kmalloc                  
     1.39%  [kernel]          [k] kmem_cache_free            
     1.35%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] kfree                      
     1.15%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] link_path_walk             
     0.98%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_perag_put              
     0.88%  [xfs]             [k] xfs_buf_item_size          
     0.86%  [xfs]             [k] xfs_da_do_buf              

time: 1341306623
   PerfTop:    1430 irqs/sec  kernel:87.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.05%  [unknown]         [.] 0x00007f84a7766b99         
     5.24%  [kernel]          [k] _raw_spin_lock             
     4.89%  [kernel]          [k] page_fault                 
     4.13%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.35%  [kernel]          [k] kmem_cache_alloc           
     1.95%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.54%  [xfs]             [k] xfs_next_bit               
     1.42%  [kernel]          [k] __kmalloc                  
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.10%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.91%  [xfs]             [k] xfs_buf_item_size          
     0.87%  [xfs]             [k] xfs_da_do_buf              
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306623
   PerfTop:    1267 irqs/sec  kernel:87.1%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.08%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.12%  [kernel]          [k] memcpy                     
     3.96%  [xfs]             [k] _xfs_buf_find              
     2.41%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.61%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.44%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.31%  [xfs]             [k] xfs_buf_offset             
no symbols found in /usr/bin/iostat, maybe install a debug package?
     1.16%  [kernel]          [k] memset                     
     1.11%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [xfs]             [k] xfs_btree_lookup           
     0.95%  [xfs]             [k] xfs_buf_item_size          
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306629
   PerfTop:    1399 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.12%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.23%  [kernel]          [k] memcpy                     
     4.03%  [xfs]             [k] _xfs_buf_find              
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.96%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.69%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.63%  [kernel]          [k] __d_lookup                 
     1.50%  [xfs]             [k] xfs_next_bit               
     1.45%  [kernel]          [k] __kmalloc                  
     1.35%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.17%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.07%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.07%  [kernel]          [k] link_path_walk             
     1.04%  [xfs]             [k] xfs_btree_lookup           
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_buf_item_format        

time: 1341306629
   PerfTop:    1225 irqs/sec  kernel:87.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.15%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.85%  [kernel]          [k] page_fault                 
     4.22%  [kernel]          [k] memcpy                     
     4.19%  [xfs]             [k] _xfs_buf_find              
     2.32%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.71%  [kernel]          [k] __d_lookup                 
     1.68%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.54%  [xfs]             [k] xfs_next_bit               
     1.51%  [kernel]          [k] __kmalloc                  
     1.36%  [kernel]          [k] kmem_cache_free            
     1.28%  [xfs]             [k] xfs_buf_offset             
     1.14%  [kernel]          [k] memset                     
     1.09%  [kernel]          [k] kfree                      
     1.06%  [kernel]          [k] link_path_walk             
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     

time: 1341306629
   PerfTop:    1400 irqs/sec  kernel:87.4%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.23%  [unknown]         [.] 0x00007f84a7766b99         
     5.07%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.27%  [xfs]             [k] _xfs_buf_find              
     4.18%  [kernel]          [k] memcpy                     
     2.31%  [kernel]          [k] kmem_cache_alloc           
     1.94%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.73%  [kernel]          [k] __d_lookup                 
     1.66%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.49%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.40%  [kernel]          [k] kmem_cache_free            
     1.29%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] kfree                      
     1.07%  [kernel]          [k] memset                     
     1.07%  [kernel]          [k] link_path_walk             
     1.05%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.97%  [xfs]             [k] xfs_btree_lookup           
     0.93%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     

time: 1341306635
   PerfTop:    1251 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.20%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.29%  [xfs]             [k] _xfs_buf_find              
     4.19%  [kernel]          [k] memcpy                     
     2.26%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.64%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.41%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.10%  [kernel]          [k] link_path_walk             
     1.09%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.99%  [xfs]             [k] xfs_buf_item_size          
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.96%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1429 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.18%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.82%  [kernel]          [k] page_fault                 
     4.28%  [xfs]             [k] _xfs_buf_find              
     4.21%  [kernel]          [k] memcpy                     
     2.23%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] __d_lookup                 
     1.67%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.52%  [kernel]          [k] __kmalloc                  
     1.52%  [xfs]             [k] xfs_next_bit               
     1.36%  [kernel]          [k] kmem_cache_free            
     1.34%  [xfs]             [k] xfs_buf_offset             
     1.11%  [kernel]          [k] link_path_walk             
     1.11%  [kernel]          [k] memset                     
     1.08%  [kernel]          [k] kfree                      
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.03%  [xfs]             [k] xfs_dir2_node_addname_int  
     1.01%  [kernel]          [k] s_show                     
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306635
   PerfTop:    1232 irqs/sec  kernel:88.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.11%  [unknown]         [.] 0x00007f84a7766b99         
     5.13%  [kernel]          [k] _raw_spin_lock             
     4.87%  [kernel]          [k] page_fault                 
     4.33%  [xfs]             [k] _xfs_buf_find              
     4.16%  [kernel]          [k] memcpy                     
     2.24%  [kernel]          [k] kmem_cache_alloc           
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.82%  [kernel]          [k] __d_lookup                 
     1.65%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.49%  [kernel]          [k] __kmalloc                  
     1.34%  [kernel]          [k] kmem_cache_free            
     1.32%  [xfs]             [k] xfs_buf_offset             
     1.13%  [kernel]          [k] link_path_walk             
     1.13%  [kernel]          [k] kfree                      
     1.11%  [kernel]          [k] memset                     
     1.06%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_size          
     1.02%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.98%  [xfs]             [k] xfs_btree_lookup           
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1444 irqs/sec  kernel:87.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.19%  [unknown]         [.] 0x00007f84a7766b99         
     5.10%  [kernel]          [k] _raw_spin_lock             
     4.95%  [kernel]          [k] page_fault                 
     4.40%  [xfs]             [k] _xfs_buf_find              
     4.10%  [kernel]          [k] memcpy                     
     2.20%  [kernel]          [k] kmem_cache_alloc           
     1.93%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.81%  [kernel]          [k] __d_lookup                 
     1.59%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.50%  [xfs]             [k] xfs_next_bit               
     1.48%  [kernel]          [k] __kmalloc                  
     1.37%  [kernel]          [k] kmem_cache_free            
     1.36%  [xfs]             [k] xfs_buf_offset             
     1.15%  [kernel]          [k] memset                     
     1.12%  [kernel]          [k] s_show                     
     1.12%  [kernel]          [k] link_path_walk             
     1.10%  [kernel]          [k] kfree                      
     1.02%  [xfs]             [k] xfs_buf_item_size          
     0.99%  [xfs]             [k] xfs_btree_lookup           
     0.97%  [xfs]             [k] xfs_dir2_node_addname_int  
     0.94%  [xfs]             [k] xfs_da_do_buf              

time: 1341306639
   PerfTop:    1195 irqs/sec  kernel:90.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    10.00%  [unknown]         [.] 0x00007f84a7766b99         
     5.17%  [kernel]          [k] _raw_spin_lock             
     4.44%  [xfs]             [k] _xfs_buf_find              
     4.37%  [kernel]          [k] page_fault                 
     4.37%  [kernel]          [k] memcpy                     
     2.30%  [kernel]          [k] kmem_cache_alloc           
     1.90%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.63%  [kernel]          [k] __d_lookup                 
     1.62%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.59%  [xfs]             [k] xfs_buf_offset             
     1.50%  [kernel]          [k] kmem_cache_free            
     1.50%  [kernel]          [k] __kmalloc                  
     1.49%  [xfs]             [k] xfs_next_bit               
     1.33%  [kernel]          [k] memset                     
     1.28%  [kernel]          [k] kfree                      
     1.11%  [xfs]             [k] xfs_buf_item_size          
     1.07%  [kernel]          [k] s_show                     
     1.07%  [kernel]          [k] link_path_walk             
     0.93%  [xfs]             [k] xfs_btree_lookup           
     0.90%  [xfs]             [k] xfs_da_do_buf              
     0.84%  [xfs]             [k] xfs_dir2_node_addname_int  

time: 1341306645
   PerfTop:    1097 irqs/sec  kernel:95.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     7.51%  [unknown]         [.] 0x00007f84a7766b99         
     5.02%  [kernel]          [k] _raw_spin_lock             
     4.63%  [kernel]          [k] memcpy                     
     4.51%  [xfs]             [k] _xfs_buf_find              
     3.32%  [kernel]          [k] page_fault                 
     2.37%  [kernel]          [k] kmem_cache_alloc           
     1.87%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.77%  [xfs]             [k] xfs_buf_offset             
     1.75%  [kernel]          [k] __kmalloc                  
     1.73%  [xfs]             [k] xfs_next_bit               
     1.65%  [kernel]          [k] memset                     
     1.60%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]          [k] kfree                      
     1.52%  [kernel]          [k] kmem_cache_free            
     1.44%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.26%  [kernel]          [k] __d_lookup                 
     1.26%  [xfs]             [k] xfs_buf_item_size          
     1.05%  [kernel]          [k] s_show                     
     1.03%  [xfs]             [k] xfs_buf_item_format        
     0.92%  [kernel]          [k] __d_lookup_rcu             
     0.87%  [kernel]          [k] link_path_walk             

time: 1341306645
   PerfTop:    1038 irqs/sec  kernel:95.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.89%  [unknown]         [.] 0x00007f84a7766b99         
     5.18%  [kernel]          [k] memcpy                     
     4.60%  [xfs]             [k] _xfs_buf_find              
     4.52%  [kernel]          [k] _raw_spin_lock             
     2.60%  [kernel]          [k] page_fault                 
     2.42%  [kernel]          [k] kmem_cache_alloc           
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [xfs]             [k] xfs_next_bit               
     1.93%  [xfs]             [k] xfs_buf_offset             
     1.84%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.83%  [kernel]          [k] memset                     
     1.80%  [kernel]          [k] kmem_cache_free            
     1.68%  [kernel]          [k] kfree                      
     1.47%  [xfs]             [k] xfs_buf_item_size          
     1.45%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.18%  [kernel]          [k] __d_lookup_rcu             
     1.13%  [xfs]             [k] xfs_trans_ail_cursor_first 
     1.12%  [xfs]             [k] xfs_buf_item_format        
     1.04%  [kernel]          [k] s_show                     
     1.01%  [kernel]          [k] __d_lookup                 
     0.93%  [xfs]             [k] xfs_da_do_buf              

time: 1341306645
   PerfTop:    1087 irqs/sec  kernel:96.0%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.27%  [kernel]          [k] memcpy                     
     4.77%  [unknown]         [.] 0x00007f84a7766b99         
     4.69%  [xfs]             [k] _xfs_buf_find              
     4.56%  [kernel]          [k] _raw_spin_lock             
     2.47%  [kernel]          [k] kmem_cache_alloc           
     2.18%  [xfs]             [k] xfs_next_bit               
     2.11%  [kernel]          [k] page_fault                 
     2.00%  [xfs]             [k] xfs_buf_offset             
     1.99%  [kernel]          [k] __kmalloc                  
     1.96%  [kernel]          [k] kmem_cache_free            
     1.85%  [kernel]          [k] kfree                      
     1.82%  [kernel]          [k] memset                     
     1.75%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.58%  [xfs]             [k] xfs_buf_item_size          
     1.41%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.23%  [kernel]          [k] __d_lookup_rcu             
     1.21%  [xfs]             [k] xfs_buf_item_format        
     0.99%  [kernel]          [k] s_show                     
     0.97%  [xfs]             [k] xfs_perag_put              
     0.92%  [xfs]             [k] xfs_da_do_buf              
     0.92%  [xfs]             [k] xfs_trans_ail_cursor_first 

time: 1341306651
   PerfTop:    1157 irqs/sec  kernel:96.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.53%  [kernel]          [k] memcpy                     
     4.61%  [xfs]             [k] _xfs_buf_find              
     4.40%  [kernel]          [k] _raw_spin_lock             
     3.83%  [unknown]         [.] 0x00007f84a7766b99         
     2.67%  [kernel]          [k] kmem_cache_alloc           
     2.32%  [xfs]             [k] xfs_next_bit               
     2.21%  [kernel]          [k] __kmalloc                  
     2.21%  [xfs]             [k] xfs_buf_offset             
     2.19%  [kernel]          [k] kmem_cache_free            
     1.92%  [kernel]          [k] memset                     
     1.89%  [kernel]          [k] kfree                      
     1.80%  [xfs]             [k] xfs_buf_item_size          
     1.70%  [kernel]          [k] page_fault                 
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.50%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.30%  [xfs]             [k] xfs_buf_item_format        
     1.27%  [kernel]          [k] __d_lookup_rcu             
     0.97%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.93%  [kernel]          [k] s_show                     
     0.93%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:    1073 irqs/sec  kernel:95.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.76%  [kernel]          [k] memcpy                     
     4.54%  [xfs]             [k] _xfs_buf_find              
     4.32%  [kernel]          [k] _raw_spin_lock             
     3.15%  [unknown]         [.] 0x00007f84a7766b99         
     2.77%  [kernel]          [k] kmem_cache_alloc           
     2.49%  [xfs]             [k] xfs_next_bit               
     2.36%  [kernel]          [k] __kmalloc                  
     2.27%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.88%  [kernel]          [k] memset                     
     1.88%  [kernel]          [k] kfree                      
     1.77%  [xfs]             [k] xfs_buf_item_size          
     1.62%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.42%  [xfs]             [k] xfs_buf_item_format        
     1.40%  [kernel]          [k] page_fault                 
     1.39%  [kernel]          [k] __d_lookup_rcu             
     0.99%  [xfs]             [k] xfs_da_do_buf              
     0.96%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.88%  [kernel]          [k] s_show                     
     0.87%  [xfs]             [k] xfs_perag_put              

time: 1341306651
   PerfTop:     492 irqs/sec  kernel:85.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.74%  [kernel]          [k] memcpy                     
     4.48%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     3.00%  [unknown]         [.] 0x00007f84a7766b99         
     2.76%  [kernel]          [k] kmem_cache_alloc           
     2.54%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.44%  [xfs]             [k] xfs_buf_item_format        
     1.39%  [kernel]          [k] __d_lookup_rcu             
     1.36%  [kernel]          [k] page_fault                 
     0.99%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.86%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      70 irqs/sec  kernel:72.9%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.73%  [kernel]          [k] memcpy                     
     4.47%  [xfs]             [k] _xfs_buf_find              
     4.27%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.75%  [kernel]          [k] kmem_cache_alloc           
     2.53%  [xfs]             [k] xfs_next_bit               
     2.39%  [kernel]          [k] __kmalloc                  
     2.30%  [kernel]          [k] kmem_cache_free            
     2.20%  [xfs]             [k] xfs_buf_offset             
     1.96%  [kernel]          [k] kfree                      
     1.92%  [kernel]          [k] memset                     
     1.75%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.49%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] __d_lookup_rcu             
     1.37%  [kernel]          [k] page_fault                 
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.89%  [kernel]          [k] s_show                     
     0.85%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      87 irqs/sec  kernel:71.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.72%  [kernel]          [k] memcpy                     
     4.45%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.99%  [unknown]         [.] 0x00007f84a7766b99         
     2.74%  [kernel]          [k] kmem_cache_alloc           
     2.52%  [xfs]             [k] xfs_next_bit               
     2.38%  [kernel]          [k] __kmalloc                  
     2.29%  [kernel]          [k] kmem_cache_free            
     2.19%  [xfs]             [k] xfs_buf_offset             
     1.95%  [kernel]          [k] kfree                      
     1.91%  [kernel]          [k] memset                     
     1.74%  [xfs]             [k] xfs_buf_item_size          
     1.56%  [kernel]          [k] _raw_spin_lock_irqsave     
     1.48%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.43%  [xfs]             [k] xfs_buf_item_format        
     1.38%  [kernel]          [k] page_fault                 
     1.38%  [kernel]          [k] __d_lookup_rcu             
     0.98%  [xfs]             [k] xlog_cil_prepare_log_vecs  
     0.96%  [xfs]             [k] xfs_da_do_buf              
     0.93%  [kernel]          [k] s_show                     
     0.84%  [xfs]             [k] xfs_perag_put              

time: 1341306657
   PerfTop:      88 irqs/sec  kernel:68.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.69%  [kernel]          [k] memcpy                     
     4.42%  [xfs]             [k] _xfs_buf_find              
     4.25%  [kernel]          [k] _raw_spin_lock             
     2.98%  [unknown]         [.] 0x00007f84a7766b99         

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 10:59                 ` Mel Gorman
@ 2012-07-03 11:44                   ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 11:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> > Can you run latencytop to see
> > if there is excessive starvation/wait times for allocation
> > completion?
> 
> I'm not sure what format you are looking for.  latencytop is shit for
> capturing information throughout a test and it does not easily allow you to
> record a snapshot of a test. You can record all the console output of course
> but that's a complete mess. I tried capturing /proc/latency_stats over time
> instead because that can be trivially sorted on a system-wide basis but
> as I write this I find that latency_stats was bust. It was just spitting out
> 
> Latency Top version : v0.1
> 
> and nothing else.  Either latency_stats is broken or my config is. Not sure
> which it is right now and won't get enough time on this today to pinpoint it.
> 

PEBKAC. The script that monitored /proc/latency_stats was not enabling
latencytop via /proc/sys/kernel/latencytop.
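
A minimal sketch of what such a monitor needs to do, assuming a kernel
built with CONFIG_LATENCYTOP=y; the snapshot naming and interval here are
illustrative, not the actual MMTests monitor:

    #!/bin/bash
    # /proc/latency_stats reports nothing beyond its version header
    # until latencytop accounting is switched on.
    echo 1 > /proc/sys/kernel/latencytop

    # Take periodic system-wide snapshots that can be sorted later.
    while true; do
        cat /proc/latency_stats > latency_stats.$(date +%s)
        sleep 10
    done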

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 10:59                 ` Mel Gorman
@ 2012-07-03 12:31                   ` Daniel Vetter
  0 siblings, 0 replies; 108+ messages in thread
From: Daniel Vetter @ 2012-07-03 12:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, Christoph Hellwig, linux-mm, linux-kernel,
	linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
	Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > > 
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > you have lots of CPU power, there will be little difference in
> > > > > > performance...
> > > > > 
> > > > > When I checked it, it could only be called twice, and we'd already
> > > > > optimize away the second call.  I'd definitely like to track down where
> > > > > the performance changes happened, at least to a major version but even
> > > > > better to a -rc or git commit.
> > > > > 
> > > > 
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > > 
> > > > It's rare but on this occasion the test machine is idle so I started an
> > > > automated git bisection. As you know the mileage with an automated bisect
> > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > apparently unrelated patch that caused the problem.
> > > > 
> > > 
> > > It was obvious very quickly that there were two distinct regressions so I
> > > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > > that enables RC6 to reduce power usage.
> > > 
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > 
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
> > 
> 
> The fact it has an impact at all is weird but lets see what the DRI
> folks think about it.

Well, presuming I understand things correctly, the cpu die only goes into
the lowest sleep state (which IIRC switches off the L3 caches and
interconnects) when both the cpu and gpu are in the lowest sleep state.
rc6 is that deep-sleep state for the gpu, so without that enabled your
system won't go into these deep-sleep states.
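
A sketch of how RC6 could be ruled in or out here, assuming the 3.4-era
i915_enable_rc6 module parameter (0 disables, 1 enables; later kernels
changed its semantics):

    # Rerun the benchmark with RC6 forced off, either by appending
    #     i915.i915_enable_rc6=0
    # to the kernel command line, or by reloading the module if
    # nothing is holding the device open:
    modprobe -r i915
    modprobe i915 i915_enable_rc6=0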

I guess the slight changes in wakeup latency, power consumption (cuts
about 10W on an idle desktop snb with resulting big effect on what turbo
boost can sustain for short amounts of time) and all the follow-on effects
are good enough to massively change timing-critical things.

So this having an effect isn't too weird.

Obviously, if you also have X running while doing these tests there's the
chance that the gpu dies because of an issue when waking up from rc6
(we've seen a few of these), but if no drm client is up, that shouldn't
be possible. So please retest without X running if that hasn't been done
already.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-02 19:35             ` Mel Gorman
@ 2012-07-03 13:04               ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 13:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > <SNIP>
> >
> It was obvious very quickly that there were two distinct regressions so I
> ran two bisections. One led to an XFS patch and the other to an i915 patch
> that enables RC6 to reduce power usage.
> 
> [c999a223: xfs: introduce an allocation workqueue]
> [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> 
> gdm was running on the machine so i915 would have been in use. 

Bah, more PEBKAC. gdm was *not* running on this machine. i915 is loaded
but X is not.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 12:31                   ` Daniel Vetter
@ 2012-07-03 13:08                     ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 13:08 UTC (permalink / raw)
  To: Dave Chinner, Christoph Hellwig, linux-mm, linux-kernel,
	linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
	Chris Wilson

On Tue, Jul 03, 2012 at 02:31:19PM +0200, Daniel Vetter wrote:
> On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> > On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > > Adding dri-devel and a few others because an i915 patch contributed to
> > > > the regression.
> > > > 
> > > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > > you have lots of CPU power, there will be little difference in
> > > > > > > performance...
> > > > > > 
> > > > > > When I checked it, it could only be called twice, and we'd already
> > > > > > optimize away the second call.  I'd definitely like to track down where
> > > > > > the performance changes happened, at least to a major version but even
> > > > > > better to a -rc or git commit.
> > > > > > 
> > > > > 
> > > > > By all means feel free to run the test yourself and run the bisection :)
> > > > > 
> > > > > It's rare but on this occasion the test machine is idle so I started an
> > > > > automated git bisection. As you know the mileage with an automated bisect
> > > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > > apparently unrelated patch that caused the problem.
> > > > > 
> > > > 
> > > > It was obvious very quickly that there were two distinct regressions so I
> > > > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > > > that enables RC6 to reduce power usage.
> > > > 
> > > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > > 
> > > Doesn't seem to be the major cause of the regression. By itself, it
> > > has impact, but the majority comes from the XFS change...
> > > 
> > 
> > The fact it has an impact at all is weird but lets see what the DRI
> > folks think about it.
> 
> Well, presuming I understand things correctly, the cpu die only goes into
> the lowest sleep state (which IIRC switches off the L3 caches and
> interconnects) when both the cpu and gpu are in the lowest sleep state.

I made a mistake in my previous mail. gdm and X were *not* running.
Once the screen blanked, I would guess the GPU is in a low sleep state
the majority of the time.

> rc6 is that deep-sleep state for the gpu, so without that enabled your
> system won't go into these deep-sleep states.
> 
> I guess the slight changes in wakeup latency, power consumption (cuts
> about 10W on an idle desktop snb with resulting big effect on what turbo
> boost can sustain for short amounts of time) and all the follow-on effects
> are good enough to massively change timing-critical things.
> 

Maybe. How aggressively is the lowest sleep state entered and how long
does it take to exit?
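
One way to quantify that, assuming turbostat from the kernel tree
(tools/power/x86/turbostat) and the msr module are available; the
interval below is illustrative:

    # Watch package C-state residency (%pc2/%pc3/%pc6/%pc7) during the
    # test; deep states dominating the idle periods would suggest the
    # package enters them aggressively.
    modprobe msr
    turbostat -i 5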

> So this having an effect isn't too weird.
> 
> Obviously, if you also have X running while doing these tests there's the
> chance that the gpu dies because of an issue when waking up from rc6
> (we've seen a few of these), but if no drm client is up, that shouldn't
> be possible. So please retest without X running if that hasn't been done
> already.
> 

Again, sorry for the confusion but the posted results are without X running.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 12:31                   ` Daniel Vetter
@ 2012-07-03 13:28                     ` Eugeni Dodonov
  0 siblings, 0 replies; 108+ messages in thread
From: Eugeni Dodonov @ 2012-07-03 13:28 UTC (permalink / raw)
  To: Mel Gorman, Dave Chinner, Christoph Hellwig, linux-mm,
	linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard,
	Eugeni Dodonov, Chris Wilson
  Cc: Daniel Vetter

On Tue, Jul 3, 2012 at 9:31 AM, Daniel Vetter <daniel@ffwll.ch> wrote:

> Well, presuming I understand things correctly, the cpu die only goes into
> the lowest sleep state (which IIRC switches off the L3 caches and
> interconnects) when both the cpu and gpu are in the lowest sleep state.
> rc6 is that deep-sleep state for the gpu, so without that enabled your
> system won't go into these deep-sleep states.
>
> I guess the slight changes in wakeup latency, power consumption (cuts
> about 10W on an idle desktop snb with resulting big effect on what turbo
> boost can sustain for short amounts of time) and all the follow-on effects
> are good enough to massively change timing-critical things.
>

The sad side effect is that the software has very little control over RC6
entry and exit; the hardware enters and leaves the RC6 state on its own
when it detects that the GPU has been idle beyond a threshold. Chances are
that if you are not running any GPU workload, the GPU simply enters RC6
state and stays there.

It is possible to observe the current state, and the time spent in RC6, by
looking at the /sys/kernel/debug/dri/0/i915_drpc_info file.
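
For example (a sketch, assuming debugfs is mounted in the usual place):

    # Mount debugfs if necessary, then inspect the RC6 state and
    # residency counters exposed by the i915 driver.
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    cat /sys/kernel/debug/dri/0/i915_drpc_info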

One other effect of RC6 is that it also allows the CPU to go into higher
turbo modes, as it has more watts to spend while the GPU is idle; perhaps
this is what causes the issue here?

-- 
Eugeni Dodonov
<http://eugeni.dodonov.net/>

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 13:04               ` Mel Gorman
  (?)
@ 2012-07-03 14:04                 ` Daniel Vetter
  -1 siblings, 0 replies; 108+ messages in thread
From: Daniel Vetter @ 2012-07-03 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Dave Chinner, linux-mm, linux-kernel,
	linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
	Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 02:04:14PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > <SNIP>
> > >
> > It was obvious very quickly that there were two distinct regressions so I
> > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > that enables RC6 to reduce power usage.
> > 
> > [c999a223: xfs: introduce an allocation workqueue]
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > 
> > gdm was running on the machine so i915 would have been in use. 
> 
> Bah, more PEBKAC. gdm was *not* running on this machine. i915 is loaded
> but X is not.

See my little explanation of RC6: just loading the driver will have
effects. But I'm happy to know that the issue also happens without using
it; that makes it really unlikely it's an issue with the GPU or i915.ko ;-)
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-03 10:59                 ` Mel Gorman
  (?)
@ 2012-07-04  0:47                   ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2012-07-04  0:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > > 
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > you have lots of CPU power, there will be little difference in
> > > > > > performance...
> > > > > 
> > > > > When I checked it, it could only be called twice, and we'd already
> > > > > optimize away the second call.  I'd definitely like to track down where
> > > > > the performance changes happened, at least to a major version but even
> > > > > better to a -rc or git commit.
> > > > > 
> > > > 
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > > 
> > > > It's rare but on this occasion the test machine is idle so I started an
> > > > automated git bisection. As you know the milage with an automated bisect
> > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > apparently unrelated patch that caused the problem.
> > > > 
> > > 
> > > It was obvious very quickly that there were two distinct regressions so I
> > > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > > that enables RC6 to reduce power usage.
> > > 
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > 
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
> > 
> 
> The fact it has an impact at all is weird but lets see what the DRI
> folks think about it.
> 
> > > [c999a223: xfs: introduce an allocation workqueue]
> > 
> > Which indicates that there are workqueue scheduling issues, I think.
> > The same amount of work is being done, but half of it is being
> > pushed off into a workqueue to avoid stack overflow issues (*).  I
> > tested the above patch in anger on an 8p machine, similar to the
> > machine you saw no regressions on, but the workload didn't drive it
> > to being completely CPU bound (only about 90%) so the allocation
> > work was probably always scheduled quickly.
> 
> What test were you using?

fsmark, dbench, compilebench, and a few fio workloads. Also,
xfstests times each test and I keep track of overall runtime, and
none of those showed any performance differential, either...

Indeed, running on a current 3.5-rc5 tree, my usual fsmark
benchmarks are running at the same numbers I've been seeing since
about 3.0 - somewhere around 18k files/s for a single thread, and
110-115k files/s for 8 threads.

I just ran your variant, and I'm getting about 20k files/s for a
single thread, which is about right because you're using smaller
directories than I am (22500 files per dir vs 100k in my tests).

> > How many worker threads have been spawned on these machines
> > that are showing the regression?
> 
> 20 or 21 generally. An example list as spotted by top looks like

Pretty normal.

> > What is the context switch rate on the machines when the test is running?
.....
> Vanilla average context switch rate	4278.53
> Revert average context switch rate	1095

That seems about right, too.

> > Can you run latencytop to see
> > if there is excessive starvation/wait times for allocation
> > completion?
> 
> I'm not sure what format you are looking for.

Where the context switches are coming from, and how long they are
being stalled for. Just to get the context switch locations, you
can use perf on the sched:sched_switch event, but that doesn't give
you stall times. Local testing tells me that about 40% of the
switches are from xfs_alloc_vextent, 55% are from the work threads,
and the rest are CPU idling events, which is exactly as I'd expect.
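
e.g. something like the following (a rough sketch - tune the duration to
taste) will capture the switch events with callchains so you can see
where they are coming from:

  # record all sched_switch events system-wide for 10 seconds
  perf record -e sched:sched_switch -a -g sleep 10
  # then break the events down by call site
  perf report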

> > A perf top profile comparison might be informative,
> > too...
> > 
> 
> I'm not sure if this is what you really wanted. I thought an oprofile or
> perf report would have made more sense but I recorded perf top over time
> anyway and it's at the end of the mail.

perf report and oprofile give you CPU usage across the run; it's not
instantaneous and that's where all the interesting information is.
e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or
it might be 100% for 1s - that's the behaviour run profiles cannot
give you insight into....

As it is, the output you posted is nothing unusual.

> For just these XFS tests I've uploaded a tarball of the logs to
> http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Ok, so the main thing I missed when first looking at this is that
you are concerned about single thread regressions. Well, I can't
reproduce your results here. Single threaded with or without the
workqueue based allocation gives me roughly 20k +/-0.5k files/s on
a single disk, a 12 disk RAID0 array and a RAM disk on an 8p/4GB RAM
machine.  That's the same results I've been seeing since I wrote
this patch almost 12 months ago....

So, given that this is a metadata intensive workload, the only
extent allocation is going to be through inode and directory block
allocation. These paths do not consume a large amount of stack, so
we can tell the allocator not to switch to workqueue stack for these
allocations easily.

The patch below does this. It completely removes all the allocation
based context switches from the no-data fsmark workloads being used
for this testing. It makes no noticeable difference to performance
here, so I'm interested if it solves the regression you are seeing
on your machines.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: don't defer metadata allocation to the workqueue

From: Dave Chinner <dchinner@redhat.com>

Almost all metadata allocations come from shallow stack usage
situations. Avoid the overhead of switching the allocation to a
workqueue as we are not in danger of running out of stack when
making these allocations. Metadata allocations are already marked
through the args that are passed down, so this is trivial to do.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_alloc.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index f654f51..4f33c32 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2434,13 +2434,22 @@ xfs_alloc_vextent_worker(
 	current_restore_flags_nested(&pflags, PF_FSTRANS);
 }
 
-
-int				/* error */
+/*
+ * Data allocation requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. Metadata
+ * requests, OTOH, are generally from low stack usage paths, so avoid the
+ * context switch overhead here.
+ */
+int
 xfs_alloc_vextent(
-	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 
+	if (!args->userdata)
+		return __xfs_alloc_vextent(args);
+
+
 	args->done = &done;
 	INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
 	queue_work(xfs_alloc_wq, &args->work);

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [MMTests] IO metadata on XFS
  2012-07-04  0:47                   ` Dave Chinner
  (?)
@ 2012-07-04  9:51                     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04  9:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
	dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter,
	Chris Wilson

On Wed, Jul 04, 2012 at 10:47:06AM +1000, Dave Chinner wrote:
> > > > <SNIP>
> > > > [c999a223: xfs: introduce an allocation workqueue]
> > > 
> > > Which indicates that there is workqueue scheduling issues, I think.
> > > The same amount of work is being done, but half of it is being
> > > pushed off into a workqueue to avoid stack overflow issues (*).  I
> > > tested the above patch in anger on an 8p machine, similar to the
> > > machine you saw no regressions on, but the workload didn't drive it
> > > to being completely CPU bound (only about 90%) so the allocation
> > > work was probably always scheduled quickly.
> > 
> > What test were you using?
> 
> fsmark, dbench, compilebench, and a few fio workloads. Also,
> xfstests times each test and I keep track of overall runtime, and
> none of those showed any performance differential, either...
> 

Sound. I have some coverage on some of the same tests. When I get to
them I'll keep an eye on the 3.4 figures. It might be due to the disk
I'm using. It's a single disk and nothing to write home about in terms of
performance. It's not exactly XFS's usual target audience.

> Indeed, running on a current 3.5-rc5 tree, my usual fsmark
> benchmarks are running at the same numbers I've been seeing since
> about 3.0 - somewhere around 18k files/s for a single thread, and
> 110-115k files/s for 8 threads.
> 
> I just ran your variant, and I'm getting about 20k files/s for a
> single thread, which is about right because you're using smaller
> directories than I am (22500 files per dir vs 100k in my tests).
> 

I had data for an fsmark-single test running with 30M files and FWIW the
3.4 performance figures were in line with 3.0 and later kernels.

> > > How many worker threads have been spawned on these machines
> > > that are showing the regression?
> > 
> > 20 or 21 generally. An example list as spotted by top looks like
> 
> Pretty normal.
> 
> > > What is the context switch rate on the machines when the test is running?
> .....
> > Vanilla average context switch rate	4278.53
> > Revert average context switch rate	1095
> 
> That seems about right, too.
> 
> > > Can you run latencytop to see
> > > if there is excessive starvation/wait times for allocation
> > > completion?
> > 
> > I'm not sure what format you are looking for.
> 
> Where the context switches are coming from, and how long they are
> being stalled for.

Noted. Capturing latency_stats over time is enough to do that. It won't
give a per-process breakdown but in the majority of cases that is not a
problem. Extracting the data is a bit annoying but not impossible and
better than parsing latencytop. Ideally, latencytop would be able to log
data in some sensible format. hmmm.
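
Roughly what I have in mind for the monitor is something like the
following (a sketch, and it assumes CONFIG_LATENCYTOP is enabled in the
test kernels):

  # enable latency accounting and sample /proc/latency_stats over time
  echo 1 > /proc/sys/kernel/latencytop
  while true; do
    echo "time: `date +%s`" >> latency_stats.log
    cat /proc/latency_stats >> latency_stats.log
    sleep 10
  done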

> Just to get the context switch locations, you
> can use perf on the sched:sched_switch event, but that doesn't give
> you stall times.

No, but both can be captured and roughly correlated with each other
given sufficient motivation.

> Local testing tells me that about 40% of the
> switches are from xfs_alloc_vextent, 55% are from the work threads,
> and the rest are CPU idling events, which is exactly as I'd expect.
> 
> > > A perf top profile comparison might be informative,
> > > too...
> > > 
> > 
> > I'm not sure if this is what you really wanted. I thought an oprofile or
> > perf report would have made more sense but I recorded perf top over time
> > anyway and it's at the end of the mail.
> 
> perf report and oprofile give you CPU usage across the run; it's not
> instantaneous and that's where all the interesting information is.
> e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or
> it might be 100% for 1s - that's the behaviour run profiles cannot
> give you insight into....
> 

Fair point. I'll fix up the timestamping and keep the monitor for future
reference.

> As it is, the output you posted is nothing unusual.
> 

Grand. I had taken a look and saw nothing particularly unusual, but I
also was not 100% sure what I should be looking for.

> > For just these XFS tests I've uploaded a tarball of the logs to
> > http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz
> 
> Ok, so the main thing I missed when first looking at this is that
> you are concerned about single thread regressions.

In this specific test, yes. In the original data I posted I had threaded
benchmarks but they did not show the regression. This was a rerun with
just the single threaded case. Generally I run both because I see bug
reports involving both types of test.

> Well, I can't
> reproduce your results here. Single threaded with or without the
> workqueue based allocation gives me roughly 20k +/-0.5k files/s on
> a single disk, a 12 disk RAID0 array and a RAM disk on an 8p/4GB RAM
> machine.  That's the same results I've been seeing since I wrote
> this patch almost 12 months ago....
> 
> So, given that this is a metadata intensive workload, the only
> extent allocation is going to be through inode and directory block
> allocation. These paths do not consume a large amount of stack, so
> we can tell the allocator not to switch to workqueue stack for these
> allocations easily.
> 
> The patch below does this. It completely removes all the allocation
> based context switches from the no-data fsmark workloads being used
> for this testing. It makes no noticeable difference to performance
> here, so I'm interested if it solves the regression you are seeing
> on your machines.
> 

It does. nodefer-metadata is just your patch applied on top of 3.4 and is
the right-most column. It's within the noise for the reverted patches and
approximately the same performance as 3.3. If you look at the timing
stats at the bottom you'll see that the patch brings the System time way
down so consider this a

Tested-by: Mel Gorman <mgorman@suse.de>

FS-Mark Single Threaded
                 fsmark-single      single-3.4.0      single-3.4.0      single-3.4.0      single-3.4.0
                 3.4.0-vanilla   revert-aa464191   revert-c999a223       revert-both  nodefer-metadata
Files/s  min       14176.40 ( 0.00%)    17830.60 (25.78%)    24186.70 (70.61%)      25108.00 (77.11%)    25448.40 (79.51%)
Files/s  mean      16783.35 ( 0.00%)    25029.69 (49.13%)    37513.72 (123.52%)     38169.97 (127.43%)   36393.09 (116.84%)
Files/s  stddev     1007.26 ( 0.00%)     2644.87 (162.58%)    5344.99 (430.65%)      5599.65 (455.93%)    5961.48 (491.85%)
Files/s  max       18475.40 ( 0.00%)    27966.10 (51.37%)    45564.60 (146.62%)     47918.10 (159.36%)   47146.20 (155.18%)
Overhead min      593978.00 ( 0.00%)   386173.00 (34.99%)   253812.00 (57.27%)     247396.00 (58.35%)   248906.00 (58.10%)
Overhead mean     637782.80 ( 0.00%)   429229.33 (32.70%)   322868.20 (49.38%)     287141.73 (54.98%)   284274.93 (55.43%)
Overhead stddev    72440.72 ( 0.00%)   100056.96 (-38.12%)  175001.08 (-141.58%)   102018.14 (-40.83%)  114055.47 (-57.45%)
Overhead max      855637.00 ( 0.00%)   753541.00 (11.93%)   880531.00 (-2.91%)     637932.00 (25.44%)   710720.00 (16.94%)

MMTests Statistics: duration
Sys Time Running Test (seconds)              44.06     32.25     24.19     23.99     24.38
User+Sys Time Running Test (seconds)         50.19     36.35     27.24      26.7     27.12
Total Elapsed Time (seconds)                 59.21     44.76     34.95     34.14     36.11

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Page reclaim performance on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-04 15:52     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 15:52 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__pagereclaim-performance-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3
Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

largedd is showing that in 3.0 reclaim started writing pages. This
has been "fixed" by deferring the writeback to flusher threads and
immediately reclaiming. This is mostly fixed now but it would be worth
refreshing the memory as to why this happened in the first place.

Slab shrinking is another area to pay attention to. Scanning in recent
kernels is consistent with earlier kernels but there are cases where
the number of inodes being reclaimed has changed significantly. Inode
stealing in one case is way higher since 3.1.

postmark detected that there was excessive swapping for some kernels
between 3.0 and 3.3, but it is not always reproducible and depends
on both the workload and the amount of available memory. It does indicate
that 3.1 and 3.2 might have been very bad kernels for interactivity
under memory pressure.

largedd is showing in some cases that there was excessive swap activity
in 3.1 and 3.2 kernels. This has been fixed but the fixes were not
backported.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner. The
intention is not to be exhaustive but to test basic reclaim patterns that
the VM should never get wrong. Regressions may also be due to changes in
the IO scheduler or underlying filesystem.

The workloads are predominately file-based. Anonymous page reclaim stress
testing is covered by another test.

mkfs was run on system startup. No attempt was made to age the filesystem. No
special mkfs or mount options were used.

postmark
  o 15000 transactions
  o File sizes ranged from 3096 bytes to 5M
  o 100 subdirectories
  o Total footprint approximately 4*TOTAL_RAM

  This workload is a single-threaded benchmark intended to measure
  filesystem performance for many short-lived and small files. Its
  primary weakness is that it does no application processing and
  so the page aging is basically on a per-file granularity.
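
  As a rough illustration, the parameters above translate to a postmark
  command script along the following lines, fed to postmark on stdin.
  This is a sketch rather than the exact script the harness generates;
  the simultaneous-file count is an assumption derived from the
  4*TOTAL_RAM footprint, and 5M is taken to be 5242880 bytes.

    set transactions 15000
    set size 3096 5242880
    set subdirectories 100
    set number 1600
    run
    quit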

largedd
  o Target size 8*TOTAL_RAM

  This downloads a large file and makes copies with dd until the
  target footprint size is reached.
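
  As a sketch of the idea (this is not the mmtests implementation and
  the file names are illustrative):

    # repeatedly copy the source file until ~8*TOTAL_RAM of copies exist
    target_mb=$(awk '/MemTotal/ {print int($2 / 1024) * 8}' /proc/meminfo)
    size_mb=$(du -m largefile | cut -f1)
    copied=0; i=0
    while [ $copied -lt $target_mb ]; do
            dd if=largefile of=copy-$i bs=1M
            copied=$((copied + size_mb))
            i=$((i + 1))
    done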

fsmark
  o Parallel directories were used
  o 1 Thread per CPU
  o 30M Filesize
  o 16 directories
  o 256 files per directory
  o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-25458/1  -D  16  -N  256  -n  532  -L  15  -S0  -s  31457280
  Thread: ./fs_mark  -d  /tmp/fsmark-28217/1  -d  /tmp/fsmark-28217/2  -d  /tmp/fsmark-28217/3 \
          -d  /tmp/fsmark-28217/4  -d  /tmp/fsmark-28217/5  -d  /tmp/fsmark-28217/6  -d  /tmp/fsmark-28217/7 \
          -d  /tmp/fsmark-28217/8  -d  /tmp/fsmark-28217/9  -d  /tmp/fsmark-28217/10  -d  /tmp/fsmark-28217/11 \
          -d  /tmp/fsmark-28217/12  -d  /tmp/fsmark-28217/13  -d  /tmp/fsmark-28217/14  -d  /tmp/fsmark-28217/15 \
          -d  /tmp/fsmark-28217/16  -d  /tmp/fsmark-28217/17  -d  /tmp/fsmark-28217/18  -d  /tmp/fsmark-28217/19 \
          -d  /tmp/fsmark-28217/20  -d  /tmp/fsmark-28217/21  -d  /tmp/fsmark-28217/22  -d  /tmp/fsmark-28217/23 \
          -d  /tmp/fsmark-28217/24  -d  /tmp/fsmark-28217/25  -d  /tmp/fsmark-28217/26  -d  /tmp/fsmark-28217/27 \
          -d  /tmp/fsmark-28217/28  -d  /tmp/fsmark-28217/29  -d  /tmp/fsmark-28217/30  -d  /tmp/fsmark-28217/31 \
          -d  /tmp/fsmark-28217/32  -D  16  -N  256  -n  16  -L  15  -S0  -s  31457280

micro
  o Total mapping size 10*TOTAL_RAM
  o NR_CPU threads
  o 5 iterations

  This creates one process per CPU and a large file-backed mapping
  up to the total mapping size. Each of the threads does a streaming
  read of the mapping for a number of iterations. It then restarts with
  a streaming write.

  Completion time is the primary factor here but be careful as a good
  completion time can be due to excessive reclaim which has an adverse
  effect on many other workloads.
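
  The benchmark itself is mmap()-based, but the IO pattern can be
  roughly approximated from the shell as below. This is a hypothetical
  sketch using dd rather than a shared file-backed mapping, so it only
  illustrates the access pattern:

    # Approximate the micro access pattern: parallel streaming
    # readers for several iterations, then parallel streaming writers.
    FILE=/tmp/micro-mapping            # pre-created at 10*TOTAL_RAM
    NR_CPU=$(nproc)
    MB=$(( $(stat -c %s "$FILE") / 1048576 ))
    for iter in 1 2 3 4 5; do
            for cpu in $(seq "$NR_CPU"); do
                    dd if="$FILE" of=/dev/null bs=1M 2>/dev/null &
            done
            wait
    done
    for cpu in $(seq "$NR_CPU"); do
            dd if=/dev/zero of="$FILE" bs=1M count="$MB" conv=notrunc 2>/dev/null &
    done
    wait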

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Generally good but pay attention to largedd
===========================================================

fsmark-single
-------------
  Generally this is looking good. There was a small performance dip in
  3.3 that has been recovered. All the reclaim was done from kswapd in
  all cases and efficiency was high.

fsmark-threaded
---------------
  Generally looking good although the higher variance in 3.4 is a surprise.
  It is interesting to note that in 2.6.32 and 2.6.34 direct reclaim
  was used and that this is no longer the case. kswapd efficiency is high
  throughout as the scanning velocity is steady. A point of interest is
  that slabs were not scanned in 3.4. This could be good or bad.

postmark
--------
  postmark performance took a dip in 3.2 and 3.3 and this roughly correlates
  with some other IO tests. It is very interesting to note that there was
  swap activity in 3.1 and 3.2 that has since been resolved but is something
  that -stable users might care about. Otherwise kswapd efficiency and
  scanning rates look good.

largedd
-------
  Figures generally look good here. All the reclaim is done by kswapd
  as expected. It is very interesting to note that in 3.0 and 3.1
  reclaim started writing out pages and that it is now being deferred
  to flusher threads with the pages then immediately reclaimed.

micro
-----
  Completion times generally look good and are improving. Some direct
  reclaim is happening but at a steady percentage on each release.
  Efficiency is much lower than other workloads but at least it is
  consistent.

  Slab shrinking may need examination. It clearly shows that scanning
  is taking place but fewer inodes are being reclaimed in recent
  kernels. This is not necessarily bad but worth paying attention to.

  One point of concern is that kswapd CPU usage is high in recent
  kernels for this test. The graph is a complete mess and needs
  closer examination.
   
==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok but postmark shows high inode steals
==========================================================

fsmark-single
-------------
  This is all over the place. 3.0 was bad, 3.1 was good but performance
  has declined since then. The reclaim statistics look ok although slabs
  are being scanned without being reclaimed which is curious. kswapd is
  scanning less but that is not necessarily the cause of the problem.

fsmark-threaded
---------------
  In contrast to the single-threaded case, this is looking steady although
  overall kernels are slower than 2.6.32 was.  Direct reclaim velocity is
  slightly higher but still a small percentage of overall scanning. It's
  cause for some concern but not an immediate panic as reclaim efficiency
  is high.

postmark
--------
  postmark performance took a serious dip in 3.2 and 3.3 and while it
  has recovered a bit in 3.4, it is still far below the peak seen in 3.1.
  For the most part, reclaim statistics look ok with the exception of
  slab shrinking. Inode stealing is way up in 3.1 and later kernels.

largedd
-------
  Recent figures look good but in 3.1 and 3.2 there was excessive swapping.
  This was matched by pages reclaimed by direct reclaim in the same kernels
  and it likely shot interactive performance to hell when doing large
  copies on those kernels. This roughly matches other bug reports so it's
  interesting, but clearly the patches did not get backported to -stable.

micro
-----
  Completion times look generally good. 3.1 and 3.2 show better times
  but as this is matched by excessive amounts of reclaim it is not likely
  to be a good trade-off, so let's not "fix" that.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

fsmark-single
-------------
  There was a performance dip in 3.4 but otherwise looks ok. Reclaim stats
  look fine.

fsmark-threaded
---------------
  Other than something crazy happening to variance in 3.1, figures look ok.
  There is some direct reclaim going on but the efficiency is high. There
  are 8 threads on this machine and only one disk so some stalling due
  to direct reclaim would be expected if IO was not keeping up.

postmark
--------
  postmark performance took a serious dip in 3.1 and 3.2, one kernel
  version earlier than the same dip seen on hydra. No explanation for this
  but it is matched by swap activity in the same kernels so it might be
  an indication there was general swapping-related damage in the 3.0-3.3
  time-frame.

largedd
-------
  Recent figures look good. Again, some swap activity is visible in 3.1 and
  3.2 that has since been fixed but obviously not backported.

micro
-----
  Okish, I suppose. Completion times have improved but are all over the
  place and, as seen elsewhere, good completion times on micro can
  sometimes be due to excessive reclaim, which is not necessarily a good
  thing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Page reclaim performance on ext4
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-04 15:53     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 15:53 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-ext4

Configuration:	global-dhp__pagereclaim-performance-ext4
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4
Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

fsmark is showing a performance dip in 3.4 on a 32-bit machine that is
not matched by results elsewhere.

A number of tests show that there was swapping activity in 3.1 and 3.2
that should have been completely unnecessary. It has been fixed but the
fixes should have been backported, as some were based on reports of poor
interactive performance during IO.

One largedd test on hydra showed that a number of dirty pages are reaching
the end of the LRU. This is not visible on other tests but should be
monitored.

I added linux-ext4 to the list because there are some performance drops
that are not visible elsewhere and so may be filesystem-specific.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner. The
intention is not to be exhaustive but to test basic reclaim patterns that
the VM should never get wrong. Regressions may also be due to changes in
the IO scheduler or underlying filesystem.

The workloads are predominantly file-based. Anonymous page reclaim stress
testing is covered by another test.

mkfs was run at system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

postmark
  o 15000 transactions
  o File sizes ranged from 3096 bytes to 5M
  o 100 subdirectories
  o Total footprint approximately 4*TOTAL_RAM

  This workload is a single-threaded benchmark intended to measure
  filesystem performance for many short-lived and small files. Its
  primary weakness is that it does no application processing and
  so the page aging is basically at per-file granularity.

largedd
  o Target size 8*TOTAL_RAM

  This downloads a large file and makes copies with dd until the
  target footprint size is reached.

fsmark
  o Parallel directories were used
  o 1 Thread per CPU
  o 30M Filesize
  o 16 directories
  o 256 files per directory
  o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-25458/1  -D  16  -N  256  -n  532  -L  15  -S0  -s  31457280
  Thread: ./fs_mark  -d  /tmp/fsmark-28217/1  -d  /tmp/fsmark-28217/2  -d  /tmp/fsmark-28217/3 \
          -d  /tmp/fsmark-28217/4  -d  /tmp/fsmark-28217/5  -d  /tmp/fsmark-28217/6  -d  /tmp/fsmark-28217/7 \
          -d  /tmp/fsmark-28217/8  -d  /tmp/fsmark-28217/9  -d  /tmp/fsmark-28217/10  -d  /tmp/fsmark-28217/11 \
          -d  /tmp/fsmark-28217/12  -d  /tmp/fsmark-28217/13  -d  /tmp/fsmark-28217/14  -d  /tmp/fsmark-28217/15 \
          -d  /tmp/fsmark-28217/16  -d  /tmp/fsmark-28217/17  -d  /tmp/fsmark-28217/18  -d  /tmp/fsmark-28217/19 \
          -d  /tmp/fsmark-28217/20  -d  /tmp/fsmark-28217/21  -d  /tmp/fsmark-28217/22  -d  /tmp/fsmark-28217/23 \
          -d  /tmp/fsmark-28217/24  -d  /tmp/fsmark-28217/25  -d  /tmp/fsmark-28217/26  -d  /tmp/fsmark-28217/27 \
          -d  /tmp/fsmark-28217/28  -d  /tmp/fsmark-28217/29  -d  /tmp/fsmark-28217/30  -d  /tmp/fsmark-28217/31 \
          -d  /tmp/fsmark-28217/32  -D  16  -N  256  -n  16  -L  15  -S0  -s  31457280

micro
  o Total mapping size 10*TOTAL_RAM
  o NR_CPU threads
  o 5 iterations

  This creates one process per CPU and a large file-backed mapping
  up to the total mapping size. Each of the threads does a streaming
  read of the mapping for a number of iterations. It then restarts with
  a streaming write.

  Completion time is the primary factor here but be careful as a good
  completion time can be due to excessive reclaim which has an adverse
  effect on many other workloads.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Generally good but swapping in 3.1 and 3.2
===========================================================

fsmark-single
-------------
  There was a performance dip in 3.4 that is not visible for ext3 and so
  may be filesystem-specific. From a reclaim perspective the figures look ok.

fsmark-threaded
---------------
  This also shows a large performance dip in 3.4 and an increase in variance
  but again the reclaim stats look ok.

postmark
--------
  As seen on ext3, there was a performance dip for kernels 3.1 to 3.3 that
  has not quite been recovered in 3.4. There was also swapping activity
  for the 3.1 and 3.2 kernels which may partially explain the problem.

largedd
-------
  Completion times are mixed. Again some swapping activity is visible in
  3.1 and 3.2 which has been resolved but not backported.

micro
-----
  Completion figures are not looking bad for 3.4.
   
==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok but postmark shows high inode steals
==========================================================

fsmark-single
-------------
  Performance was declining until 3.1 and has been steady ever since.
  Some minor swapping is visible in 3.1.

fsmark-threaded
---------------
  Surprisingly steady performance, but recent kernels show some direct
  reclaim is going on. It's a tiny percentage and lower than it has
  been historically but keep an eye on it.

postmark
--------
  As with other tests, 3.1 saw a performance dip and swapping activity. 3.2
  was also swapping although performance was not affected. While the reclaim
  figures currently look ok, the actual performance sucks. As the same is
  not visible on ext3, this may be a filesystem problem.

largedd
-------
  Completion figures look good but, as before, there was swapping in 3.1
  and 3.2. What is of concern is that the number of pages reclaimed via
  PageReclaim is excessively high in 3.3 and 3.4. This implies that a
  large number of dirty pages are reaching the end of the LRU, which can
  be a problem. Minimally it increases kswapd CPU usage but it can also
  indicate a flushing problem.

micro
-----
  Looks ok.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
  Steady performance throughout. Tiny swap activity visible on 3.1 and 3.2
  which caused no harm but does correlate with other tests.

fsmark-threaded
---------------
  Also steady with the exception of 3.2, which is bizarre from a reclaim
  perspective. There was no direct reclaim scanning but a lot of inodes
  were reclaimed.

postmark
--------
  Same performance dip in 3.1 and 3.2 and accompanied by the same swapping
  problem.

largedd
-------
  Completion figures generally look ok although again 3.1 is bad from
  a swapping and direct reclaim perspective.

micro
-----
  Looks ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Page reclaim performance on xfs
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-04 15:53     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 15:53 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__pagereclaim-performance-xfs
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs
Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

For the most part this is looking good. There is excessive swapping
visible on 3.1 and 3.2 which is seen elsewhere.

There is a concern with postmark figures. It is showing that we started
entering direct reclaim in 3.1 on hydra and while the swapping problem
has been addressed, we are still using direct reclaim and in some cases
it is quite a high percentage. This did not happen on older kernels.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner. The
intention is not to be exhaustive but to test basic reclaim patterns that
the VM should never get wrong. Regressions may also be due to changes in
the IO scheduler or underlying filesystem.

The workloads are predominantly file-based. Anonymous page reclaim stress
testing is covered by another test.

mkfs was run at system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most part.
        On kernels too old to support delaylog, it was removed. On kernels
        where it was the default, it was still specified and the warning
        ignored.
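
In other words, the filesystem would have been prepared roughly as
follows, where the device and mount point are placeholders:

  # Illustrative only; /dev/sdb1 and /mnt/xfs are placeholders.
  mkfs.xfs -f -d agcount=8 /dev/sdb1
  mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/sdb1 /mnt/xfs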

postmark
  o 15000 transactions
  o File sizes ranged from 3096 bytes to 5M
  o 100 subdirectories
  o Total footprint approximately 4*TOTAL_RAM

  This workload is a single-threaded benchmark intended to measure
  filesystem performance for many short-lived and small files. Its
  primary weakness is that it does no application processing and
  so the page aging is basically at per-file granularity.

largedd
  o Target size 8*TOTAL_RAM

  This downloads a large file and makes copies with dd until the
  target footprint size is reached.

fsmark
  o Parallel directories were used
  o 1 Thread per CPU
  o 30M Filesize
  o 16 directories
  o 256 files per directory
  o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
  o 15 iterations
  Single: ./fs_mark  -d  /tmp/fsmark-25458/1  -D  16  -N  256  -n  532  -L  15  -S0  -s  31457280
  Thread: ./fs_mark  -d  /tmp/fsmark-28217/1  -d  /tmp/fsmark-28217/2  -d  /tmp/fsmark-28217/3 \
          -d  /tmp/fsmark-28217/4  -d  /tmp/fsmark-28217/5  -d  /tmp/fsmark-28217/6  -d  /tmp/fsmark-28217/7 \
          -d  /tmp/fsmark-28217/8  -d  /tmp/fsmark-28217/9  -d  /tmp/fsmark-28217/10  -d  /tmp/fsmark-28217/11 \
          -d  /tmp/fsmark-28217/12  -d  /tmp/fsmark-28217/13  -d  /tmp/fsmark-28217/14  -d  /tmp/fsmark-28217/15 \
          -d  /tmp/fsmark-28217/16  -d  /tmp/fsmark-28217/17  -d  /tmp/fsmark-28217/18  -d  /tmp/fsmark-28217/19 \
          -d  /tmp/fsmark-28217/20  -d  /tmp/fsmark-28217/21  -d  /tmp/fsmark-28217/22  -d  /tmp/fsmark-28217/23 \
          -d  /tmp/fsmark-28217/24  -d  /tmp/fsmark-28217/25  -d  /tmp/fsmark-28217/26  -d  /tmp/fsmark-28217/27 \
          -d  /tmp/fsmark-28217/28  -d  /tmp/fsmark-28217/29  -d  /tmp/fsmark-28217/30  -d  /tmp/fsmark-28217/31 \
          -d  /tmp/fsmark-28217/32  -D  16  -N  256  -n  16  -L  15  -S0  -s  31457280

micro
  o Total mapping size 10*TOTAL_RAM
  o NR_CPU threads
  o 5 iterations

  This creates one process per CPU and a large file-backed mapping
  up to the total mapping size. Each of the threads does a streaming
  read of the mapping for a number of iterations. It then restarts with
  a streaming write.

  Completion time is the primary factor here but be careful as a good
  completion time can be due to excessive reclaim which has an adverse
  effect on many other workloads.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Generally good but postmark shows direct reclaim
===========================================================

fsmark-single
-------------
  Generally good, steady performance throughout. There was direct
  reclaim activity in early kernels but not any more.

fsmark-threaded
---------------
  Generally good as well.

postmark
--------
  This is interesting. There was a mild performance dip in 3.2 but
  while the excessive swapping is visible in 3.1 and 3.2 as seen
  on other tests, it did not translate into a performance drop.
  What is of concern is that direct reclaim figures are still high
  for recent kernels even if it is not swapping.

largedd
-------
  Completion times have suffered a little and the usual swapping
  in 3.1 and 3.2 is visible but it's tiny.

micro
-----
  Looking great.
   
==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
  Generally good, swap in 3.1 and 3.2

fsmark-threaded
---------------
  Generally good, no direct reclaim on recent kernels

postmark
--------
  Looking great other than swapping in 3.1 and 3.2 which again
  does not appear to translate into a performance drop. Direct
  reclaim started around kernel 3.1 and this has not eased off.
  It's a sizable percentage.

largedd
-------
  Completion times are ok; the swapping in 3.1 and 3.2 is not.

micro
-----
  Ok.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
  Generally good. No swapping visible on 3.1 and 3.2 but this machine
  also has more memory.

fsmark-threaded
---------------
  Generally good although 3.2 showed that a lot of inodes were reclaimed.
  This matches a similar test on ext4 so something odd happened there.

postmark
--------
  Looking great for performance although there is some swapping in 3.1
  and 3.2 and direct reclaim scanning is still high.

largedd
-------
  Completion times look good.

micro
-----
  Looks ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Interactivity during IO on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-05 14:56     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-05 14:56 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-interactive-performance-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3
Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

There are some terrible results in here that might explain some of the
interactivity mess if the distribution defaulted to ext3 or it was chosen
by the user for any reason. In some cases average read latency has doubled,
tripled and in one case almost quadrupled since 2.6.32. Worse, we are not
consistently good or bad. I see patterns like great release, bad release,
good release, bad again etc.

Benchmark notes
===============

NOTE: This configuration is new and very experimental. This is my first
      time looking at the results of this type of test so flaws are
      inevitable. There is ample scope for improvement but I had to
      start somewhere.

This configuration is very different in that it is trying to analyse the
impact of IO on interactive performance.  Some interactivity problems are
due to an application trying to read() cache-cold data such as configuration
files or cached images. If there is a lot of IO going on, the application
may stall while this happens.  This is a limited scenario for measuring
interactivity but a common one.

These tests are fairly standard except that there is a background
application running in parallel. It begins by creating a 100M file and
using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is
complete it will try to read 1M from the file every few seconds and record
the latency. When it reaches the end of the file, it dumps it from cache
and starts again.
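
A hypothetical shell sketch of such a monitor is below. The real
monitor is a compiled helper; the coreutils nocache idiom stands in
for fadvise(POSIX_FADV_DONTNEED) here, and readahead means an
individual 1M read is not guaranteed to be fully cache-cold:

  # Create a 100M probe file, then repeatedly time 1M reads of it.
  FILE=/tmp/latency-probe
  dd if=/dev/zero of="$FILE" bs=1M count=100 conv=fsync 2>/dev/null
  while true; do
          # Drop the file from the page cache (fadvise DONTNEED).
          dd of="$FILE" oflag=nocache conv=notrunc,fdatasync count=0 2>/dev/null
          for off in $(seq 0 99); do
                  start=$(date +%s.%N)
                  dd if="$FILE" of=/dev/null bs=1M count=1 skip="$off" 2>/dev/null
                  echo "read latency: $(echo "$(date +%s.%N) - $start" | bc)s"
                  sleep 3
          done
  done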

This latency is a *proxy* measure of interactivity, not a true measure. A
variation would be to measure the time for small writes for applications
that are logging data or applications like gnome-terminal that do small
writes to /tmp as part of their buffer management. The main strength is
that if we get this basic case wrong, then the complex cases are almost
certainly screwed as well.

There are two areas to pay attention to. One is completion time and how
it is affected by the small reads taking place in parallel. A comprehensive
analysis would show exactly how much the workload is affected by a parallel
read but right now I'm just looking at wall time.

The second area to pay attention to is the read latencies, paying
particular attention to the average and maximum latencies. The variations
are harder to draw decent conclusions from. A sensible option would be to
plot a CDF to get a better idea of what the probability of a given read
latency is, but for now that's a TODO item. As it is, the graphs are
barely usable and I'll be giving that more thought.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

fsmark-single
-------------
  Completion times since 3.2 have been badly affected, which coincides with
  the introduction of IO-less dirty page throttling. 3.3 was particularly
  bad.

  2.6.32 was TERRIBLE in terms of read latencies, with the average and
  max latencies looking awful. The 90th percentile was close to 4
  seconds and as a result the graphs are even more of a complete mess than
  they might have been otherwise.

  Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
  reads were below 206ms but in 3.2 this had grown to 273ms. The latency
  of the other 5% results increased from 481ms to 774ms.

  3.4 is looking better at least.

fsmark-threaded
---------------
  With multiple writers, completion times have been affected and again 3.2
  showed a big increase.

  Again, 2.6.32 is a complete disaster and mucks up all the graphs.

  Otherwise, our average read latencies do not look too bad. However, our
  worst-case latencies look pretty bad. Kernel 3.2 is showing that at worst
  a read() can take 4.3 seconds when there are multiple parallel writers.
  This must be fairly rare as 99% of the latencies were below 1 second, but
  an occasional 4 second stall in an application would feel pretty bad.

  Maximum latencies have improved a bit in 3.4 but are still around a half
  second higher than 3.0 and 3.1 kernels.
  
postmark
--------
  This is interesting in that the 3.2 kernel's results show an improvement
  in maximum read latencies while 3.4 is looking worse. The completion times
  for postmark were very badly affected in 3.4, almost the opposite of what
  the fsmark workloads showed. It's hard to draw any sensible conclusions
  from this that match up with fsmark.

largedd
-------
  Completion times are more or less unaffected.

  Maximum read latencies are affected though. Our maximum latency was
  781ms in 2.6.39, 13163ms in 3.0 and 1122ms in 3.2, which might explain
  some of the interactivity complaints around those kernels when a large
  cp was going on. Right now, things are looking very good.

micro
-----
  Completion times look ok.

  2.6.32 is again hilariously bad.

  3.1 also showed very poor maximum latencies but 3.2 and later kernels
  look good.


==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

fsmark-single
-------------
  Completion times are all over the place, with a big increase in 3.2 that
  has improved a bit since but is still not as good as the 3.1 kernels.

  Unlike arnold, 2.6.32 is not a complete mess and makes a comparison more
  meaningful. Our maximum latencies have jumped around a lot with 3.2
  being particularly bad and 3.4 not being much better. 3.1 and 3.3 were
  both good in terms of maximum latency.

  Average latency is shot to hell. In 2.6.32 it was 349ms and it's now 781ms.
  3.2 was really bad but it's not like 3.0 or 3.1 were fantastic either.

fsmark-threaded
---------------
  Completion times are more or less ok.

  Maximum read latency is worse, with increases of around 500ms in the
  worst-case latency, and even the 90th percentile is not looking great.

  Average latency is completely shot.

postmark
--------
  Again impossible to draw sensible conclusions from this. The throughput
  graph makes a nice sawtooth pattern suitable for poking you in the eye
  until it bleeds.

  It's all over the place in terms of completion times. Average latency
  figures are relatively ok but still regressed. Maximum latencies have
  increased.

largedd
-------
  Completion times are more or less steady although 3.2 showed a large
  jump in the length of time it took to copy the files. 3.2 took almost
  10 minutes more to copy the files than 3.1 or 3.3.

  Maximum latencies in 3.2 were very high and the 90th percentile also
  looked pretty bad. 3.4 is better but still way worse than 2.6.32.

  Average latency would be laughable if it was not so tragic.

micro
-----
  This was looking better until 3.4 when max latencies jumped but by
  and large this looks good.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

fsmark-single
-------------
  Completion times are more or less ok. They have been worse since 3.2
  but are still better than 3.2 itself by a big margin.

  Read latencies are another story. Maximum latency has increased a
  LOT from 1.3 seconds to 3.1 seconds in kernel 3.4. Kernel 3.0 had
  a maximum latency of 6.5 seconds!

  The 90th percentile figures are not much better with latencies of
  more than 1 second being recorded from all the way back to 2.6.39.

  Average latencies have more than tripled from 230ms to 812ms.

fsmark-threaded
---------------
  Completion times are generally good.

  Read latencies are completely screwed. Kernel 3.2 had a maximum latency
  of 15 seconds! 3.4 has improved but it's still way too high. Even
  the 90th percentile figures look completely crap and average latency
  is of course bad with such high latencies being recorded.

postmark
--------
  Once again the throughput figures make a nice stab stab shape for the eyes.

  The latency figures are sufficiently crap that it depresses me to talk
  about them.

largedd
-------
  Completion times look decent.

  That does not offset the latency figures, which are again shocking:
  6 second maximum latencies in the 3.3 and 3.4 kernels, although
  3.2 was actually quite good. Even the 90th percentile latencies have
  almost doubled since 3.2 and of course the average latencies have almost
  tripled, in line with other results.

micro
-----
  Completion times look decent. 

  The read latencies are ok. The average latency is higher because the
  latencies to the 90th percentile are higher but the maximum latencies
  have improved so overall I guess this is a win.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Interactivity during IO on ext4
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-05 14:57     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-05 14:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-interactive-performance-ext4
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4
Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

Unlike ext3, these figures look generally good. There are a few wrinkles in
there but indications are that interactivity jitter experienced by users
may have a filesystem-specific component.  One possibility is that there
are differences in how metadata reads are sent to the IO scheduler but I
did not confirm this.

Benchmark notes
===============

NOTE: This configuration is new and very experimental. This is my first
      time looking at the results of this type of test so flaws are
      inevitable. There is ample scope for improvement but I had to
      start somewhere.

This configuration is very different in that it is trying to analyse the
impact of IO on interactive performance.  Some interactivity problems are
due to an application trying to read() cache-cold data such as configuration
files or cached images. If there is a lot of IO going on, the application
may stall while this happens.  This is a limited scenario for measuring
interactivity but a common one.

These tests are fairly standard except that there is a background
application running in parallel. It begins by creating a 100M file and
using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is
complete it will try to read 1M from the file every few seconds and record
the latency. When it reaches the end of the file, it dumps it from cache
and starts again.
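
The monitor is simple enough to sketch. The following is an illustration
of the idea only; the scratch file name, fill pattern and 2 second
interval are assumptions for the sketch, not the monitor's actual
parameters.

/*
 * Sketch of a cache-cold read latency monitor as described above.
 * Illustrative only; not the actual mmtests monitor source.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define FILE_SIZE ((off_t)100 << 20)	/* 100M backing file */
#define CHUNK	  (1 << 20)		/* 1M per timed read */

static double now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
	char *buf = malloc(CHUNK);
	int fd = open("readlat-scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);
	off_t off;

	if (fd < 0 || !buf)
		return 1;

	/* Write 100M of real data, sync it, then evict it from cache */
	memset(buf, 0xaa, CHUNK);
	for (off = 0; off < FILE_SIZE; off += CHUNK)
		if (pwrite(fd, buf, CHUNK, off) != CHUNK)
			return 1;
	fsync(fd);
	posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);

	/* Time a cold 1M read every few seconds, wrapping at EOF */
	for (off = 0; ; off += CHUNK) {
		double start;

		if (off >= FILE_SIZE) {
			posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);
			off = 0;
		}
		start = now_ms();
		if (pread(fd, buf, CHUNK, off) != CHUNK)
			return 1;
		printf("offset %jd latency %.2f ms\n",
		       (intmax_t)off, now_ms() - start);
		sleep(2);	/* "every few seconds" */
	}
}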

This latency is a *proxy* measure of interactivity, not a true measure. A
variation would be to measure the time for small writes for applications
that are logging data or applications like gnome-terminal that do small
writes to /tmp as part of its buffer management. The main strength is
that if we get this basic case wrong, then the complex cases are almost
certainly screwed as well.

There are two areas to pay attention to. One is completion time and how
it is affected by the small reads taking place in parallel. A comprehensive
analysis would show exactly how much the workload is affected by a parallel
read but right now I'm just looking at wall time.

The second area to pay attention to is the read latencies paying particular
attention to the average latency and the max latencies. The variations are
harder to draw decent conclusions from. A sensible option would be to plot
a CDF to get a better idea what the probability of a given read latency is
but for now that's a TODO item. As it is, the graphs are barely usable and
I'll be giving that more thought.
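
For reference, generating the data for such a CDF from the raw samples
is trivial. A throwaway sketch, assuming one latency value per line on
stdin (an assumption about the log format, nothing more):

/*
 * Empirical CDF: read latency samples, one per line, and print
 * "latency cumulative-fraction" pairs for a plotting tool.
 */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

int main(void)
{
	double *v = NULL, x;
	size_t n = 0, cap = 0, i;

	while (scanf("%lf", &x) == 1) {
		if (n == cap) {
			cap = cap ? cap * 2 : 1024;
			v = realloc(v, cap * sizeof(*v));
			if (!v)
				return 1;
		}
		v[n++] = x;
	}
	qsort(v, n, sizeof(*v), cmp);

	/* P(latency <= v[i]) is estimated as (i + 1) / n */
	for (i = 0; i < n; i++)
		printf("%f %f\n", v[i], (double)(i + 1) / n);
	free(v);
	return 0;
}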

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		
===========================================================

fsmark-single
-------------
  Completion times are more or less ok. 3.2 showed a big improvement
  which is not in line with what was experienced in ext3.

  As with ext3, kernel 2.6.32 was a disaster but otherwise our maximum
  read latencies were looking up until 3.3 when there was a big jump that
  was not fixed in 3.4. By and large though the average latencies are 
  looking good and while the max latency is bad, the 99th percentile
  was looking good implying that the worst latencies are rarely
  experienced.

fsmark-threaded
---------------
  Completion times look generally good with 3.1 being an exception.

  Latencies are also looking good.
  
postmark
--------
  Similar story. Completion times and latencies generally look good.

largedd
-------
  Completion times were higher from 2.6.39 up until 3.3 taking nearly
  two minutes to complete the copy in some cases.

  This is reflected in some of the maximum latencies in that window
  but by and large the read latencies are much improved.

micro
-----
  Looking good all round.


==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

fsmark-single
-------------
  Completion times have degraded slightly but are acceptable.

  All the latency figures look good with some big improvements.

fsmark-threaded
---------------
  Same story, generally looking good with big improvements.

postmark
--------
  Completion times are a bit varied but latencies look good.

largedd
-------
  Completion times look good.

  Latency has improved since 2.6.32 but there is a big wrinkle
  in there. Maximum latency was 337ms in kernel 3.2 but in 3.3
  it was 707ms and in 3.4 was 990ms. The 99th percentile figures
  look good but something happened to allow bigger outliers.

micro
-----
  For the most part, looks good but there was a big jump in the
  maximum latency in kernel 3.4. Like largedd, the 99th percentile
  did not look as bad so it might be an outlier.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

fsmark-single
-------------
  Completion times have degraded slightly but are acceptable.

  All the latency figures look good with some big improvements.

fsmark-threaded
---------------
  Completion times are improved although curiously it is not reflected
  in the performance figures for fsmark itself.

  Maximum latency figures generally look good other than a mild jump
  in 3.2 that has almost been recovered.

postmark
--------
  Completion times have varied a lot and 3.4 is particularly high.

  The latency figures in general regressed in 3.4 in comparison to
  3.3 but by and large the figures look good.

largedd
-------
  Completion times generally look good but were noticeably worse for
  a number of releases between 2.6.39 and 3.2. This same window showed
  much higher latency figures with kernel 3.1 showing a maximum latency
  of 1.3 seconds for example. These were mostly outliers though as
  the 99th percentile generally looked ok.

micro
-----
  Generally much improved.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] Interactivity during IO on ext3
  2012-07-05 14:56     ` Mel Gorman
@ 2012-07-10  9:49       ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2012-07-10  9:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Thu 05-07-12 15:56:52, Mel Gorman wrote:
> Configuration:	global-dhp__io-interactive-performance-ext3
> Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3
> Benchmarks:	postmark largedd fsmark-single fsmark-threaded micro
> 
> Summary
> =======
> 
> There are some terrible results in here that might explain some of the
> interactivity mess if the distribution defaulted to ext3 or was chosen
> by the user for any reason. In some cases average read latency has doubled,
> tripled and in one case almost quadrupled since 2.6.32. Worse, we are not
> consistently good or bad. I see patterns like great release, bad release,
> good release, bad again etc.
> 
> Benchmark notes
> ===============
> 
> NOTE: This configuration is new and very experimental. This is my first
>       time looking at the results of this type of test so flaws are
>       inevitable. There is ample scope for improvement but I had to
>       start somewhere.
> 
> This configuration is very different in that it is trying to analyse the
> impact of IO on interactive performance.  Some interactivity problems are
> due to an application trying to read() cache-cold data such as configuration
> files or cached images. If there is a lot of IO going on, the application
> may stall while this happens.  This is a limited scenario for measuring
> interactivity but a common one.
> 
> These tests are fairly standard except that there is a background
> application running in parallel. It begins by creating a 100M file and
> using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is
> complete it will try to read 1M from the file every few seconds and record
> the latency. When it reaches the end of the file, it dumps it from cache
> and starts again.
> 
> This latency is a *proxy* measure of interactivity, not a true measure. A
> variation would be to measure the time for small writes for applications
> that are logging data or applications like gnome-terminal that do small
> writes to /tmp as part of its buffer management. The main strength is
> that if we get this basic case wrong, then the complex cases are almost
> certainly screwed as well.
> 
> There are two areas to pay attention to. One is completion time and how
> it is affected by the small reads taking place in parallel. A comprehensive
> analysis would show exactly how much the workload is affected by a parallel
> read but right now I'm just looking at wall time.
> 
> The second area to pay attention to is the read latencies paying particular
> attention to the average latency and the max latencies. The variations are
> harder to draw decent conclusions from. A sensible option would be to plot
> a CDF to get a better idea what the probability of a given read latency is
> but for now that's a TODO item. As it is, the graphs are barely usable and
> I'll be giving that more thought.
> 
> ===========================================================
> Machine:	arnold
> Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
> Arch:		x86
> CPUs:		1 socket, 2 threads
> Model:		Pentium 4
> Disk:		Single Rotary Disk
> ===========================================================
> 
> fsmark-single
> -------------
>   Completion times since 3.2 have been badly affected which coincides with
>   the introduction of IO-less dirty page throttling. 3.3 was particularly
>   bad.
> 
>   2.6.32 was TERRIBLE in terms of read-latencies with the average latency
>   and max latencies looking awful. The 90th percentile was close to 4
>   seconds and as a result the graphs are even more of a complete mess than
>   they might have been otherwise.
> 
>   Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
>   reads were below 206ms but in 3.2 this had grown to 273ms. The latency
>   of the other 5% results increased from 481ms to 774ms.
> 
>   3.4 is looking better at least.
  Yeah, 3.4 looks OK and I'd be interested in 3.5 results since I've merged
one more fix which should help the read latency. But all in all it's hard
to tackle the latency problems with ext3 - we have a journal which
synchronizes all the writes so we write to it with a high priority
(we use WRITE_SYNC when there's some contention on the journal). But that
naturally competes with reads and creates higher read latency.
 
> fsmark-threaded
> ---------------
>   With multiple writers, completion times have been affected and again 3.2
>   showed a big increase.
> 
>   Again, 2.6.32 is a complete disaster and mucks up all the graphs.
> 
>   Otherwise, our average read latencies do not look too bad. However, our
>   worst-case latencies look pretty bad. Kernel 3.2 is showing that at worst
>   a read() can take 4.3 seconds when there are multiple parallel writers.
>   This must be fairly rare as 99% of the latencies were below 1 second but
>   a 4 second stall in an application sometimes would feel pretty bad.
> 
>   Maximum latencies have improved a bit in 3.4 but are still around a half
>   second higher than 3.0 and 3.1 kernels.
>   
> postmark
> --------
>   This is interesting in that 3.2 kernels results show an improvement in
>   maximum read latencies and 3.4 is looking worse. The completion times
>   for postmark were very badly affected in 3.4. Almost the opposite of what
>   the fsmark workloads showed. It's hard to draw any sensible conclusions
>   from this that match up with fsmark.
> 
> largedd
> -------
>   Completion times are more or less unaffected.
> 
>   Maximum read latencies are affected though. In 2.6.39, our maximum latency
>   was 781ms and was 13163ms in 3.0 and 1122ms in 3.2 which might explain 
> some of the interactivity complaints around those kernels when a large
>   cp was going on. Right now, things are looking very good.
> 
> micro
> -----
>   Completion times look ok.
> 
>   2.6.32 is again hilariously bad.
> 
>   3.1 also showed very poor maximum latencies but 3.2 and later kernels
>   look good.
> 
> 
> ==========================================================
> Machine:	hydra
> Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
> Arch:		x86-64
> CPUs:		1 socket, 4 threads
> Model:		AMD Phenom II X4 940
> Disk:		Single Rotary Disk
> ==========================================================
> 
> fsmark-single
> -------------
>   Completion times are all over the place with a big increase in 3.2 that
>   improved a bit since but not as good as 3.1 kernels were.
> 
>   Unlike arnold, 2.6.32 is not a complete mess and makes a comparison more
>   meaningful. Our maximum latencies have jumped around a lot with 3.2
>   being particularly bad and 3.4 not being much better. 3.1 and 3.3 were
>   both good in terms of maximum latency.
> 
>   Average latency is shot to hell. In 2.6.32 it was 349ms and it's now 781ms.
>   3.2 was really bad but it's not like 3.0 or 3.1 were fantastic either.
  So I wonder what makes a difference between this machine and the previous
one. The results seem completely different. Is it the amount of memory? Is
it the difference in the disk? Or even the difference in the CPU?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] Interactivity during IO on ext3
  2012-07-10  9:49       ` Jan Kara
@ 2012-07-10 11:30         ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-10 11:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Tue, Jul 10, 2012 at 11:49:40AM +0200, Jan Kara wrote:
> > ===========================================================
> > Machine:	arnold
> > Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
> > Arch:		x86
> > CPUs:		1 socket, 2 threads
> > Model:		Pentium 4
> > Disk:		Single Rotary Disk
> > ===========================================================
> > 
> > fsmark-single
> > -------------
> >   Completion times since 3.2 have been badly affected which coincides with
> >   the introduction of IO-less dirty page throttling. 3.3 was particularly
> >   bad.
> > 
> >   2.6.32 was TERRIBLE in terms of read-latencies with the average latency
> >   and max latencies looking awful. The 90th percentile was close to 4
> >   seconds and as a result the graphs are even more of a complete mess than
> >   they might have been otherwise.
> > 
> >   Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
> >   reads were below 206ms but in 3.2 this had grown to 273ms. The latency
> >   of the other 5% results increased from 481ms to 774ms.
> > 
> >   3.4 is looking better at least.
>
>   Yeah, 3.4 looks OK and I'd be interested in 3.5 results since I've merged
> one more fix which should help the read latency.

When 3.5 comes out, I'll queue up the same tests. Ideally I would be
running against each rc but the machines are used for other tests as well
and these ones take too long for continual testing to be practical.

> But all in all it's hard
> to tackle the latency problems with ext3 - we have a journal which
> synchronizes all the writes so we write to it with a high priority
> (we use WRITE_SYNC when there's some contention on the journal). But that
> naturally competes with reads and creates higher read latency.
>  

Thanks for the good explanation. At least now I'll know to look out for this in
interactivity-related or IO-latency bugs.

> > <SNIP>
> > ==========================================================
> > Machine:	hydra
> > Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
> > Arch:		x86-64
> > CPUs:		1 socket, 4 threads
> > Model:		AMD Phenom II X4 940
> > Disk:		Single Rotary Disk
> > ==========================================================
> > 
> > fsmark-single
> > -------------
> >   Completion times are all over the place with a big increase in 3.2 that
> >   improved a bit since but not as good as 3.1 kernels were.
> > 
> >   Unlike arnold, 2.6.32 is not a complete mess and makes a comparison more
> >   meaningful. Our maximum latencies have jumped around a lot with 3.2
> >   being particularly bad and 3.4 not being much better. 3.1 and 3.3 were
> >   both good in terms of maximum latency.
> > 
> >   Average latency is shot to hell. In 2.6.32 it was 349ms and it's now 781ms.
> >   3.2 was really bad but it's not like 3.0 or 3.1 were fantastic either.
>
>   So I wonder what makes a difference between this machine and the previous
> one. The results seem completely different. Is it the amount of memory? Is
> it the difference in the disk? Or even the difference in the CPU?
> 

Two big differences are 32-bit versus 64-bit and the 32-bit machine having
4G of RAM and the 64-bit machine having 8G.  On the 32-bit machine, bounce
buffering may have been an issue but as -S0 was specified (no sync) there
would also be differences in when dirty page balancing took place.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Scheduler
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:12     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:12 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__scheduler-performance
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance
Benchmarks:	hackbench-pipes hackbench-sockets pipetest starve lmbench

Summary
=======

This is a mixed bag. The results on an i7 generally look great! There are
some major improvements in there and I think this may be due to scheduler
developers working with the latest chips. The other machines did not fare
as well. Look at pipetest on hydra for an example of a particularly bad
set of results.

Benchmark notes
===============

starve (http://www.hpl.hp.com/research/linux/kernel/o1-starve.php) was
  chosen because even though it is designed to isolate a bug in the O(1)
  scheduler, it is still interesting to monitor for performance regressions.
  It does not take any special parameters.

hackbench was chosen because it's a general scheduler benchmark that is
  sensitive to regressions in the scheduler fast-path. It is difficult
  to draw conclusions from as it is somewhat sensitive to the starting
  conditions of the machine but trends over time may be observed. It is
  run in both pipe and sockets mode and for each number of clients, it is
  run for 30 iterations.

pipetest is a scheduler ping-pong test that measures context switch latency.
  It runs for 30 iterations.

lmbench is just running the lat_ctx test and is another measure of context
  switch latency.
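
To illustrate what pipetest and lat_ctx are measuring, the core of such
a test is a pipe ping-pong. Below is a minimal sketch of the general
idea, not the actual pipetest source; the iteration count is arbitrary.

/*
 * Pipe ping-pong: two processes bounce one byte back and forth so
 * every hop forces a context switch. Sketch of the idea only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERS 100000L

int main(void)
{
	int ping[2], pong[2];
	struct timeval t0, t1;
	char c = 0;
	long i;

	if (pipe(ping) || pipe(pong))
		return 1;

	if (fork() == 0) {
		/* child: echo every byte straight back */
		close(ping[1]);
		close(pong[0]);
		for (;;) {
			if (read(ping[0], &c, 1) != 1)
				exit(0);
			write(pong[1], &c, 1);
		}
	}
	close(ping[0]);
	close(pong[1]);

	gettimeofday(&t0, NULL);
	for (i = 0; i < ITERS; i++) {
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	gettimeofday(&t1, NULL);

	printf("%.3f usec per round trip\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / ITERS);
	return 0;
}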

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Context switch latency is regressing.
===========================================================

starve is looking ok except for 3.0 and 3.1 where System CPU time and elapsed
	time increased. This was fixed in later kernels but worth noting
	for users of -stable.

lmbench showed a small regression in 3.0 where context switch latency was
	increased and this has not been recovered yet. 3.3.6 was particularly
	bad for low numbers of clients.

hackbench-pipes looks ok in comparison to 2.6.32. The "Time ratio" graph
	shows that kernels are below the red line reflecting that most
	kernels are faster. However, it also shows that 2.6.34 was the
	"best" kernel and recent kernels have regressed slightly

hackbench-sockets regressed badly after 2.6.34 until 3.3 which should be
	investigated. Again, this is most obvious in the Time Ratio graph.

pipetest is showing major regressions in latency since some time between 2.6.34
	and 2.6.39.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		pipetest is particularly bad.
==========================================================

starve is generally ok although again, 3.0 and 3.1 both regressed on System
	CPU time. This was improved on kernels after that but it's still
	a little worse than 2.6.32 was.

lmbench shows no regression in 3.0 unlike on arnold but later kernels are
	much worse with the latency of 3.4 being generally higher than it
	was in 3.2.

hackbench-pipes generally looks ok.

hackbench-sockets is generally bad. 3.1 was particularly bad and while
	3.4 has improved the situation a bit, it is still worse than 2.6.32.

pipetest is showing major regressions. 3.2 regressed particularly badly.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Generally great.
==========================================================

starve is generally ok. 3.0 regressed in terms of System CPU time but
	recent kernels are very good. This might reflect that a lot
	of people are testing with later Intel processors to the
	detriment of older models.

lmbench is looking superb.

hackbench-pipes looks great.

hackbench-sockets does not look as great but it's still very good.

pipetest is generally looking good in comparison to 2.6.32. However,
	I am concerned that 3.4 is worse than 3.3.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [MMTests] Sysbench read-only on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:13     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:13 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__io-sysbench-large-ro-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3
Benchmarks:	sysbench

Summary
=======

Very large number of regressions.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

sysbench is an OLTP-like benchmark. The test type was "complex" and
read-only. The table size was 50,000,000 rows regardless of memory size;
this far exceeds the memory size of any of the test machines. sysbench
was chosen because it is a reasonably complex OLTP-like benchmark with
straightforward prerequisites.

The backing database was postgres.
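
For anyone reproducing this outside the harness, the setup is roughly as
follows for sysbench 0.4. The exact flags vary between sysbench versions,
the postgres connection options are omitted, and the thread count and run
length here are illustrative:

  $ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=50000000 prepare
  $ sysbench --test=oltp --db-driver=pgsql --oltp-table-size=50000000 \
  >       --oltp-test-mode=complex --oltp-read-only=on \
  >       --num-threads=4 --max-time=600 --max-requests=0 run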

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

sysbench
--------
  Oddly, two clients performed better but one or four performed worse.

  Swapping for kernels 3.1 and 3.2 is extremely heavy. Direct reclaim has
  been active since 2.6.39 and has not eased off, although in the context
  of the overall test it is very low.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

sysbench
--------
  There are a lot of regressions here that were mostly introduced between
  2.6.39 and 3.0. In general, this is looking bad.

  Swapping in kernel 3.1 was higher.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

  Generally this is telling a much better story but this could be because
  of the much larger memory size of this machine offsetting some other
  regression.

  Swapping occurred in 3.1 and 3.2.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Sysbench read-only on ext4
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:14     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:14 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__io-sysbench-large-ro-ext4
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4
Benchmarks:	sysbench

Summary
=======

Looking better in places than ext3 but still of concern.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

sysbench is an OLTP-like benchmark. The test type was "complex" and
read-only. The table size was 50,000,000 rows regardless of memory size;
this far exceeds the memory size of any of the test machines. sysbench
was chosen because it is a reasonably complex OLTP-like benchmark with
straightforward prerequisites.

The backing database was postgres.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

sysbench
--------
  Generally regressed.

  Swapping for kernels 3.1 and 3.2 is very high.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

sysbench
--------
  For low numbers of clients, this has generally improved.

  Swapping in kernel 3.1 was high.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

  Generally this is telling a much better story but this could be because
  of the much larger memory size of this machine offsetting some other
  regression.

  Swapping occurred in 3.1 and 3.2.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Sysbench read-only on xfs
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:15     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__io-sysbench-large-ro-xfs
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs
Benchmarks:	sysbench

Summary
=======

Looking better in places than ext3 but still of concern.

Benchmark notes
===============

mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most part.
        On kernels too old to support delaylog, it was removed. On kernels
        where it was the default, it was specified and the warning ignored.
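
In shell terms, the filesystem setup amounts to something like the
following, with the device and mount point as placeholders:

  $ mkfs.xfs -f -d agcount=8 /dev/sdX1
  $ mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/sdX1 /mnt/xfs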

sysbench is an OLTP-like benchmark. The test type was "complex" and
read-only. The table size was 50,000,000 rows regardless of memory size;
this far exceeds the memory size of any of the test machines. sysbench
was chosen because it is a reasonably complex OLTP-like benchmark with
straightforward prerequisites.

The backing database was postgres.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

sysbench
--------
  Everything regressed.

  Swapping for kernels 3.1 and 3.2 was extremely heavy.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

sysbench
--------
  For low numbers of clients, this has generally improved but regressed
  for larger numbers of clients.

  Swapping in kernel 3.1 was high.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

  Generally this is telling a much better story but this could be because
  of the much larger memory size of this machine offsetting some other
  regression.

  Swapping occurred in 3.1 and 3.2.

-- 
Mel Gorman
SUSE Labs

* [MMTests] memcachetest and parallel IO on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:17     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__parallelio-memcachetest-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3
Benchmarks:	parallelio

Summary
=======

  Indications are not very clear as different machines point to different
  kernels. Very broadly speaking, swapping got worse between 2.6.39 and 3.0
  and then again between 3.2 and 3.3.

Benchmark notes
===============

This is an experimental benchmark designed to measure the impact of
background IO on a target workload.

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

The target workload in this case is memcached driven by memcachetest. This
is a benchmark of memcached and the workload is mostly anonymous. The
benchmark client was chosen because it is considered a valid benchmark
for memcached and does not consume much memory itself. The server was
configured to use 80% of memory.
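
As a sketch, the server side amounts to starting memcached with a cache
sized at 80% of RAM. The figure below assumes a 4G machine and is purely
illustrative:

  # 80% of 4096M is roughly 3276M
  $ memcached -u nobody -m 3276 -d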

In the background, dd is used to generate IO of varying sizes. As the sizes
increase, memory pressure may push the target workload out of memory. The
benchmark is meant to measure how much the target workload is affected
and may be used as a proxy measure for page reclaim decisions.

Unlike other benchmarks, only the run with the worst throughput is displayed.
This benchmark varies quite a bit depending on the reference pattern from
the client, which can hide the interesting result in the noise, so only
the worst case is considered.
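
A minimal hand-rolled equivalent of the background IO generator is dd
writing files of increasing size while the client runs. The sizes and
path here are illustrative, not the exact steps the harness takes:

  $ for size in 512 1024 1624 2048; do
  >     dd if=/dev/zero of=/mnt/ext3/ddfile-$size bs=1M count=$size conv=fsync
  > done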

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

parallelio-memcachetest
-----------------------
  Even for small amounts of background IO the memcached process is being
  pushed into swap. This is due to a regression somewhere between 2.6.34
  and 2.6.39 and a much larger regression between 2.6.39 and 3.0.  This is
  even worse in 3.3 and 3.4.

  The "page reclaim immediate" figures started increasing from 3.2 implying
  that a lot of dirty LRU pages are reaching the end of the LRU lists.
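
  For 3.2 and later kernels this can be watched directly while a test
  runs, assuming the "page reclaim immediate" figures map onto this
  vmstat counter:

    $ grep nr_vmscan_immediate_reclaim /proc/vmstat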

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------
  Performance was reasonable until relatively recent kernels. The results
  show that for 3.3 and later kernels, swapping started for moderate
  amounts of IO (1624M) and performance dropped off sharply as a result.

  As with arnold, dirty pages are reaching the end of the LRU list.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------
  Here everything smells of roses and the IO is not interfering
  at all. It is possible that this is due to the amount of memory and that
  the IO is being completed fast enough.

-- 
Mel Gorman
SUSE Labs

* [MMTests] memcachetest and parallel IO on xfs
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:19     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__parallelio-memcachetest-xfs
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs
Benchmarks:	parallelio

Summary
=======

Indications are that there was a large regression in page reclaim decisions
between 2.6.39 and 3.0 as swapping increased a lot.

Benchmark notes
===============

This is an experimental benchmark designed to measure the impact of
background IO on a target workload.

mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most part.
        On kernels too old to support delaylog, it was removed. On kernels
        where it was the default, it was specified and the warning ignored.

The target workload in this case is memcached driven by memcachetest. This
is a benchmark of memcached and the workload is mostly anonymous. The
benchmark client was chosen because it is considered a valid benchmark
for memcached and does not consume much memory itself. The server was
configured to use 80% of memory.

In the background, dd is used to generate IO of varying sizes. As the sizes
increase, memory pressure may push the target workload out of memory. The
benchmark is meant to measure how much the target workload is affected
and may be used as a proxy measure for page reclaim decisions.

Unlike other benchmarks, only the run with the worst throughput is displayed.
This benchmark varies quite a bit depending on the reference pattern from
the client, which can hide the interesting result in the noise, so only
the worst case is considered.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

parallelio-memcachetest
-----------------------

  Even for small amounts of background IO the memcached process is being
  pushed into swap for 3.3 and 3.4 although earlier kernels fared better.
  There are indications that there was a serious regression between 2.6.39
  and 3.0 as throughput dropped for larger amounts of IO and swapping was
  high.

  The "page reclaim immediate" figures started increasing from 3.2 implying
  that a lot of dirty LRU pages are reaching the end of the LRU lists.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------

  Performance again dropped sharply between 2.6.39 and 3.0, with huge jumps
  in the amount of swap IO.

  As with arnold, dirty pages are reaching the end of the LRU list.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

  No results available.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Stress high-order allocations on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:20     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:20 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Configuration:	global-dhp__stress-highalloc-performance-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3
Benchmarks:	kernbench vmr-stream sysbench stress-highalloc

Summary
=======

Allocation success rates of huge pages were looking great until 3.4 when
they dropped through the floor.

Benchmark notes
===============

All machines were booted with mem=4096M due to limitations of the test.

This is an old series of benchmarks that stressed anti-fragmentation
and the allocation of huge pages. It is being replaced with other series
of tests that will be more representative, but it still produces some
interesting results. I tend to use these results as an early warning
system before doing a more detailed series of tests.

Only the results from the stress-highalloc benchmark are actually of
interest and the other benchmarks are just there to age the machine
in terms of fragmentation.
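
As a monitoring aid between runs, the availability of huge-page-sized
blocks can be eyeballed from /proc/buddyinfo. A rough helper, assuming
2M huge pages (order 9) on x86-64:

  # fields 5 onwards are free block counts for orders 0 to 10
  $ awk '/zone/ { print $2, $4, "order-9 free:", $(5+9) }' /proc/buddyinfo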

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

stress-highalloc
----------------

Generally this is going in the right direction. High-order allocations
are reasonably successful and, where success rates dropped, this was matched
by a large reduction in the length of time it takes to complete the test.
Success rates in 3.4 did drop sharply though.


==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

stress-highalloc
----------------

Until 3.4, this was looking good. Unfortunately in 3.4 there was a massive
drop in success rates. This correlates with the removal of lumpy reclaim,
which compaction indirectly depended upon. This strongly indicates that
not enough memory is being reclaimed for compaction to make forward
progress, or that compaction is routinely being disabled due to failed
attempts at compaction.
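
Whether compaction is stalling and failing routinely can be checked from
the compaction counters in /proc/vmstat on these kernels; a quick check,
not part of the report generation:

  $ grep -E 'compact_(stall|fail|success)' /proc/vmstat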

The success rates at the end of the test when the machine is idle are
still high, implying that anti-fragmentation itself is still working
as expected.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

As with hydra, this was looking good until 3.4 and then success rates dropped
through the floor.

-- 
Mel Gorman
SUSE Labs

* [MMTests] dbench4 async on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:21     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:21 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-dbench4-async-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
Benchmarks:	dbench4

Summary
=======

In general there was a massive drop in throughput after 3.0. Very broadly
speaking it looks like the Read operation got faster but at the cost of
a big regression in the Flush operation.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

dbench 4 was used. Tests ran for 180 seconds once warmed up. A varying
number of clients were used up to 64*NR_CPU. osync, sync-directory and
fsync were all off.
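
As a rough reproduction outside the harness, a single data point looks
something like the following; the directory and client count are
illustrative and the harness steps the client count up to 64*NR_CPU:

  # 180 second run, 64 clients
  $ dbench -t 180 -D /mnt/ext3 64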

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

dbench4
-------

  Generally worse, with a big drop in throughput after 3.0 for small numbers
  of clients. In some cases there is an improvement in latency for 3.0
  and later kernels, but not always.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

dbench4
-------
  Similar to arnold, there is a big drop in throughput after 3.0 for small
  numbers of clients. Unlike arnold, this is matched by an improvement in
  latency, so it may be the case that IO is fairer even if dbench complains
  about the latency. Very broadly speaking, it looks like the Read
  operation got a lot faster but Flush got a lot slower.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

dbench4
-------
  Same story: a big drop in throughput after 3.0, with Flush again looking
  very expensive for 3.1 and later kernels. Latency figures are a mixed bag.

-- 
Mel Gorman
SUSE Labs

* [MMTests] dbench4 async on ext4
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:23     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:23 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration:	global-dhp__io-dbench4-async-ext4
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4
Benchmarks:	dbench4

Summary
=======

Nothing majorly exciting, although throughput has been declining
slightly in a number of cases. However, this is not consistent
between machines and latency has also been variable. Broadly
speaking, there is no need to take any action here.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

dbench 4 was used. Tests ran for 180 seconds once warmed up. A varying
number of clients were used up to 64*NR_CPU. osync, sync-directory and
fsync were all off.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
===========================================================

dbench4
-------

  In very vague terms, throughput has been getting worse over time but
  it's very gradual. Latency has also been getting worse.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok
==========================================================

dbench4
-------

  This is a mixed bag: there are gains and losses, and it is hard to draw
  any meaningful conclusion.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		
==========================================================

dbench4
-------

  For the most part, there are few changes of note. Latency has
  been getting better, particularly in 3.2 and later kernels.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Threaded IO Performance on ext3
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:24     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:24 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel


Configuration:	global-dhp__io-threaded-ext3
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3
Benchmarks:	tiobench

Summary
=======

Some good results but some 3.x kernels were bad and this varied between
machines. In some, 3.1 and 3.2 were particularly bad. 3.4 regressed on
one machine with a large amount of memory.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

The size parameter for tiobench was 2*RAM. This is barely sufficient for
        this particular test, where the size parameter should be several
        times the size of memory. The running time of the benchmark is
        already excessive and this is not likely to be changed.
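
A hand-run equivalent for a machine with 2G of RAM would be along these
lines, using tiobench's perl wrapper with the size given in megabytes;
the directory and thread counts are illustrative:

  $ tiobench.pl --dir /mnt/ext3 --size 4096 --threads 1 --threads 8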

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
==========================================================

tiobench
--------
  This has regressed in almost all cases although for this machine the
  main damage was between 2.6.32 and 2.6.34. 3.2.9 performed particularly
  badly. It's interesting to note that 3.1 and 3.2 kernels both swapped
  and unexpected swapping has been seen in other tests.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

tiobench
--------
  This is a mixed bag. For low numbers of clients, throughput on
  sequential reads has improved, with the exception of 3.2.9 which
  was a disaster. For larger numbers of clients, it is a mix of
  gains and losses. This could be due to weakness in the methodology,
  with both a small filesize and a small number of iterations.

  Random read has improved.

  With the exception of 3.2.9, sequential writes have generally
  improved.

  Random write has a number of regressions and 3.2.9 is a disaster.

  Kernels 3.1 and 3.2 had unexpected swapping.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

tiobench
--------

  Like hydra, sequential reads were generally better for low numbers of
  clients. 3.4 is notable in that it regressed. Unlike hydra, where the
  first bad kernel for sequential reads was 3.2, here it was 3.1. There
  are differences in the memory sizes, and therefore the filesizes, which
  implies that there is not a single cause of the regression.

  Random read has improved.

  Sequential writes have generally improved, although it is interesting
  to note that 3.1 regressed. 3.4 is better than 2.6.32 but has regressed
  in comparison to 3.3.

  Random write has generally improved but again 3.4 is worse than 3.3.

  Like the other machines, 3.1 and 3.2 saw unexpected swapping.

-- 
Mel Gorman
SUSE Labs

* [MMTests] Threaded IO Performance on xfs
  2012-06-29 11:19   ` Mel Gorman
@ 2012-07-23 21:25     ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:25 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, linux-fsdevel, xfs

Configuration:	global-dhp__io-threaded-xfs
Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs
Benchmarks:	tiobench

Summary
=======

There have been many improvements in the sequential read/write case but
3.4 is noticeably worse than 3.3 in a number of cases.

Benchmark notes
===============

mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most part.
        On kernels too old to support delaylog, it was removed from the
        mount options. On kernels where it was the default, it was still
        specified and the resulting warning was ignored.

The size parameter for tiobench was 2*RAM. This is barely sufficient for
	this particular test, where the size parameter should be several
	times the size of memory. The running time of the benchmark is
	already excessive and this is not likely to be changed.
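
For reference, the setup described above corresponds roughly to the
following. This is a sketch: the device, mount point, memory size and
thread count are placeholders, and it assumes the tiobench.pl wrapper
with its usual --dir/--size/--threads options rather than whatever
MMTests invokes internally.

  # Hypothetical reconstruction of the filesystem setup described above;
  # /dev/sdb1 and /mnt/tiobench are placeholder names.
  mkfs.xfs -f -d agcount=8 /dev/sdb1
  mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/sdb1 /mnt/tiobench

  # tiobench sized at 2*RAM, e.g. 8192MB on a machine with 4G of memory.
  tiobench.pl --dir /mnt/tiobench --size 8192 --threads 8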

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
==========================================================

tiobench
--------
  This is a mixed bag. For low numbers of clients, throughput on
  sequential reads has improved. For larger numbers of clients, there
  are many regressions, but they are not consistent. This could be due
  to a weakness in the methodology, namely the combination of a small
  filesize and a small number of iterations.

  Random read is generally bad.

  For many kernels, sequential write is good, with the notable exception
  of the 2.6.39 and 3.0 kernels.

  There was unexpected swapping on 3.1 and 3.2 kernels.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
==========================================================

tiobench
--------

  Like arnold, performance for sequential read is good for low numbers
  of clients.

  Random read looks good.

  With the exception of 3.0 in general, and of single-threaded writes on
  all kernels, sequential writes have generally improved.

  Random write has a number of regressions.

  Kernels 3.1 and 3.2 had unexpected swapping.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
==========================================================

tiobench
--------

  Like hydra, sequential reads were generally better for low numbers of
  clients. 3.4 is notable in that it regressed, and 3.1 was also bad,
  which is roughly similar to what was seen on ext3. The machines differ
  in memory size, and therefore in filesize, which implies that there is
  not a single cause of the regression.

  Random read has generally improved, with the obvious exception of the
  single-threaded case.

  Sequential writes have generally improved, but it is interesting to
  note that 3.4 is worse than 3.3; the same was seen on ext3.

  Random write is a mixed bag but again 3.4 is worse than 3.3.

  Like the other machines, 3.1 and 3.2 saw unexpected swapping.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] Sysbench read-only on ext3
  2012-07-23 21:13     ` Mel Gorman
@ 2012-07-24  2:29       ` Mike Galbraith
  -1 siblings, 0 replies; 108+ messages in thread
From: Mike Galbraith @ 2012-07-24  2:29 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

On Mon, 2012-07-23 at 22:13 +0100, Mel Gorman wrote:

> The backing database was postgres.

FWIW, that wouldn't have been my choice.  I don't know if it still does,
but it used to use userland spinlocks to achieve scalability.  Turning
your CPUs into space heaters to combat concurrency issues makes a pretty
flat graph, but probably doesn't test kernels as well as something that
did not do that.

-Mike



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] Sysbench read-only on ext3
  2012-07-24  2:29       ` Mike Galbraith
@ 2012-07-24  8:19         ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-24  8:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-mm, linux-kernel

On Tue, Jul 24, 2012 at 04:29:29AM +0200, Mike Galbraith wrote:
> On Mon, 2012-07-23 at 22:13 +0100, Mel Gorman wrote:
> 
> > The backing database was postgres.
> 
> FWIW, that wouldn't have been my choice.  I don't know if it still does,
> but it used to use userland spinlocks to achieve scalability. 

The tests used to support mysql but the code bit-rotted and eventually
got deleted. I'm not going to get into a mysql vs postgres discussion on
which is better :O

Were you thinking of mysql or something else as an alternative?
Completely different test?

> Turning
> your CPUs into space heaters to combat concurrency issues makes a pretty
> flat graph, but probably doesn't test kernels as well as something that
> did not do that.
> 

I did not check the source, but even if that is true, your comment only
applies to testing the scalability of locking. If someone really cares to
check, the postgres version was 9.0.4. However, even if it is using
user-space locking, the test is still useful for looking at IO
performance, page reclaim decisions and so on.
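
For anyone who wants to rerun it, the read-only OLTP workload is along
these lines. This is a sketch assuming sysbench 0.4-era options and a
pgsql-enabled sysbench build; the table size, thread count and run time
are illustrative placeholders, not the values MMTests uses:

  # Prepare a postgres-backed OLTP table, then run the read-only workload.
  # Assumes sysbench was built with pgsql support; sizes are placeholders.
  sysbench --test=oltp --db-driver=pgsql --oltp-table-size=1000000 prepare
  sysbench --test=oltp --db-driver=pgsql --oltp-read-only=on \
           --num-threads=8 --max-time=300 --max-requests=0 run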

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] Sysbench read-only on ext3
  2012-07-24  8:19         ` Mel Gorman
@ 2012-07-24  8:32           ` Mike Galbraith
  -1 siblings, 0 replies; 108+ messages in thread
From: Mike Galbraith @ 2012-07-24  8:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

On Tue, 2012-07-24 at 09:19 +0100, Mel Gorman wrote: 
> On Tue, Jul 24, 2012 at 04:29:29AM +0200, Mike Galbraith wrote:
> > On Mon, 2012-07-23 at 22:13 +0100, Mel Gorman wrote:
> > 
> > > The backing database was postgres.
> > 
> > FWIW, that wouldn't have been my choice.  I don't know if it still does,
> > but it used to use userland spinlocks to achieve scalability. 
> 
> The tests used to support mysql but the code bit-rotted and eventually
> got deleted. I'm not going to get into a mysql vs postgres discussion on
> which is better :O
> 
> Were you thinking of mysql or something else as an alternative?
> Completely different test?

Which db is under the hood doesn't matter much, but those spinlocks got
me thinking.

> > Turning
> > your CPUs into space heaters to combat concurrency issues makes a pretty
> > flat graph, but probably doesn't test kernels as well as something that
> > did not do that.
> > 
> 
> I did not check the source, but even if that is true, your comment only
> applies to testing the scalability of locking. If someone really cares to
> check, the postgres version was 9.0.4. However, even if it is using
> user-space locking, the test is still useful for looking at IO
> performance, page reclaim decisions and so on.

I was thinking that while you're spinning in userspace, you're not giving
the kernel any decisions to make.  But you're right: if they didn't have
spinning locks, they'd have sleeping locks.  With spinning locks they can
be less smart, I suppose.

-Mike



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] dbench4 async on ext3
  2012-07-23 21:21     ` Mel Gorman
@ 2012-08-16 14:52       ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2012-08-16 14:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Mon 23-07-12 22:21:46, Mel Gorman wrote:
> Configuration:	global-dhp__io-dbench4-async-ext3
> Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
> Benchmarks:	dbench4
> 
> Summary
> =======
> 
> In general there was a massive drop in throughput after 3.0. Very broadly
> speaking it looks like the Read operation got faster but at the cost of
> a big regression in the Flush operation.
  This looks bad. Also, quickly looking through the changelogs, I don't
see any change which could cause this. I'll try to reproduce it and track
it down.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] dbench4 async on ext3
  2012-07-23 21:21     ` Mel Gorman
@ 2012-08-21 22:00       ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2012-08-21 22:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Mon 23-07-12 22:21:46, Mel Gorman wrote:
> Configuration:	global-dhp__io-dbench4-async-ext3
> Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
> Benchmarks:	dbench4
> 
> Summary
> =======
> 
> In general there was a massive drop in throughput after 3.0. Very broadly
> speaking it looks like the Read operation got faster but at the cost of
> a big regression in the Flush operation.
  Mel, I had a look into this and it's actually very likely only a
configuration issue. In 3.1, ext3 started defaulting to barriers enabled
(barrier=1 in the mount options), which is a safer but slower choice.
When I set barriers explicitly, I see no performance difference for
dbench4 between 3.0 and 3.1.
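
The behaviour can be pinned on either side of the change by specifying
the option explicitly at mount time; a sketch, with placeholder device
and mount point:

  # Pin ext3 barrier behaviour instead of relying on the kernel default,
  # which flipped in 3.1. /dev/sdb1 and /mnt/test are placeholders.
  mount -t ext3 -o barrier=1 /dev/sdb1 /mnt/test  # safer, slower; 3.1+ default
  mount -t ext3 -o barrier=0 /dev/sdb1 /mnt/test  # faster; pre-3.1 behaviour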

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [MMTests] dbench4 async on ext3
  2012-08-21 22:00       ` Jan Kara
@ 2012-08-22 10:48         ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-08-22 10:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Wed, Aug 22, 2012 at 12:00:38AM +0200, Jan Kara wrote:
> On Mon 23-07-12 22:21:46, Mel Gorman wrote:
> > Configuration:	global-dhp__io-dbench4-async-ext3
> > Result: 	http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
> > Benchmarks:	dbench4
> > 
> > Summary
> > =======
> > 
> > In general there was a massive drop in throughput after 3.0. Very broadly
> > speaking it looks like the Read operation got faster but at the cost of
> > a big regression in the Flush operation.
>
>   Mel, I had a look into this and it's actually very likely only a
> configuration issue. In 3.1, ext3 started defaulting to barriers enabled
> (barrier=1 in the mount options), which is a safer but slower choice.
> When I set barriers explicitly, I see no performance difference for
> dbench4 between 3.0 and 3.1.
> 

I've confirmed that disabling barriers fixed it, for one test machine and
one test at least. I'll reschedule the tests to run with barriers disabled
at some point in the future. Thanks for tracking it down; it would have
been at least two weeks before I got the chance to even look.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2012-08-22 10:54 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-20 11:32 MMTests 0.04 Mel Gorman
2012-06-29 11:19 ` Mel Gorman
2012-06-29 11:21   ` [MMTests] Page allocator Mel Gorman
2012-06-29 11:22   ` [MMTests] Network performance Mel Gorman
2012-06-29 11:23   ` [MMTests] IO metadata on ext3 Mel Gorman
2012-06-29 11:24   ` [MMTests] IO metadata on ext4 Mel Gorman
2012-06-29 11:25   ` [MMTests] IO metadata on XFS Mel Gorman
2012-07-01 23:54     ` Dave Chinner
2012-07-02  6:32       ` Christoph Hellwig
2012-07-02 14:32         ` Mel Gorman
2012-07-02 19:35           ` Mel Gorman
2012-07-03  0:19             ` Dave Chinner
2012-07-03 10:59               ` Mel Gorman
2012-07-03 11:44                 ` Mel Gorman
2012-07-03 12:31                 ` Daniel Vetter
2012-07-03 13:08                   ` Mel Gorman
2012-07-03 13:28                   ` Eugeni Dodonov
2012-07-04  0:47                 ` Dave Chinner
2012-07-04  9:51                   ` Mel Gorman
2012-07-03 13:04             ` Mel Gorman
2012-07-03 14:04               ` Daniel Vetter
2012-07-02 13:30       ` Mel Gorman
2012-07-04 15:52   ` [MMTests] Page reclaim performance on ext3 Mel Gorman
2012-07-04 15:53   ` [MMTests] Page reclaim performance on ext4 Mel Gorman
2012-07-04 15:53   ` [MMTests] Page reclaim performance on xfs Mel Gorman
2012-07-05 14:56   ` [MMTests] Interactivity during IO on ext3 Mel Gorman
2012-07-10  9:49     ` Jan Kara
2012-07-10 11:30       ` Mel Gorman
2012-07-05 14:57   ` [MMTests] Interactivity during IO on ext4 Mel Gorman
2012-07-23 21:12   ` [MMTests] Scheduler Mel Gorman
2012-07-23 21:13   ` [MMTests] Sysbench read-only on ext3 Mel Gorman
2012-07-24  2:29     ` Mike Galbraith
2012-07-24  8:19       ` Mel Gorman
2012-07-24  8:32         ` Mike Galbraith
2012-07-23 21:14   ` [MMTests] Sysbench read-only on ext4 Mel Gorman
2012-07-23 21:15   ` [MMTests] Sysbench read-only on xfs Mel Gorman
2012-07-23 21:17   ` [MMTests] memcachetest and parallel IO on ext3 Mel Gorman
2012-07-23 21:19   ` [MMTests] memcachetest and parallel IO on xfs Mel Gorman
2012-07-23 21:20   ` [MMTests] Stress high-order allocations on ext3 Mel Gorman
2012-07-23 21:21   ` [MMTests] dbench4 async on ext3 Mel Gorman
2012-08-16 14:52     ` Jan Kara
2012-08-21 22:00     ` Jan Kara
2012-08-22 10:48       ` Mel Gorman
2012-07-23 21:23   ` [MMTests] dbench4 async on ext4 Mel Gorman
2012-07-23 21:24   ` [MMTests] Threaded IO Performance on ext3 Mel Gorman
2012-07-23 21:25   ` [MMTests] Threaded IO Performance on xfs Mel Gorman
