* MMTests 0.04
@ 2012-06-20 11:32 Mel Gorman
From: Mel Gorman @ 2012-06-20 11:32 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

MMTests 0.04 is a configurable test suite that runs a number of common
workloads of interest to MM developers. Apparently I never sent a release
note for 0.03, so here is the changelog for both.

v0.04
o Add benchmarks for tbench, pipetest, lmbench, starve, memcachedtest
o Add basic benchmark to run the trinity fuzz testing tool
o Add monitor that runs parallel IO in the background. Measures how much
  IO interferes with a target workload.
o Allow limited run of sysbench to save time
o Add helpers for running oprofile, taken from libhugetlbfs
o Add fsmark configurations suitable for page reclaim and metadata tests
o Add a mailserver simulator (needs work, takes too long to run)
o Tune page fault test configuration for page allocator comparisons
o Allow greater skew when running STREAM on NUMA machines
o Add a monitor that roughly measures interactive app startup times
o Add a monitor that tracks read() latency (useful for interactivity tests)
o Add script for calculating quartiles (incomplete, not tested properly)
o Add config examples for measuring interactivity during IO (not validated)
o Add background allocator for hugepage allocations (not validated)
o Patch SystemTap installation to work with 3.4 and later kernels
o Allow use of out-of-box THP configuration

v0.03
o Add a page allocator micro-benchmark
o Add monitor for tracking processes stuck in D state
o Add a page fault micro-benchmark
o Add a memory compaction micro-benchmark
o Patch a tiobench divide-by-0 error
o Adapt systemtap for >= 3.3 kernels
o Reporting scripts for kernbench
o Reporting scripts for ddresidency

At some point at LSF/MM, a request was made to identify a series of tests
that were of interest to MM developers and that could be used for testing
the Linux memory management subsystem.
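The quartiles script in the v0.04 changelog is flagged as incomplete and not
properly tested; as an illustration of the calculation involved (this is my
own sketch, not the MMTests script), reporting quartiles over a set of
benchmark samples might look like:

```python
# Illustrative sketch only -- not the actual MMTests quartile script.
# Reports (min, Q1, median, Q3, max) for a list of benchmark samples.

def quartiles(samples):
    """Return (min, q1, median, q3, max) using linear interpolation."""
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        k = (len(s) - 1) * p
        f = int(k)
        c = min(f + 1, len(s) - 1)
        return s[f] + (s[c] - s[f]) * (k - f)

    return (s[0], percentile(0.25), percentile(0.5), percentile(0.75), s[-1])

if __name__ == "__main__":
    # Made-up elapsed times with one outlier to show the spread.
    print(quartiles([12.1, 11.8, 12.4, 30.0, 12.0]))
```

The five-number summary is handy precisely because, unlike a plain mean, the
outlier only shows up in the max.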
There is renewed interest in some sort of general testing framework during
discussions for Kernel Summit 2012, so here is what I use.

http://www.csn.ul.ie/~mel/projects/mmtests/
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.04-mmtests-0.01.tar.gz

In this release a number of stock configuration options were added. For
example, config-global-dhp__pagealloc-performance runs a number of tests
that may be able to identify performance regressions or gains in the page
allocator. Similarly, there are network and scheduler configs. There are
also more complex options. config-global-dhp__parallelio-memcachetest
runs memcachetest in the foreground while doing IO of different sizes in
the background, to measure how much unrelated IO affects the throughput of
an in-memory database.

This release is a little half-baked, but I decided to release it anyway due
to current discussions. By my own admission there are areas that need
cleaning up and there is some serious cut&paste-itis going on in parts. I
wanted to get that all fixed up before releasing, but that could take too
long. The biggest warts by far are in how reports are generated: a new
report can be cranked out in 3 minutes, whereas doing it properly would
require a redesign. What should have happened is that the stats generation
and the reporting be completely separated, but that can still be fixed
because the raw data is captured. The stats reporting in general needs more
work because, while some tests know how to make a better estimate of the
mean by filtering outliers, this is not handled consistently and the
methodology needs work. The raw data is there, which I considered to be the
higher priority initially.

I ran a number of tests against kernels since 2.6.32 and there is a lot of
interesting stuff in there. Unfortunately I have not had the chance to dig
through it all and validate that all the tests are working exactly as
expected, so they are not all available.
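The inconsistent outlier filtering mentioned above is essentially a
trimmed-mean problem. A minimal sketch of the idea (my illustration, not
MMTests code): drop a fraction of samples from each extreme before
averaging.

```python
# Illustration only: estimate a mean while discarding extreme outliers,
# the kind of filtering some MMTests tests apply (inconsistently, today).

def trimmed_mean(samples, trim=0.1):
    """Mean of samples after dropping the top and bottom `trim` fraction."""
    s = sorted(samples)
    k = int(len(s) * trim)            # samples to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

if __name__ == "__main__":
    # One wild outlier barely moves the trimmed estimate.
    samples = [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 9.8, 10.1, 10.0, 55.0]
    print("plain mean:  ", sum(samples) / len(samples))
    print("trimmed mean:", trimmed_mean(samples))
```

Applying something like this uniformly across tests, rather than per-test
ad hoc filtering, is the kind of consistency the methodology is missing.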
However, this is an example report for one test configuration on one
machine. It's a bit ugly, but that was not a high priority. The other tests
work on a similar principle.

http://www.csn.ul.ie/~mel/postings/mmtests-20120620/global-dhp__pagealloc-performance/comparison.html

Just glancing through, it's possible to see interesting things and
additional investigation work that is required.

o Something awful happened in 3.2.9 across the board according to this
  machine
o Kernbench in 3.3 and 3.4 was still not great in comparison to 3.1
o Page allocator performance was ruined for a large number of releases but
  generally improved in 3.3, although it's still a bit all over the
  place [*]
o hackbench-pipes had a fun history but is mostly good in 3.4
o hackbench-sockets had a similarly fun history
o aim9 shows that page-test took a big drop after 2.6.32 and has not
  recovered yet. Some of the other tests are also very alarming
o STREAM is ok at least, but that is not heavily dependent on the kernel

[*] It was this report that led to commit cc9a6c877 and the effect is
visible if you squint hard enough. Needs a graph generator and a
double-check that true-mean is measuring the right thing.

-- 
Mel Gorman
SUSE Labs
* Re: MMTests 0.04
@ 2012-06-29 11:19 Mel Gorman
From: Mel Gorman @ 2012-06-29 11:19 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

On Wed, Jun 20, 2012 at 12:32:52PM +0100, Mel Gorman wrote:
> MMTests 0.04 is a configurable test suite that runs a number of common
> workloads of interest to MM developers. Apparently I never sent a release
> note for 0.03 so here is the changelog for both

Using MMTests 0.04, I ran a number of tests between 2.6.32 and 3.4 on three
test machines. None of them are particularly powerful, but the results are
still useful because it's worth knowing how we are doing for some ordinary
cases over time. There were 34 test configurations in all, taking between 3
and 5 days to run all the tests for a single kernel. I expect that not all
the results will be useful when I look closer, but that can be improved.

I have not looked at all the results yet and will only talk about the ones
I have had a chance to read. I know the presentation is ugly, but making it
pretty was not a high priority. The analysis is also superficial, as a full
analysis of any of these tests is time consuming. In general the stats need
improving, but this too can be improved over time now that the raw data is
being collected. Right now I tend to look closer at the data when I am
trying to narrow a problem down to a specific area or when a regression
might have been introduced. When this happens I can usually apply whatever
stats I need manually or rerun the specific test with additional
monitoring, which is less than ideal for automation. Due to the superficial
nature of the analysis, I suggest you take these summaries with a grain of
salt and draw your own conclusions.

-- 
Mel Gorman
SUSE Labs
* [MMTests] Page allocator
@ 2012-06-29 11:21 Mel Gorman
From: Mel Gorman @ 2012-06-29 11:21 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__pagealloc-performance
Benchmarks:    kernbench vmr-aim9 vmr-stream pagealloc pft
               hackbench-pipes hackbench-sockets

Summary
=======
kernbench and aim9 are looking bad in a lot of areas. The page allocator
itself was in very bad shape for a long time, but this has improved in 3.4.
If there are reports of page allocator intensive workloads suffering badly
in recent kernels then it may be worth backporting the barrier fix.

Benchmark notes
===============
kernbench is a simple average of five compiles of vmlinux.

vmr-aim9 is a number of micro-benchmarks. The results are very sensitive to
a number of factors, but it can be a useful early warning system.

vmr-stream is the STREAM memory benchmark, and variations in it can be
indicative of problems with cache usage.

pagealloc is a page allocator microbenchmark run via SystemTap. The page
allocator is rarely a major component of a workload's time, but it can be a
source of slow degradation of overall performance.

pft is a microbenchmark for page fault rates.

hackbench is usually used for scheduler comparisons, but it can sometimes
highlight problems in the page allocator as well.

===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagealloc-performance/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Good, except for aim9
===========================================================

kernbench
---------
2.6.32 looks quite bad, which was surprising. That aside, there was a major
degradation of performance between 2.6.34 and 2.6.39 that is only being
resolved now. System CPU time was steadily getting worse for quite some
time.
pagealloc
---------
Page allocator performance was completely screwed for a long time, with
massive additional latencies in the alloc path. This was fixed in 3.4 by
removing barriers introduced for cpusets.

hackbench-pipes
---------------
Generally looks ok.

hackbench-sockets
-----------------
Some very poor results, although it seems to have recovered recently.
2.6.39 through to 3.2 were all awful.

vmr-aim9
--------
page_test, brk_test, exec_test and fork_test all took a major pounding
between 2.6.34 and 2.6.39. They have been improving since but are still far
short of 2.6.32 levels in some cases.

vmr-stream
----------
Generally looks ok.

pft
---
Indications are that we scaled better over time, with a greater number of
faults being handled when spread across CPUs.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Bad, both kernbench and aim9 need care
==========================================================

kernbench
---------
2.6.32 looked quite bad and 2.6.34 looked great. Between 2.6.34 and 2.6.39
it regressed again and got worse after that. System CPU time looks
generally good, but 3.2 and later kernels are in bad shape in terms of
overall elapsed time.

pagealloc
---------
As with arnold, page allocator performance was completely screwed for a
long time but mostly resolved in 3.4.

hackbench-pipes
---------------
This has varied considerably over time. It is currently looking good, but
there was a time when a high number of clients regressed considerably.
Judging from when it got fixed, this might be a scheduler problem rather
than a page allocator one.

hackbench-sockets
-----------------
This is marginal at the moment and has had some serious regressions in the
past.
vmr-aim9
--------
As with arnold, a lot of tests took a complete hammering, mostly between
2.6.34 and 2.6.39, with the exception of exec_test which got screwed at
2.6.34 as well. As on arnold, it has improved in 3.4 but is still far short
of 2.6.32.

vmr-stream
----------
Generally looks ok.

pft
---
Unlike arnold, the figures are worse here. It looks like we were not
handling as many faults for some time, although this is better now. It
might be related to the page allocator being crap for a long time.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Bad, both kernbench and aim9 need care
==========================================================

kernbench
---------
As before, 2.6.32 looked bad. 2.6.34 was good, but we got worse after that.
Elapsed time in 3.4 is screwed.

pagealloc
---------
As with the other two machines, page allocator performance was bad for a
long time, though not quite as bad as on the others. Maybe barriers are
cheaper on the i7 than they are on the other machines. Still, 3.4 is
looking good.

hackbench-pipes
---------------
Looks great. I suspect a lot of scheduler developers must have modern Intel
CPUs for testing with.

hackbench-sockets
-----------------
Not so great. Performance dropped for a while but is looking marginally
better now.

vmr-aim9
--------
A variation of the same story. In general this is looking worse, but it was
not as consistently bad as on the other two machines. Performance in 3.4 is
a mixed bag.

vmr-stream
----------
Generally looks ok.

pft
---
Generally looking good; tests are completing faster. There were regressions
in old kernels but it has been looking better recently.

-- 
Mel Gorman
SUSE Labs
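The per-kernel comparisons above all boil down to relative change against a
baseline kernel, with the sign convention depending on whether the metric
is throughput-like or elapsed-time-like. A throwaway sketch (the numbers
are invented, not taken from the reports):

```python
# Illustration: relative change of a result versus a baseline kernel.
# The sample figures below are made up, not taken from the reports.

def percent_change(baseline, current, higher_is_better=True):
    """Positive means improvement, negative means regression."""
    delta = (current - baseline) / baseline * 100.0
    return delta if higher_is_better else -delta

if __name__ == "__main__":
    # Throughput-style metric (higher is better): 90 vs 100 regressed 10%.
    print(percent_change(100.0, 90.0))
    # Elapsed-time metric (lower is better): 110s vs 100s also regressed.
    print(percent_change(100.0, 110.0, higher_is_better=False))
```

Keeping the sign convention per-metric is what lets summaries like "3.4 has
mostly regressed 2-13% versus 2.6.32" mix throughput and time results.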
* [MMTests] Network performance
@ 2012-06-29 11:22 Mel Gorman
From: Mel Gorman @ 2012-06-29 11:22 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, netdev

Configuration: global-dhp__network-performance
Benchmarks:    netperf-udp, netperf-tcp, tbench4

Summary
=======
Some tests look good, but the netperf-tcp tests show a number of problems.

Benchmark notes
===============
netperf used the TCP_STREAM or UDP_STREAM tests. Server and client were
bound to CPU 0 and 1 respectively. To improve the chances of getting an
accurate reading, "-i 50,6 -I 99,1" was specified on the command line.
Personally I tend to find netperf figures a bit unreliable; they can vary
depending on the exact starting conditions. This might be due to the test
being run against localhost, or because there is no other machine activity
to smooth outliers related to cache coloring. Suggestions on how to
mitigate this are welcome.

tbench was from dbench 4 and ran for 3 minutes.

===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__network-performance/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Ok, but netperf-tcp has problems
===========================================================

netperf-udp
-----------
For the most part, this looks good. 2.6.34 and 3.2.9 were both bad kernels
for some reason, but currently it looks fine.

netperf-tcp
-----------
This is less healthy; it looks like there is a fairly consistent regression
of 2-5%.

tbench4
-------
Some of these tests failed to run, and the logs are unclear as to why; this
only happened on this machine and it's only now that I noticed. While
results are looking ok now, there were some regressions from 3.0 until 3.2
kernels that might be of concern to -stable users.
==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Ok, but netperf-tcp has problems
==========================================================

netperf-udp
-----------
This is looking great. There was a high in 3.1 that has been lost since,
but it's still better overall in comparison to 2.6.32.

netperf-tcp
-----------
This is less healthy, with a lot of regression. 3.4 has mostly regressed to
the tune of 2-13% versus 2.6.32.

tbench4
-------
For the most part, this is looking ok. 2 clients seems to be particularly
problematic for some reason, but otherwise it looks good.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Bad; tbench is ok but otherwise poor
==========================================================

netperf-udp
-----------
This is not a happy story. There was a big drop between 3.2 and 3.3, and
the regression is still there in comparison to 2.6.32.

netperf-tcp
-----------
This has consistently regressed since 2.6.34, with the regression very
roughly around the 10% mark.

tbench4
-------
Unlike the other tests, this is looking reasonably good, with performance
gains until the number of clients gets really high. It was interesting to
note that 2.6.34 was a particularly good kernel for tbench; while current
kernels are better than 2.6.32, they are not as good as 2.6.34.

-- 
Mel Gorman
SUSE Labs
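For readers unfamiliar with netperf's confidence options: "-i 50,6 -I 99,1"
asks netperf to iterate (between 6 and 50 times) until it is 99% confident
the measured mean lies within a 1%-wide interval. The underlying check is
roughly the following (my sketch with made-up samples; a normal
approximation stands in for netperf's t-distribution tables):

```python
# Illustration of the kind of check netperf's "-I 99,1" performs: keep
# iterating until the confidence interval around the mean throughput is
# narrow enough relative to the mean. Samples are invented, and a normal
# quantile approximates the t value netperf would use.
import math

Z_99 = 2.576  # approximate two-sided 99% normal quantile

def ci_width_percent(samples):
    """Width of a ~99% confidence interval as a percentage of the mean."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half = Z_99 * math.sqrt(var / n)
    return 2.0 * half / mean * 100.0

if __name__ == "__main__":
    throughput = [941.2, 939.8, 940.5, 941.0, 940.1, 940.7]  # Mbit/s, made up
    width = ci_width_percent(throughput)
    verdict = "good enough" if width <= 1.0 else "keep iterating"
    print("interval width: %.3f%% of mean -> %s" % (width, verdict))
```

This also hints at why the figures still fluctuate: if the run hits the
iteration cap before the interval narrows, netperf reports the mean anyway.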
* [MMTests] Network performance @ 2012-06-29 11:22 ` Mel Gorman 0 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-06-29 11:22 UTC (permalink / raw) To: linux-mm; +Cc: linux-kernel, netdev Configuration: global-dhp__network-performance Benchmarks: netperf-udp, netperf-tcp, tbench4 Summary ======= Some tests look good but netperf-tcp tests show a number of problems. Benchmark notes =============== netperf used the TCP_STREAM or UDP_STREAM tests. Server and client were bound to CPU 0 and 1 respectively. To improve the chances of getting an accurate reading "-i 50,6 -I 99,1" was specified on the command line. Personally I tend to find netperf figures a bit unreliable and can vary depending on the exact starting conditions. This might be due to the test being run against localhost or because there is no other machine activity to smooth outliers related to cache coloring. Suggestions on how to mitigate this are welcome. tbench was from dbench 4 and ran for 3 minutes. =========================================================== Machine: arnold Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__network-performance/arnold/comparison.html Arch: x86 CPUs: 1 socket, 2 threads Model: Pentium 4 Disk: Single Rotary Disk Status: Ok, but netperf-tcp has problems =========================================================== netperf-udp ----------- For the most part, this looks good. 2.6.34 and 3.2.9 were both bad kernels for some reason but currently it looks fine. I tend to find that netperf figures fluctuate easily and t netperf-tcp ----------- This is less healthy, it looks like there is a fairly consistent regression of 2-5%. tbench4 ------- Some of these tests failed to run and the logs are unclear as to why but only happened on this machine. It's only now that I noticed. While results are looking ok now, there were some regressions for 3.0 until 3.2 kernels that might be of concern to -stable users. 
==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Ok, but netperf-tcp has problems
==========================================================

netperf-udp
-----------

This is looking great. There was a high in 3.1 that has since been lost, but
it is still better overall in comparison to 2.6.32.

netperf-tcp
-----------

This is less healthy with a lot of regression. 3.4 has mostly regressed to
the tune of 2-13% versus 2.6.32.

tbench4
-------

For the most part, this is looking ok. 2 clients seems to be particularly
problematic for some reason but otherwise it looks good.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Bad; tbench is ok but otherwise poor
==========================================================

netperf-udp
-----------

This is not a happy story. There was a big drop between 3.2 and 3.3 and the
regression is still there in comparison to 2.6.32.

netperf-tcp
-----------

This has consistently regressed since 2.6.34, with the regression very
roughly around the 10% mark.

tbench4
-------

Unlike the other tests, this is looking reasonably good with performance
gains until the number of clients gets really high. It was interesting to
note that 2.6.34 was a particularly good kernel for tbench and while current
kernels are better than 2.6.32, they are not as good as 2.6.34.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/
* [MMTests] IO metadata on ext3 @ 2012-06-29 11:23 ` Mel Gorman  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-06-29 11:23 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-metadata-ext3
Benchmarks:    dbench3, fsmark-single, fsmark-threaded

Summary
=======

While the fsmark figures look ok, fsmark in single-threaded mode has a small
number of large outliers towards the min end of the scale. The resulting
standard deviation fuzzes the results, but filtering is not necessarily the
best answer. A similar effect is visible when running in threaded mode,
except that there is clustering around some values that could be presented
better. The arithmetic mean is unsuitable for this sort of data.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

dbench3 was chosen as it is metadata intensive.

o Duration was 180 seconds
o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

As noted in MMTests itself, dbench3 can behave like a random number
generator, particularly when run in asynchronous mode. Even with these
limitations it can be useful as an early warning system, and as it is still
used by QA teams it is worth keeping an eye on.

FSMark

o Parallel directories were used
o 1 thread per CPU
o 0 filesize
o 225 directories
o 22500 files per directory
o 50000 files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-9227/1 -D 225 -N 22500 -n 50000 -L 15 -S0 -s 0
Thread: ./fs_mark -d /tmp/fsmark-9407/1 -d /tmp/fsmark-9407/2 -D 225 -N 22500 -n 25000 -L 15 -S0 -s 0

FSMark is a more realistic indicator of metadata-intensive workloads.
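The summary's point about low outliers fuzzing the standard deviation can be
demonstrated with a small sketch. The sample values below are invented for
illustration (15 iterations of a files/sec metric, with two low outliers),
and the `trimmed_mean` helper is a hypothetical name, not part of MMTests:

```python
from statistics import mean, median, stdev

# Illustrative files/sec samples: 15 iterations where two low outliers
# (e.g. iterations hit by writeback stalls) drag down the mean and
# inflate the standard deviation. Values are made up for demonstration.
samples = [2500, 2480, 2510, 2495, 2505, 2490, 2515, 2485,
           2500, 2498, 2502, 2493, 2507, 900, 1100]

def trimmed_mean(data, proportion=0.15):
    """Mean after dropping the top and bottom `proportion` of samples."""
    data = sorted(data)
    k = int(len(data) * proportion)
    return mean(data[k:len(data) - k])

print(f"mean:         {mean(samples):.0f}")
print(f"median:       {median(samples):.0f}")
print(f"trimmed mean: {trimmed_mean(samples):.0f}")
print(f"stdev:        {stdev(samples):.0f}")
```

The median and trimmed mean sit near the bulk of the samples while the plain
mean and stdev are dominated by the two outliers, which is why filtering
changes the picture so much; it also shows why filtering silently, without
justification, would hide the inconsistent throughput the summary warns
about.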
===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Fine
===========================================================

dbench
------

For single clients, we're doing reasonably well. There was a big spike for
large numbers of clients in 2.6.34 and to a lesser extent in 2.6.39.4, but
much of this is due to the operations taking place in memory without
reaching disk. There were also fairness issues, and the indicated throughput
figures are far higher than the disk's capabilities, so I do not consider
this to be a regression. There was a mild dip for 3.2.x and 3.3.x that has
been recovered somewhat in 3.4. As this was when IO-less dirty throttling
got merged it is hardly a surprise, and a dip in dbench is not worth backing
that out for.

Recent kernels appear to cope worse with large numbers of clients. However,
I very strongly suspect this is due to improved fairness in IO: the high
throughput figures are due to one client making an unfair amount of progress
while other clients stall.

fsmark-single
-------------

Again, this is looking good. Files/sec has improved slightly with the
exception of a dip in 3.2 and 3.3, which again may be due to IO-less dirty
throttling. I have a slight concern with the overhead measurements.
Somewhere between 3.0.23 and 3.1.10 the overhead started deviating a lot
more. Ideally this should be bisected because it is difficult to blame
IO-less throttling with any certainty.

fsmark-threaded
---------------

Looks better, but due to high deviations it's hard to be 100% sure. If this
is of interest then the thing to do is a proper test of whether the results
are statistically significant, although with 15 samples it will still be
fuzzy. Right now, there is little to be concerned about.
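The "proper measurement of whether the results are significant" mentioned
for fsmark-threaded could be as simple as Welch's t-test between the 15
samples of two kernels. A minimal stdlib-only sketch, with invented sample
values (MMTests itself does not do this; the helper name is hypothetical):

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Illustrative: 15 files/sec samples per kernel (values are made up).
old = [2400, 2420, 2390, 2410, 2430, 2405, 2395, 2415,
       2425, 2400, 2410, 2385, 2420, 2405, 2412]
new = [2450, 2470, 2440, 2465, 2455, 2460, 2445, 2475,
       2450, 2468, 2458, 2442, 2472, 2461, 2449]

t, df = welch_t(new, old)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A |t| well above roughly 2 suggests the difference is unlikely to be noise;
with only 15 samples per side, borderline values deserve the suspicion the
report gives them.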
==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Fine
==========================================================

dbench
------

Very similar story to the arnold machine. Big spike in 2.6.34 and then it
drops off.

fsmark-single
-------------

Nothing really notable here. Deviations are too high to draw reasonable
conclusions from. Looking at the raw results, this is due to a small number
of low outliers. These could be filtered, but that would mask the fact that
throughput is not consistent, so strong justification would be required.

fsmark-threaded
---------------

Similar to fsmark-single. The figures look ok but large deviations are a
problem. Unlike the single-threaded case, the raw data shows that we cluster
around two points that are very far apart from each other. It is worth
investigating whether this can be presented in some sensible manner, such as
k-means clustering, because the arithmetic mean is close to meaningless for
this sort of data.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Fine
==========================================================

dbench
------

Same as the other two machines, complete with spikes.

fsmark-single
-------------

Other than overhead going crazy in 3.2, there is nothing notable here
either. As with hydra, there are a small number of outliers that result in
large deviations.

fsmark-threaded
---------------

Similar to hydra. The figures look basically ok but deviations are high,
with some clustering going on.

-- 
Mel Gorman
SUSE Labs
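The bimodal behaviour described for hydra's fsmark-threaded results, where
samples cluster around two widely separated points, is exactly the case
where k-means does better than an arithmetic mean. A minimal 1-D, k=2
sketch with invented sample values (not MMTests code):

```python
from statistics import mean

def two_means_1d(data, iters=20):
    """Minimal 1-D k-means with k=2: split samples around two centroids."""
    lo, hi = min(data), max(data)
    for _ in range(iters):
        a = [x for x in data if abs(x - lo) <= abs(x - hi)]
        b = [x for x in data if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(a), mean(b)
    return sorted((lo, hi)), (a, b)

# Illustrative bimodal files/sec samples clustering around two modes.
samples = [1500, 1520, 1480, 1510, 1495, 2900, 2920, 2880, 2910, 2895]
(c1, c2), _ = two_means_1d(samples)
print(f"cluster centres: {c1:.0f} and {c2:.0f}; overall mean: {mean(samples):.0f}")
```

The overall mean lands between the two modes, near neither, which is why
reporting the two cluster centres would present this data more sensibly.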
* [MMTests] IO metadata on ext4 @ 2012-06-29 11:24 ` Mel Gorman  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-06-29 11:24 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-metadata-ext4
Benchmarks:    dbench3, fsmark-single, fsmark-threaded

Summary
=======

For the most part the figures currently look ok. However, a number of tests
show that we have declined since 3.0 in a number of areas. Some machines
show performance drops in the 3.2 and 3.3 kernels that have not been fully
recovered.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem.
No special mkfs or mount options were used.

dbench3 was chosen as it is metadata intensive.

o Duration was 180 seconds
o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

As noted in MMTests itself, dbench3 can behave like a random number
generator, particularly when run in asynchronous mode. Even with these
limitations it can be useful as an early warning system, and as it is still
used by QA teams it is worth keeping an eye on.

FSMark

o Parallel directories were used
o 1 thread per CPU
o 0 filesize
o 225 directories
o 22500 files per directory
o 50000 files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-9227/1 -D 225 -N 22500 -n 50000 -L 15 -S0 -s 0
Thread: ./fs_mark -d /tmp/fsmark-9407/1 -d /tmp/fsmark-9407/2 -D 225 -N 22500 -n 25000 -L 15 -S0 -s 0

FSMark is a more realistic indicator of metadata-intensive workloads.
===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Fine, but fsmark has declined since 3.0
===========================================================

dbench
------

For single clients, we're doing reasonably well and this has been consistent
with each release.

fsmark-single
-------------

This is not as happy a story. Variations are quite high, but 3.0 was a
reasonably good kernel and we've been declining ever since, with 3.4 being
marginally worse than 2.6.32.

fsmark-threaded
---------------

The trends are very similar to fsmark-single. 3.0 was reasonably good but we
have degraded since and are at approximately 2.6.32 levels.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Fine, but fsmark has declined since 3.0
==========================================================

dbench
------

Unlike arnold, this is looking good with solid gains in most kernels for the
single-threaded case. The exception was 3.2.9, which saw a big dip that was
recovered in 3.3. For higher numbers of clients the figures still look good.
It is not clear why there is such a difference between arnold and hydra for
the single-threaded case.

fsmark-single
-------------

This is very similar to arnold in that 3.0 performed best and we have since
declined back to more or less the same level as 2.6.32.

fsmark-threaded
---------------

Performance here is flat in terms of throughput. 3.4 recorded much higher
overhead, but it is not clear whether this is a cause for concern.
==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext4/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Fine, but there have been recent declines
==========================================================

dbench
------

Like hydra, this is looking good with solid gains in most kernels for the
single-threaded case. The same dip in 3.2.9 is visible, but unlike hydra it
was not recovered until 3.4. Higher numbers of clients generally look good
as well, although it is interesting that the dip in 3.2.9 is not
consistently visible.

fsmark-single
-------------

Overhead went crazy in 3.3 and there is a large drop in files/sec in 3.3
as well.

fsmark-threaded
---------------

The trends are similar to the single-threaded case. Looking reasonably good,
but there is a dip in 3.3 that has not been recovered and overhead is
higher.

-- 
Mel Gorman
SUSE Labs
* [MMTests] IO metadata on XFS @ 2012-06-29 11:25 ` Mel Gorman  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-06-29 11:25 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel, xfs

Configuration: global-dhp__io-metadata-xfs
Benchmarks:    dbench3, fsmark-single, fsmark-threaded

Summary
=======

Most of the figures look good and in general XFS has performed consistently
well. However, fsmark-single shows a severe performance dip in a few cases
somewhere between 3.1 and 3.4. fs-mark running a single thread took a
particularly bad dive in 3.4 on two machines, which is worth examining
closer. Unfortunately it is harder to draw easy conclusions, as the
gains/losses are not consistent between machines; this may be related to the
available number of CPU threads.

Benchmark notes
===============

mkfs was run on system startup. mkfs parameters were -f -d agcount=8. Mount
options were inode64,delaylog,logbsize=262144,nobarrier for the most part.
On kernels too old to support it, delaylog was removed. On kernels where it
was the default, it was specified anyway and the warning ignored.

dbench3 was chosen as it is metadata intensive.

o Duration was 180 seconds
o OSYNC, OSYNC_DIRECTORY and FSYNC were all off

As noted in MMTests itself, dbench3 can behave like a random number
generator, particularly when run in asynchronous mode. Even with these
limitations it can be useful as an early warning system, and as it is still
used by QA teams it is worth keeping an eye on.

FSMark

o Parallel directories were used
o 1 thread per CPU
o 0 filesize
o 225 directories
o 22500 files per directory
o 50000 files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-9227/1 -D 225 -N 22500 -n 50000 -L 15 -S0 -s 0
Thread: ./fs_mark -d /tmp/fsmark-9407/1 -d /tmp/fsmark-9407/2 -D 225 -N 22500 -n 25000 -L 15 -S0 -s 0

FSMark is a more realistic indicator of metadata-intensive workloads.
===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Great
===========================================================

dbench
------

XFS is showing steady improvements, with a large gain for single clients in
2.6.39 that has been more or less retained since. This is also true for
higher numbers of clients, although 64 clients was suspiciously poor even
though 128 clients looked better. I didn't re-examine the raw data to see
why. In general, dbench is looking very good.

fsmark-single
-------------

Again, this is looking good. Files/sec has improved slightly with the
exception of a small dip in 3.2 and 3.3, which may be due to IO-less dirty
throttling. Overhead measurements are a bit all over the place. It is not
clear whether this is cause for concern.

fsmark-threaded
---------------

Improved since 2.6.32 and has been steadily good for some time. Overhead
measurements are all over the place. Again, it is not clear whether this is
a cause for concern.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Ok
==========================================================

dbench
------

The results here look very different to the arnold machine. This is curious
because the disks have similar size and performance characteristics. It is
doubtful that the difference is due to 32-bit versus 64-bit architectures.
The discrepancy may be more due to the different number of CPUs and how XFS
does locking. One possibility is that fewer CPUs has the side-effect of
better batching of some operations, but this is a guess.
Figures show that throughput is worse and highly variable in 3.4 for single
clients. For higher numbers of clients the figures look better overall.
There was a dip in 3.1-based kernels, though, for an unknown reason. This
does not exactly correlate with the ext3 figures, although they showed a dip
in performance at 3.2.

fsmark-single
-------------

While performance is better than 2.6.32, there was a dip in performance in
3.3 and a very large dip in 3.4.

fsmark-threaded
---------------

The same dip in 3.4 is visible when multiple threads are used, but it is not
as severe.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Fine
==========================================================

dbench
------

As seen on other filesystems, this data shows that there was a large dip in
performance around 3.2 for single threads. Unlike the hydra machine, this
was recovered in 3.4. As higher numbers of threads are used, the gains and
losses are inconsistent, making it hard to draw a solid conclusion.

fsmark-single
-------------

This was doing great until 3.4, where there is a large drop.

fsmark-threaded
---------------

Unlike the single-threaded case, things are looking great here.

-- 
Mel Gorman
SUSE Labs
* [MMTests] IO metadata on XFS @ 2012-06-29 11:25 ` Mel Gorman 0 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-06-29 11:25 UTC (permalink / raw) To: linux-mm; +Cc: linux-fsdevel, linux-kernel, xfs Configuration: global-dhp__io-metadata-xfs Benchmarks: dbench3, fsmark-single, fsmark-threaded Summary ======= Most of the figures look good and in general there has been consistent good performance from XFS. However, fsmark-single is showing a severe performance dip in a few cases somewhere between 3.1 and 3.4. fs-mark running a single thread took a particularly bad dive in 3.4 for two machines that is worth examining closer. Unfortunately it is harder to easy conclusions as the gains/losses are not consistent between machines which may be related to the available number of CPU threads. Benchmark notes =============== mkfs was run on system startup. mkfs parameters -f -d agcount=8 mount options inode64,delaylog,logbsize=262144,nobarrier for the most part. On kernels to old to support delaylog was removed. On kernels where it was the default, it was specified and the warning ignored. dbench3 was chosen as it's metadata intensive. o Duration was 180 seconds o OSYNC, OSYNC_DIRECTORY and FSYNC were all off As noted in the MMTests, dbench3 can be a random number generator particularly when run in asynchronous mode. Even with the limitations, it can be useful as an early warning system and as it's still used by QA teams it's still worth keeping an eye on. FSMark o Parallel directories were used o 1 Thread per CPU o 0 Filesize o 225 directories o 22500 files per directory o 50000 files per iteration o 15 iterations Single: ./fs_mark -d /tmp/fsmark-9227/1 -D 225 -N 22500 -n 50000 -L 15 -S0 -s 0 Thread: ./fs_mark -d /tmp/fsmark-9407/1 -d /tmp/fsmark-9407/2 -D 225 -N 22500 -n 25000 -L 15 -S0 -s 0 FSMark is a more realistic indicator of metadata intensive workloads. 
=========================================================== Machine: arnold Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/arnold/comparison.html Arch: x86 CPUs: 1 socket, 2 threads Model: Pentium 4 Disk: Single Rotary Disk Status: Great =========================================================== dbench ------ XFS is showing steady improvements with a large gain for single client in 2.6.39 and more or less retained since then. This is also true for higher number of clients although 64 clients was suspiciously poor even though 128 clients looked better. I didn't re-examine the raw data to see why. In general, dbench is looking very good. fsmark-single ------------- Again, this is looking good. Files/sec has improved slightly with the exception of a small dip in 3.2 and 3.3 which may be due to IO-Less dirty throttling. Overhead measurements are a bit all over the place. Not clear if this is cause for concern or not. fsmark-threaded --------------- Improved since 2.6.32 and has been steadily good for some time. Overhead measurements are all over the place. Again, not clear if this is a cause for concern. ========================================================== Machine: hydra Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html Arch: x86-64 CPUs: 1 socket, 4 threads Model: AMD Phenom II X4 940 Disk: Single Rotary Disk Status: Ok ========================================================== dbench ------ The results here look very different to the arnold machine. This is curious because the disks have similar size and performance characteristics. It is doubtful that the difference is between 32 bit and 64 bit architectures. The discrepency may be more due to the different number of CPUs and how XFS does locking. One possibility is that fewer CPUs has the side-effect of better batching of some operations but this is a case. 
The figures show that throughput is worse and highly variable in 3.4 for single clients. For higher numbers of clients the figures look better overall, although there was a dip in 3.1-based kernels for an unknown reason. This does not exactly correlate with the ext3 figures, although those showed a dip in performance at 3.2.

fsmark-single
-------------
While performance is better than 2.6.32, there was a dip in performance in 3.3 and a very large dip in 3.4.

fsmark-threaded
---------------
The same dip in 3.4 is visible when multiple threads are used, but it is not as severe.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Fine
==========================================================

dbench
------
As seen on other filesystems, this data shows a large dip in performance around 3.2 for single threads. Unlike on the hydra machine, this was recovered in 3.4. As higher numbers of threads are used, the gains and losses are inconsistent, making it hard to draw a solid conclusion.

fsmark-single
-------------
This was doing great until 3.4, where there is a large drop.

fsmark-threaded
---------------
Unlike the single-threaded case, things are looking great here.

--
Mel Gorman
SUSE Labs
* Re: [MMTests] IO metadata on XFS 2012-06-29 11:25 ` Mel Gorman (?) @ 2012-07-01 23:54 ` Dave Chinner -1 siblings, 0 replies; 108+ messages in thread From: Dave Chinner @ 2012-07-01 23:54 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel, xfs

On Fri, Jun 29, 2012 at 12:25:06PM +0100, Mel Gorman wrote:
> Configuration: global-dhp__io-metadata-xfs
> Benchmarks: dbench3, fsmark-single, fsmark-threaded
>
> Summary
> =======
> Most of the figures look good and in general there has been consistent good
> performance from XFS. However, fsmark-single is showing a severe performance
> dip in a few cases somewhere between 3.1 and 3.4. fs-mark running a single
> thread took a particularly bad dive in 3.4 for two machines that is worth
> examining closer.

That will be caused by the fact we changed all the metadata updates to be logged, which means a transaction every time .dirty_inode is called.

This should mostly go away when XFS is converted to use .update_time rather than .dirty_inode to only issue transactions when the VFS updates the atime rather than on every .dirty_inode call...

> Unfortunately it is harder to easy conclusions as the
> gains/losses are not consistent between machines which may be related to
> the available number of CPU threads.

It increases the CPU overhead (dirty_inode can be called up to 4 times per write(2) call, IIRC), so with limited numbers of threads/limited CPU power it will result in lower performance. Where you have lots of CPU power, there will be little difference in performance...

Cheers, Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [MMTests] IO metadata on XFS 2012-07-01 23:54 ` Dave Chinner (?) @ 2012-07-02 6:32 ` Christoph Hellwig -1 siblings, 0 replies; 108+ messages in thread From: Christoph Hellwig @ 2012-07-02 6:32 UTC (permalink / raw) To: Dave Chinner; +Cc: Mel Gorman, linux-mm, linux-kernel, linux-fsdevel, xfs

On Mon, Jul 02, 2012 at 09:54:58AM +1000, Dave Chinner wrote:
> That will be caused by the fact we changed all the metadata updates
> to be logged, which means a transaction every time .dirty_inode is
> called.
>
> This should mostly go away when XFS is converted to use .update_time
> rather than .dirty_inode to only issue transactions when the VFS
> updates the atime rather than every .dirty_inode call...

I think the patch to do that conversion still needs review..

> It increases the CPU overhead (dirty_inode can be called up to 4
> times per write(2) call, IIRC), so with limited numbers of
> threads/limited CPU power it will result in lower performance. Where
> you have lots of CPU power, there will be little difference in
> performance...

When I checked it, it could only be called twice, and we'd already optimize away the second call. I'd definitively like to track down where the performance changes happened, at least to a major version but even better to a -rc or git commit.
* Re: [MMTests] IO metadata on XFS 2012-07-02 6:32 ` Christoph Hellwig (?) @ 2012-07-02 14:32 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-02 14:32 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs

On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > It increases the CPU overhead (dirty_inode can be called up to 4
> > times per write(2) call, IIRC), so with limited numbers of
> > threads/limited CPU power it will result in lower performance. Where
> > you have lots of CPU power, there will be little difference in
> > performance...
>
> When I checked it it could only be called twice, and we'd already
> optimize away the second call. I'd defintively like to track down where
> the performance changes happend, at least to a major version but even
> better to a -rc or git commit.
>

By all means feel free to run the test yourself and run the bisection :)

It's rare, but on this occasion the test machine is idle, so I started an automated git bisection. As you know, the mileage with an automated bisect varies, so it may or may not find the right commit. The test machine is sandy, so http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html is the report of interest. The script is doing a full search between v3.3 and v3.4 for a point where average files/sec for fsmark-single drops below 25000. I did not limit the search to fs/xfs on the off-chance that an apparently unrelated patch caused the problem.

--
Mel Gorman
SUSE Labs
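The pass/fail rule driving such an automated bisection can be sketched as a small shell helper. This is a simplified illustration of the decision rule only, not the actual MMTests bisection driver; the `classify` helper name is an invention, and the threshold is the 25000 files/sec cut-off mentioned above.

```shell
#!/bin/sh
# Sketch of the automated-bisect predicate described above. Not the real
# MMTests driver; only the good/bad decision rule is illustrated.
THRESHOLD=25000

# classify <mean-files-per-sec>: print the verdict git bisect would use.
classify() {
    if [ "$1" -lt "$THRESHOLD" ]; then
        echo bad
    else
        echo good
    fi
}

classify 16536   # known-bad figure from the report: prints "bad"
classify 34757   # known-good figure: prints "good"
```

A real driver would build and boot the candidate kernel, run fsmark-single, parse the mean files/sec, and feed the verdict to `git bisect run`.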
* Re: [MMTests] IO metadata on XFS 2012-07-02 14:32 ` Mel Gorman (?) @ 2012-07-02 19:35 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-02 19:35 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

Adding dri-devel and a few others because an i915 patch contributed to the regression.

On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > times per write(2) call, IIRC), so with limited numbers of
> > > threads/limited CPU power it will result in lower performance. Where
> > > you have lots of CPU power, there will be little difference in
> > > performance...
> >
> > When I checked it it could only be called twice, and we'd already
> > optimize away the second call. I'd defintively like to track down where
> > the performance changes happend, at least to a major version but even
> > better to a -rc or git commit.
>
> By all means feel free to run the test yourself and run the bisection :)
>
> It's rare but on this occasion the test machine is idle so I started an
> automated git bisection. As you know the milage with an automated bisect
> varies so it may or may not find the right commit. Test machine is sandy so
> http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> is the report of interest. The script is doing a full search between v3.3 and
> v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> I did not limit the search to fs/xfs on the off-chance that it is an
> apparently unrelated patch that caused the problem.

It was obvious very quickly that there were two distinct regressions, so I ran two bisections. One led to an XFS patch and the other to an i915 patch that enables RC6 to reduce power usage.

[c999a223: xfs: introduce an allocation workqueue]
[aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

gdm was running on the machine, so i915 would have been in use.

In case it is of interest, this is the log of the bisection. Lines beginning with # are notes I made; all other lines are from the bisection script. The second-last column is the files/sec recorded by fsmark.

# MARK v3.3..v3.4 Search for BAD files/sec -lt 28000
# BAD 16536
# GOOD 34757
Mon Jul  2 15:46:13 IST 2012 sandy xfsbisect 141124c02059eee9dbc5c86ea797b1ca888e77f7 37454 good
Mon Jul  2 15:56:06 IST 2012 sandy xfsbisect 55a320308902f7a0746569ee57eeb3f254e6ed16 25192 bad
Mon Jul  2 16:08:34 IST 2012 sandy xfsbisect 281b05392fc2cb26209b4d85abaf4889ab1991f3 38807 good
Mon Jul  2 16:18:02 IST 2012 sandy xfsbisect a8364d5555b2030d093cde0f07951628e55454e1 37553 good
Mon Jul  2 16:27:22 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 36676 good
Mon Jul  2 16:36:48 IST 2012 sandy xfsbisect 2e7580b0e75d771d93e24e681031a165b1d31071 37756 good
Mon Jul  2 16:46:36 IST 2012 sandy xfsbisect 532bfc851a7475fb6a36c1e953aa395798a7cca7 25416 bad
Mon Jul  2 16:56:10 IST 2012 sandy xfsbisect 0c9aac08261512d70d7d4817bd222abca8b6bdd6 38486 good
Mon Jul  2 17:05:40 IST 2012 sandy xfsbisect 0fc9d1040313047edf6a39fd4d7c7defdca97c62 37970 good
Mon Jul  2 17:16:01 IST 2012 sandy xfsbisect 5a5881cdeec2c019b5c9a307800218ee029f7f61 24493 bad
Mon Jul  2 17:21:15 IST 2012 sandy xfsbisect f616137519feb17b849894fcbe634a021d3fa7db 24405 bad
Mon Jul  2 17:26:16 IST 2012 sandy xfsbisect 5575acc7807595687288b3bbac15103f2a5462e1 37336 good
Mon Jul  2 17:31:25 IST 2012 sandy xfsbisect c999a223c2f0d31c64ef7379814cea1378b2b800 24552 bad
Mon Jul  2 17:36:34 IST 2012 sandy xfsbisect 1a1d772433d42aaff7315b3468fef5951604f5c6 36872 good
# c999a223c2f0d31c64ef7379814cea1378b2b800 is the first bad commit
# [c999a223: xfs: introduce an allocation workqueue]
#
# MARK c999a223c2f0d31c64ef7379814cea1378b2b800..v3.4 Search for BAD files/sec -lt 20000
# BAD 16536
# GOOD 24552
Mon Jul  2 17:48:39 IST 2012 sandy xfsbisect b2094ef840697bc8ca5d17a83b7e30fad5f1e9fa 37435 good
Mon Jul  2 17:58:12 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 38303 good
Mon Jul  2 18:08:18 IST 2012 sandy xfsbisect 5d32c88f0b94061b3af2e3ade92422407282eb12 16718 bad
Mon Jul  2 18:18:02 IST 2012 sandy xfsbisect 2f7fa1be66dce77608330c5eb918d6360b5525f2 24964 good
Mon Jul  2 18:24:14 IST 2012 sandy xfsbisect 923f79743c76583ed4684e2c80c8da51a7268af3 24963 good
Mon Jul  2 18:33:49 IST 2012 sandy xfsbisect b61c37f57988567c84359645f8202a7c84bc798a 24824 good
Mon Jul  2 18:40:20 IST 2012 sandy xfsbisect 20a2a811602b16c42ce88bada3d52712cdfb988b 17155 bad
Mon Jul  2 18:50:12 IST 2012 sandy xfsbisect 78fb72f7936c01d5b426c03a691eca082b03f2b9 38494 good
Mon Jul  2 19:00:24 IST 2012 sandy xfsbisect e1a7eb08ee097e97e928062a242b0de5b2599a11 25033 good
Mon Jul  2 19:10:24 IST 2012 sandy xfsbisect 97effadb65ed08809e1720c8d3ee80b73a93665c 16520 bad
Mon Jul  2 19:16:16 IST 2012 sandy xfsbisect 25e341cfc33d94435472983825163e97fe370a6c 16748 bad
Mon Jul  2 19:21:52 IST 2012 sandy xfsbisect 7dd4906586274f3945f2aeaaa5a33b451c3b4bba 24957 good
Mon Jul  2 19:27:35 IST 2012 sandy xfsbisect aa46419186992e6b8b8010319f0ca7f40a0d13f5 17088 bad
Mon Jul  2 19:32:54 IST 2012 sandy xfsbisect 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d 25667 good
# aa46419186992e6b8b8010319f0ca7f40a0d13f5 is the first bad commit
# [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

I tested plain reverts of the patches individually and together and got the following results.

FS-Mark Single Threaded
                               3.4.0               3.4.0               3.4.0               3.4.0
                             vanilla     revert-aa464191     revert-c999a223         revert-both
Files/s  min       14176.40 ( 0.00%)   17830.60 ( 25.78%)   24186.70 ( 70.61%)   25108.00 ( 77.11%)
Files/s  mean      16783.35 ( 0.00%)   25029.69 ( 49.13%)   37513.72 (123.52%)   38169.97 (127.43%)
Files/s  stddev     1007.26 ( 0.00%)    2644.87 (162.58%)    5344.99 (430.65%)    5599.65 (455.93%)
Files/s  max       18475.40 ( 0.00%)   27966.10 ( 51.37%)   45564.60 (146.62%)   47918.10 (159.36%)
Overhead min      593978.00 ( 0.00%)  386173.00 ( 34.99%)  253812.00 ( 57.27%)  247396.00 ( 58.35%)
Overhead mean     637782.80 ( 0.00%)  429229.33 ( 32.70%)  322868.20 ( 49.38%)  287141.73 ( 54.98%)
Overhead stddev    72440.72 ( 0.00%)  100056.96 (-38.12%)  175001.08 (-141.58%) 102018.14 (-40.83%)
Overhead max      855637.00 ( 0.00%)  753541.00 ( 11.93%)  880531.00 ( -2.91%)  637932.00 ( 25.44%)

MMTests Statistics: duration
Sys Time Running Test (seconds)         44.06   32.25   24.19   23.99
User+Sys Time Running Test (seconds)    50.19   36.35   27.24   26.7
Total Elapsed Time (seconds)            59.21   44.76   34.95   34.14

Individually reverting either patch makes a difference to both files/sec and overhead. Reverting both is not as dramatic as reverting each individual patch would indicate, but it is still a major improvement.

--
Mel Gorman
SUSE Labs
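The percentage gains in the FS-Mark table are relative to the vanilla kernel, i.e. (new - old) / old * 100. As a sanity check, one figure can be recomputed from the raw values; `pct` here is a throwaway helper for illustration, not part of MMTests.

```shell
#!/bin/sh
# Recompute a percentage gain from the FS-Mark table.
# pct <old> <new> prints (new - old) / old * 100 to two decimal places.
pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new - old) / old * 100 }'
}

pct 16783.35 38169.97   # mean files/s, vanilla vs revert-both: prints 127.43
```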
27966.10 (51.37%) 45564.60 (146.62%) 47918.10 (159.36%) Overhead min 593978.00 ( 0.00%) 386173.00 (34.99%) 253812.00 (57.27%) 247396.00 (58.35%) Overhead mean 637782.80 ( 0.00%) 429229.33 (32.70%) 322868.20 (49.38%) 287141.73 (54.98%) Overhead stddev 72440.72 ( 0.00%) 100056.96 (-38.12%) 175001.08 (-141.58%) 102018.14 (-40.83%) Overhead max 855637.00 ( 0.00%) 753541.00 (11.93%) 880531.00 (-2.91%) 637932.00 (25.44%) MMTests Statistics: duration Sys Time Running Test (seconds) 44.06 32.25 24.19 23.99 User+Sys Time Running Test (seconds) 50.19 36.35 27.24 26.7 Total Elapsed Time (seconds) 59.21 44.76 34.95 34.14 Individually reverting either patch makes a difference to both files/sec and overhead. Reverting both is not as dramatic as reverting each individual patch would indicate but it's still a major improvement. -- Mel Gorman SUSE Labs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 108+ messages in thread
* Re: [MMTests] IO metadata on XFS @ 2012-07-02 19:35 ` Mel Gorman 0 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-02 19:35 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson Adding dri-devel and a few others because an i915 patch contributed to the regression. On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > times per write(2) call, IIRC), so with limited numbers of > > > threads/limited CPU power it will result in lower performance. Where > > > you have lots of CPU power, there will be little difference in > > > performance... > > > > When I checked it it could only be called twice, and we'd already > > optimize away the second call. I'd defintively like to track down where > > the performance changes happend, at least to a major version but even > > better to a -rc or git commit. > > > > By all means feel free to run the test yourself and run the bisection :) > > It's rare but on this occasion the test machine is idle so I started an > automated git bisection. As you know the milage with an automated bisect > varies so it may or may not find the right commit. Test machine is sandy so > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > is the report of interest. The script is doing a full search between v3.3 and > v3.4 for a point where average files/sec for fsmark-single drops below 25000. > I did not limit the search to fs/xfs on the off-chance that it is an > apparently unrelated patch that caused the problem. > It was obvious very quickly that there were two distinct regressions, so I ran two bisections. One led to an XFS patch and the other to an i915 patch that enables RC6 to reduce power usage.
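Mechanically, the automated bisection's good/bad decision is just a threshold test on fsmark's files/sec, mapped to the exit codes that `git bisect run` expects (0 for good, non-zero for bad). A minimal sketch of that classification step; `classify` is a hypothetical helper, and the real MMTests driver script that boots the kernel and runs fsmark is not shown:

```python
# Hypothetical good/bad classifier for an automated "git bisect run"
# harness. The default threshold matches the first bisection pass,
# "Search for BAD files/sec -lt 28000"; the second pass used 20000.
def classify(files_per_sec, bad_below=28000):
    """Return a git-bisect-run style exit code: 0 = good, 1 = bad."""
    return 1 if files_per_sec < bad_below else 0
```

With this in place, the wrapper script would run fsmark on the candidate commit, parse out files/sec, and exit with `classify(result)` so git advances the bisection automatically.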
[c999a223: xfs: introduce an allocation workqueue] [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] gdm was running on the machine so i915 would have been in use. In case it is of interest this is the log of the bisection. Lines beginning with # are notes I made and all other lines are from the bisection script. The second-last column is the files/sec recorded by fsmark. # MARK v3.3..v3.4 Search for BAD files/sec -lt 28000 # BAD 16536 # GOOD 34757 Mon Jul 2 15:46:13 IST 2012 sandy xfsbisect 141124c02059eee9dbc5c86ea797b1ca888e77f7 37454 good Mon Jul 2 15:56:06 IST 2012 sandy xfsbisect 55a320308902f7a0746569ee57eeb3f254e6ed16 25192 bad Mon Jul 2 16:08:34 IST 2012 sandy xfsbisect 281b05392fc2cb26209b4d85abaf4889ab1991f3 38807 good Mon Jul 2 16:18:02 IST 2012 sandy xfsbisect a8364d5555b2030d093cde0f07951628e55454e1 37553 good Mon Jul 2 16:27:22 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 36676 good Mon Jul 2 16:36:48 IST 2012 sandy xfsbisect 2e7580b0e75d771d93e24e681031a165b1d31071 37756 good Mon Jul 2 16:46:36 IST 2012 sandy xfsbisect 532bfc851a7475fb6a36c1e953aa395798a7cca7 25416 bad Mon Jul 2 16:56:10 IST 2012 sandy xfsbisect 0c9aac08261512d70d7d4817bd222abca8b6bdd6 38486 good Mon Jul 2 17:05:40 IST 2012 sandy xfsbisect 0fc9d1040313047edf6a39fd4d7c7defdca97c62 37970 good Mon Jul 2 17:16:01 IST 2012 sandy xfsbisect 5a5881cdeec2c019b5c9a307800218ee029f7f61 24493 bad Mon Jul 2 17:21:15 IST 2012 sandy xfsbisect f616137519feb17b849894fcbe634a021d3fa7db 24405 bad Mon Jul 2 17:26:16 IST 2012 sandy xfsbisect 5575acc7807595687288b3bbac15103f2a5462e1 37336 good Mon Jul 2 17:31:25 IST 2012 sandy xfsbisect c999a223c2f0d31c64ef7379814cea1378b2b800 24552 bad Mon Jul 2 17:36:34 IST 2012 sandy xfsbisect 1a1d772433d42aaff7315b3468fef5951604f5c6 36872 good # c999a223c2f0d31c64ef7379814cea1378b2b800 is the first bad commit # [c999a223: xfs: introduce an allocation workqueue] # # MARK c999a223c2f0d31c64ef7379814cea1378b2b800..v3.4 Search for BAD 
files/sec -lt 20000 # BAD 16536 # GOOD 24552 Mon Jul 2 17:48:39 IST 2012 sandy xfsbisect b2094ef840697bc8ca5d17a83b7e30fad5f1e9fa 37435 good Mon Jul 2 17:58:12 IST 2012 sandy xfsbisect d2a2fc18d98d8ee2dec1542efc7f47beec256144 38303 good Mon Jul 2 18:08:18 IST 2012 sandy xfsbisect 5d32c88f0b94061b3af2e3ade92422407282eb12 16718 bad Mon Jul 2 18:18:02 IST 2012 sandy xfsbisect 2f7fa1be66dce77608330c5eb918d6360b5525f2 24964 good Mon Jul 2 18:24:14 IST 2012 sandy xfsbisect 923f79743c76583ed4684e2c80c8da51a7268af3 24963 good Mon Jul 2 18:33:49 IST 2012 sandy xfsbisect b61c37f57988567c84359645f8202a7c84bc798a 24824 good Mon Jul 2 18:40:20 IST 2012 sandy xfsbisect 20a2a811602b16c42ce88bada3d52712cdfb988b 17155 bad Mon Jul 2 18:50:12 IST 2012 sandy xfsbisect 78fb72f7936c01d5b426c03a691eca082b03f2b9 38494 good Mon Jul 2 19:00:24 IST 2012 sandy xfsbisect e1a7eb08ee097e97e928062a242b0de5b2599a11 25033 good Mon Jul 2 19:10:24 IST 2012 sandy xfsbisect 97effadb65ed08809e1720c8d3ee80b73a93665c 16520 bad Mon Jul 2 19:16:16 IST 2012 sandy xfsbisect 25e341cfc33d94435472983825163e97fe370a6c 16748 bad Mon Jul 2 19:21:52 IST 2012 sandy xfsbisect 7dd4906586274f3945f2aeaaa5a33b451c3b4bba 24957 good Mon Jul 2 19:27:35 IST 2012 sandy xfsbisect aa46419186992e6b8b8010319f0ca7f40a0d13f5 17088 bad Mon Jul 2 19:32:54 IST 2012 sandy xfsbisect 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d 25667 good # aa46419186992e6b8b8010319f0ca7f40a0d13f5 is the first bad commit # [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] I tested plain reverts of the patches individually and together and got the following results FS-Mark Single Threaded 3.4.0 3.4.0 3.4.0 3.4.0-vanilla revert-aa464191 revert-c999a223 revert-both Files/s min 14176.40 ( 0.00%) 17830.60 (25.78%) 24186.70 (70.61%) 25108.00 (77.11%) Files/s mean 16783.35 ( 0.00%) 25029.69 (49.13%) 37513.72 (123.52%) 38169.97 (127.43%) Files/s stddev 1007.26 ( 0.00%) 2644.87 (162.58%) 5344.99 (430.65%) 5599.65 (455.93%) Files/s max 18475.40 ( 0.00%) 
27966.10 (51.37%) 45564.60 (146.62%) 47918.10 (159.36%) Overhead min 593978.00 ( 0.00%) 386173.00 (34.99%) 253812.00 (57.27%) 247396.00 (58.35%) Overhead mean 637782.80 ( 0.00%) 429229.33 (32.70%) 322868.20 (49.38%) 287141.73 (54.98%) Overhead stddev 72440.72 ( 0.00%) 100056.96 (-38.12%) 175001.08 (-141.58%) 102018.14 (-40.83%) Overhead max 855637.00 ( 0.00%) 753541.00 (11.93%) 880531.00 (-2.91%) 637932.00 (25.44%) MMTests Statistics: duration Sys Time Running Test (seconds) 44.06 32.25 24.19 23.99 User+Sys Time Running Test (seconds) 50.19 36.35 27.24 26.7 Total Elapsed Time (seconds) 59.21 44.76 34.95 34.14 Individually reverting either patch makes a difference to both files/sec and overhead. Reverting both is not as dramatic as reverting each individual patch would indicate but it's still a major improvement. -- Mel Gorman SUSE Labs
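For reference, the percentage columns in the FS-Mark table are relative changes against the vanilla kernel: the Files/s rows report the gain over vanilla (higher is better), while the Overhead rows report the reduction (lower is better, hence the flipped sign). A sketch of that arithmetic; this is my reconstruction of the convention, not MMTests' actual reporting code:

```python
def gain(vanilla, kernel):
    """Relative change vs. vanilla, in percent (higher-is-better rows)."""
    return (kernel - vanilla) / vanilla * 100.0

def reduction(vanilla, kernel):
    """Relative reduction vs. vanilla, in percent (lower-is-better rows)."""
    return (vanilla - kernel) / vanilla * 100.0

# Files/s mean, vanilla vs revert-aa464191: reproduces the 49.13% entry.
files_mean_pct = round(gain(16783.35, 25029.69), 2)
# Overhead min, vanilla vs revert-aa464191: reproduces the 34.99% entry.
overhead_min_pct = round(reduction(593978.00, 386173.00), 2)
```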
* Re: [MMTests] IO metadata on XFS 2012-07-02 19:35 ` Mel Gorman (?) @ 2012-07-03 0:19 ` Dave Chinner -1 siblings, 0 replies; 108+ messages in thread From: Dave Chinner @ 2012-07-03 0:19 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: > Adding dri-devel and a few others because an i915 patch contributed to > the regression. > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > > times per write(2) call, IIRC), so with limited numbers of > > > > threads/limited CPU power it will result in lower performance. Where > > > > you have lots of CPU power, there will be little difference in > > > > performance... > > > > > > When I checked it it could only be called twice, and we'd already > > > optimize away the second call. I'd defintively like to track down where > > > the performance changes happend, at least to a major version but even > > > better to a -rc or git commit. > > > > > > > By all means feel free to run the test yourself and run the bisection :) > > > > It's rare but on this occasion the test machine is idle so I started an > > automated git bisection. As you know the milage with an automated bisect > > varies so it may or may not find the right commit. Test machine is sandy so > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > > is the report of interest. The script is doing a full search between v3.3 and > > v3.4 for a point where average files/sec for fsmark-single drops below 25000. > > I did not limit the search to fs/xfs on the off-chance that it is an > > apparently unrelated patch that caused the problem. 
> > > > It was obvious very quickly that there were two distinct regression so I > ran two bisections. One led to a XFS and the other led to an i915 patch > that enables RC6 to reduce power usage. > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] Doesn't seem to be the major cause of the regression. By itself, it has impact, but the majority comes from the XFS change... > [c999a223: xfs: introduce an allocation workqueue] Which indicates that there are workqueue scheduling issues, I think. The same amount of work is being done, but half of it is being pushed off into a workqueue to avoid stack overflow issues (*). I tested the above patch in anger on an 8p machine, similar to the machine you saw no regressions on, but the workload didn't drive it to being completely CPU bound (only about 90%) so the allocation work was probably always scheduled quickly. How many worker threads have been spawned on these machines that are showing the regression? What is the context switch rate on the machines when the test is running? Can you run latencytop to see if there are excessive starvation/wait times for allocation completion? A perf top profile comparison might be informative, too... (*) The stack usage below submit_bio() can be more than 5k (DM, MD, SCSI, driver, memory allocation), so it's really not safe to do allocation anywhere below about 3k of kernel stack being used. e.g. on a relatively trivial storage setup without the above commit: [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left Fundamentally, 8k stacks on x86-64 are too small for our increasingly complex storage layers and the 100+ function deep call chains that occur. Cheers, Dave. -- Dave Chinner david@fromorbit.com
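The suspected mechanism can be mocked up outside the kernel: the allocation work itself is unchanged, but each call now pays a handoff to a worker plus a wait for its completion, so throughput hinges on how quickly the scheduler runs the worker. A userspace analogy using Python threads as a stand-in for kernel workqueues; this illustrates the offload-and-wait pattern only, not the actual XFS code, and kernel workqueues behave differently:

```python
from concurrent.futures import ThreadPoolExecutor

def allocate(size):
    """Stand-in for the allocation work that the XFS patch defers."""
    return list(range(size))  # pretend this is the allocation result

def allocate_inline(size):
    # Old path: the work runs directly in the caller's context.
    return allocate(size)

# New path: hand the identical work to a worker thread and block until
# it completes. The result is the same, but every call now includes a
# submit/wakeup round trip, which is where scheduling latency creeps in
# when the machine is CPU bound.
_workers = ThreadPoolExecutor(max_workers=1)

def allocate_offloaded(size):
    return _workers.submit(allocate, size).result()
```

Both paths return the same value; the difference under load is purely how long the caller sits waiting for the worker to be scheduled, which is what the context-switch and latencytop questions above are probing.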
* Re: [MMTests] IO metadata on XFS 2012-07-03 0:19 ` Dave Chinner (?) @ 2012-07-03 10:59 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-03 10:59 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote: > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: > > Adding dri-devel and a few others because an i915 patch contributed to > > the regression. > > > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > > > times per write(2) call, IIRC), so with limited numbers of > > > > > threads/limited CPU power it will result in lower performance. Where > > > > > you have lots of CPU power, there will be little difference in > > > > > performance... > > > > > > > > When I checked it it could only be called twice, and we'd already > > > > optimize away the second call. I'd defintively like to track down where > > > > the performance changes happend, at least to a major version but even > > > > better to a -rc or git commit. > > > > > > > > > > By all means feel free to run the test yourself and run the bisection :) > > > > > > It's rare but on this occasion the test machine is idle so I started an > > > automated git bisection. As you know the milage with an automated bisect > > > varies so it may or may not find the right commit. Test machine is sandy so > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > > > is the report of interest. The script is doing a full search between v3.3 and > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000. 
> > > I did not limit the search to fs/xfs on the off-chance that it is an > > > apparently unrelated patch that caused the problem. > > > > > > > It was obvious very quickly that there were two distinct regression so I > > ran two bisections. One led to a XFS and the other led to an i915 patch > > that enables RC6 to reduce power usage. > > > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] > > Doesn't seem to be the major cause of the regression. By itself, it > has impact, but the majority comes from the XFS change... > The fact it has an impact at all is weird but lets see what the DRI folks think about it. > > [c999a223: xfs: introduce an allocation workqueue] > > Which indicates that there is workqueue scheduling issues, I think. > The same amount of work is being done, but half of it is being > pushed off into a workqueue to avoid stack overflow issues (*). I > tested the above patch in anger on an 8p machine, similar to the > machine you saw no regressions on, but the workload didn't drive it > to being completely CPU bound (only about 90%) so the allocation > work was probably always scheduled quickly. > What test were you using? > How many worker threads have been spawned on these machines > that are showing the regression? 20 or 21 generally. An example list as spotted by top looks like kworker/0:0 kworker/0:1 kworker/0:2 kworker/1:0 kworker/1:1 kworker/1:2 kworker/2:0 kworker/2:1 kworker/2:2 kworker/3:0 kworker/3:1 kworker/3:2 kworker/4:0 kworker/4:1 kworker/5:0 kworker/5:1 kworker/6:0 kworker/6:1 kworker/6:2 kworker/7:0 kworker/7:1 There were 8 unbound workers. > What is the context switch rate on the machines whenteh test is running? This is vmstat from a vanilla kernel. The actual vmstat is after the --. The information before that is recorded by mmtests to try and detect if there was jitter in the vmstat output. It's showing that there is little or no jitter in this test. 
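The per-kernel context-switch averages quoted after these dumps come from the `cs` column of each sample. A hedged sketch of that reduction, assuming the standard vmstat column order in which `cs` is the 12th field of a sample:

```python
def average_cs(samples):
    """Average the context-switch column (12th vmstat field, index 11)."""
    rates = [int(line.split()[11]) for line in samples]
    return sum(rates) / len(rates)

# Two samples lifted from the vanilla dump below (the fields after "--"):
vanilla = [
    "0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0",
    "1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0",
]
# average_cs(vanilla) gives 567.0 for these two idle samples; run over the
# full dumps it should approach the quoted 4278.53 (vanilla) and 1095 (revert).
```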
VANILLA 1341306582.6713 1.8109 1.8109 -- 0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0 1341306584.6715 3.8112 2.0003 -- 1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0 1341306586.6718 5.8114 2.0003 -- 0 0 0 16047432 11460 104288 0 0 102 45 227 999 0 0 99 1 0 1341306588.6721 7.8117 2.0003 -- 1 0 0 16046944 11460 104292 0 0 0 0 120 663 0 0 100 0 0 1341306590.6723 9.8119 2.0002 -- 0 2 0 16045788 11476 104296 0 0 12 40 190 754 0 0 99 0 0 1341306592.6725 11.8121 2.0002 -- 0 1 0 15990236 12600 141724 0 0 19054 30 1400 2937 2 1 88 9 0 1341306594.6727 13.8124 2.0002 -- 1 0 0 15907628 12600 186360 0 0 1653 0 3117 6406 2 9 88 1 0 1341306596.6730 15.8127 2.0003 -- 0 0 0 15825964 12608 226636 0 0 15 11024 3073 6350 2 9 89 0 0 1341306598.6733 17.8130 2.0003 -- 1 0 0 15730420 12608 271632 0 0 0 3072 3461 7179 2 10 88 0 0 1341306600.6736 19.8132 2.0003 -- 1 0 0 15686200 12608 310816 0 0 0 12416 3093 6198 2 9 89 0 0 1341306602.6738 21.8135 2.0003 -- 2 0 0 15593588 12616 354928 0 0 0 32 3482 7146 2 11 87 0 0 1341306604.6741 23.8138 2.0003 -- 2 0 0 15562032 12616 393772 0 0 0 12288 3129 6330 2 10 89 0 0 1341306606.6744 25.8140 2.0002 -- 1 0 0 15458316 12624 438004 0 0 0 26 3471 7107 2 11 87 0 0 1341306608.6746 27.8142 2.0002 -- 1 0 0 15432024 12624 474244 0 0 0 12416 3011 6017 1 10 89 0 0 1341306610.6749 29.8145 2.0003 -- 2 0 0 15343280 12624 517696 0 0 0 24 3393 6826 2 11 87 0 0 1341306612.6751 31.8148 2.0002 -- 1 0 0 15311136 12632 551816 0 0 0 16502 2818 5653 2 9 88 1 0 1341306614.6754 33.8151 2.0003 -- 1 0 0 15220648 12632 594936 0 0 0 3584 3451 6779 2 11 87 0 0 1341306616.6755 35.8152 2.0001 -- 4 0 0 15221252 12632 649296 0 0 0 38559 4846 8709 2 15 78 6 0 1341306618.6758 37.8155 2.0003 -- 1 0 0 15177724 12640 668476 0 0 20 40679 2204 4067 1 5 89 5 0 1341306620.6761 39.8158 2.0003 -- 1 0 0 15090204 12640 711752 0 0 0 0 3316 6788 2 11 88 0 0 1341306622.6764 41.8160 2.0003 -- 1 0 0 15005356 12640 748532 0 0 0 12288 3073 6132 2 10 89 0 0 1341306624.6766 43.8163 
2.0002 -- 2 0 0 14913088 12648 791952 0 0 0 28 3408 6806 2 11 87 0 0 1341306626.6769 45.8166 2.0003 -- 1 0 0 14891512 12648 826328 0 0 0 12420 2906 5710 1 9 90 0 0 1341306628.6772 47.8168 2.0003 -- 1 0 0 14794316 12656 868936 0 0 0 26 3367 6798 2 11 87 0 0 1341306630.6774 49.8171 2.0003 -- 1 0 0 14769188 12656 905016 0 0 30 12324 3029 5876 2 10 89 0 0 1341306632.6777 51.8173 2.0002 -- 1 0 0 14679544 12656 947712 0 0 0 0 3399 6868 2 11 87 0 0 1341306634.6780 53.8176 2.0003 -- 1 0 0 14646156 12664 982032 0 0 0 14658 2987 5761 1 10 89 0 0 1341306636.6782 55.8179 2.0003 -- 1 0 0 14560504 12664 1023816 0 0 0 4404 3454 6876 2 11 87 0 0 1341306638.6783 57.8180 2.0001 -- 2 0 0 14533384 12664 1056812 0 0 0 15810 3002 5581 1 10 89 0 0 1341306640.6785 59.8182 2.0002 -- 1 0 0 14593332 12672 1027392 0 0 0 31790 3504 1811 1 13 78 8 0 1341306642.6787 61.8183 2.0001 -- 1 0 0 14686968 12672 1007604 0 0 0 14621 2434 1248 1 10 89 0 0 1341306644.6789 63.8185 2.0002 -- 1 1 0 15042476 12680 788104 0 0 0 36564 2809 1484 1 12 86 1 0 1341306646.6790 65.8187 2.0002 -- 1 0 0 15128292 12680 757948 0 0 0 26395 3050 1313 1 13 86 1 0 1341306648.6792 67.8189 2.0002 -- 1 0 0 15160036 12680 727964 0 0 0 5463 2752 910 1 12 87 0 0 1341306650.6795 69.8192 2.0003 -- 0 0 0 15633256 12688 332572 0 0 1156 12308 2117 2346 1 7 91 1 0 1341306652.6797 71.8194 2.0002 -- 0 0 0 15633892 12688 332652 0 0 0 0 224 758 0 0 100 0 0 1341306654.6800 73.8197 2.0003 -- 0 0 0 15633900 12688 332524 0 0 0 0 231 1009 0 0 100 0 0 1341306656.6803 75.8199 2.0003 -- 0 0 0 15637436 12696 332504 0 0 0 38 266 713 0 0 99 0 0 1341306658.6805 77.8202 2.0003 -- 0 0 0 15654180 12696 332352 0 0 0 0 270 821 0 0 100 0 0 REVERT-XFS 1341307733.8702 1.7941 1.7941 -- 0 0 0 16050640 12036 103996 0 0 372 0 216 752 0 0 99 1 0 1341307735.8704 3.7944 2.0002 -- 0 0 0 16050864 12036 104028 0 0 0 0 132 857 0 0 100 0 0 1341307737.8707 5.7946 2.0002 -- 0 0 0 16047492 12048 104252 0 0 102 37 255 938 0 0 99 1 0 1341307739.8709 7.7949 2.0003 -- 0 0 0 
16047600 12072 104324 0 0 32 2 129 658 0 0 100 0 0 1341307741.8712 9.7951 2.0002 -- 1 1 0 16046676 12080 104328 0 0 0 32 165 729 0 0 100 0 0 1341307743.8714 11.7954 2.0003 -- 0 1 0 15990840 13216 142612 0 0 19422 30 1467 3015 2 1 89 8 0 1341307745.8717 13.7956 2.0002 -- 0 0 0 15825496 13216 226396 0 0 1310 11214 2217 1348 2 8 89 1 0 1341307747.8717 15.7957 2.0001 -- 1 0 0 15677816 13224 314672 0 0 4 15294 2307 1173 2 9 89 0 0 1341307749.8719 17.7959 2.0002 -- 1 0 0 15524372 13224 409728 0 0 0 12288 2466 888 1 10 89 0 0 1341307751.8721 19.7960 2.0002 -- 1 0 0 15368424 13224 502552 0 0 0 12416 2312 878 1 10 89 0 0 1341307753.8722 21.7962 2.0002 -- 1 0 0 15225216 13232 593092 0 0 0 12448 2539 1380 1 10 88 0 0 1341307755.8724 23.7963 2.0002 -- 2 0 0 15163712 13232 664768 0 0 0 32160 2184 1177 1 8 90 0 0 1341307757.8727 25.7967 2.0003 -- 1 0 0 14973888 13240 755080 0 0 0 12316 2482 1219 1 10 89 0 0 1341307759.8728 27.7968 2.0001 -- 1 0 0 14883580 13240 840036 0 0 0 44471 2711 1234 2 10 88 0 0 1341307761.8730 29.7970 2.0002 -- 1 0 0 14800304 13240 920504 0 0 0 42554 2571 1050 1 10 89 0 0 1341307763.8734 31.7973 2.0003 -- 0 0 0 14642504 13248 995004 0 0 0 3232 2276 1081 1 8 90 0 0 1341307765.8737 33.7976 2.0003 -- 1 0 0 14545072 13248 1052536 0 0 0 18688 2628 1114 1 9 89 0 0 1341307767.8739 35.7979 2.0003 -- 1 0 0 14783848 13248 926824 0 0 0 59559 2409 1308 0 10 89 1 0 1341307769.8740 37.7980 2.0001 -- 2 0 0 14854800 13256 896832 0 0 0 9172 2419 1004 1 10 89 1 0 1341307771.8742 39.7981 2.0002 -- 2 0 0 14835084 13256 875612 0 0 0 12288 2524 812 0 11 89 0 0 1341307773.8743 41.7983 2.0002 -- 2 0 0 15126252 13256 745844 0 0 0 10297 2714 1163 1 12 88 0 0 1341307775.8745 43.7985 2.0002 -- 1 0 0 15108800 13264 724544 0 0 0 12316 2499 931 1 11 88 0 0 1341307777.8746 45.7986 2.0001 -- 2 0 0 15226236 13264 694580 0 0 0 12416 2700 1194 1 12 88 0 0 1341307779.8750 47.7989 2.0003 -- 1 0 0 15697632 13264 300716 0 0 1156 0 934 1701 0 2 96 1 0 1341307781.8752 49.7992 2.0003 -- 0 0 0 
15697508 13272 300720 0 0 0 66 166 641 0 0 100 0 0 1341307783.8755 51.7995 2.0003 -- 0 0 0 15699008 13272 300524 0 0 0 0 248 865 0 0 100 0 0 1341307785.8758 53.7997 2.0003 -- 0 0 0 15702452 13272 300520 0 0 0 0 285 960 0 0 99 0 0 1341307787.8760 55.7999 2.0002 -- 0 0 0 15719404 13280 300436 0 0 0 26 136 590 0 0 99 0 0 Vanilla average context switch rate 4278.53 Revert average context switch rate 1095 > Can you run latencytop to see > if there is excessive starvation/wait times for allocation > completion? I'm not sure what format you are looking for. latencytop is shit for capturing information throughout a test and it does not easily allow you to record a snapshot of a test. You can record all the console output of course but that's a complete mess. I tried capturing /proc/latency_stats over time instead because that can be trivially sorted on a system-wide basis but as I write this I find that latency_stats was bust. It was just spitting out Latency Top version : v0.1 and nothing else. Either latency_stats is broken or my config is. Not sure which it is right now and won't get enough time on this today to pinpoint it. > A pert top profile comparison might be informative, > too... > I'm not sure if this is what you really wanted. I thought an oprofile or perf report would have made more sense but I recorded perf top over time anyway and it's at the end of the mail. The timestamp information is poor because the perf top information was buffered so it would receive a bunch of updates at once. Each sample should be roughly 2 seconds apart. This buffering can be dealt with, I just failed to do it in advance and I do not think it's necessary to rerun the tests for it. > (*) The stack usage below submit_bio() can be more than 5k (DM, MD, > SCSI, driver, memory allocation), so it's really not safe to do > allocation anywhere below about 3k of kernel stack being used. e.g. 
> on a relatively trivial storage setup without the above commit:
>
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
>
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.
>

I understand the patch's motivation. For these tests I'm deliberately
being a bit of a dummy and just capturing information. This might
allow me to actually get through all the results, identify some of the
problems and spread them around a bit. Either that or I need to clone
myself a few times to tackle each of the problems in a reasonable
timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to

http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results with no monitors attached can be found under paths like

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are in run-monitor. For example, the
iostat logs can be read from

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.

This is perf top over time for the vanilla kernel:

time: 1341306570 time: 1341306579 PerfTop: 1 irqs/sec kernel: 0.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 61.85% [kernel] [k] __rmqueue 38.15% libc-2.11.3.so [.] _IO_vfscanf time: 1341306579 PerfTop: 3 irqs/sec kernel:66.7% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 19.88% [kernel] [k] _raw_spin_lock_irqsave 17.14% [kernel] [k] __rmqueue 16.96% [kernel] [k] format_decode 15.37% libc-2.11.3.so [.] __tzfile_compute 13.55% [kernel] [k] copy_user_generic_string 10.57% libc-2.11.3.so [.]
_IO_vfscanf 6.53% [kernel] [k] find_first_bit time: 1341306579 PerfTop: 0 irqs/sec kernel:-nan% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 17.51% [kernel] [k] _raw_spin_lock_irqsave 15.10% [kernel] [k] __rmqueue 14.94% [kernel] [k] format_decode 13.54% libc-2.11.3.so [.] __tzfile_compute 11.94% [kernel] [k] copy_user_generic_string 11.90% [kernel] [k] _raw_spin_lock 9.31% libc-2.11.3.so [.] _IO_vfscanf 5.75% [kernel] [k] find_first_bit time: 1341306579 PerfTop: 41 irqs/sec kernel:58.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 13.62% [kernel] [k] _raw_spin_lock_irqsave 11.02% [kernel] [k] __rmqueue 10.91% [kernel] [k] format_decode 9.89% libc-2.11.3.so [.] __tzfile_compute 8.72% [kernel] [k] copy_user_generic_string 8.69% [kernel] [k] _raw_spin_lock 7.15% libc-2.11.3.so [.] _IO_vfscanf 4.20% [kernel] [k] find_first_bit 1.47% libc-2.11.3.so [.] __strcmp_sse42 1.37% libc-2.11.3.so [.] __strchr_sse42 1.19% sed [.] 0x0000000000009f7d 0.90% libc-2.11.3.so [.] vfprintf 0.84% [kernel] [k] hrtimer_interrupt 0.84% libc-2.11.3.so [.] re_string_realloc_buffers 0.76% [kernel] [k] enqueue_entity 0.66% [kernel] [k] __switch_to 0.65% libc-2.11.3.so [.] _IO_default_xsputn 0.62% [kernel] [k] do_vfs_ioctl 0.59% [kernel] [k] perf_event_mmap_event 0.56% gzip [.] 0x0000000000007b96 0.55% libc-2.11.3.so [.] bsearch time: 1341306579 PerfTop: 35 irqs/sec kernel:62.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.50% [kernel] [k] _raw_spin_lock_irqsave 9.22% [kernel] [k] __rmqueue 9.13% [kernel] [k] format_decode 8.27% libc-2.11.3.so [.] __tzfile_compute 7.92% [kernel] [k] copy_user_generic_string 7.74% [kernel] [k] _raw_spin_lock 6.21% libc-2.11.3.so [.] _IO_vfscanf 3.51% [kernel] [k] find_first_bit 1.44% gzip [.] 
0x0000000000007b96 1.23% libc-2.11.3.so [.] __strcmp_sse42 1.15% libc-2.11.3.so [.] __strchr_sse42 1.06% libc-2.11.3.so [.] vfprintf 0.99% sed [.] 0x0000000000009f7d 0.92% [unknown] [.] 0x00007f84a7766b99 0.70% [kernel] [k] hrtimer_interrupt 0.70% libc-2.11.3.so [.] re_string_realloc_buffers 0.64% [kernel] [k] enqueue_entity 0.58% libtcl8.5.so [.] 0x000000000006fe86 0.55% [kernel] [k] __switch_to 0.54% libc-2.11.3.so [.] _IO_default_xsputn 0.53% [kernel] [k] __d_lookup_rcu time: 1341306585 PerfTop: 100 irqs/sec kernel:59.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 8.61% [kernel] [k] _raw_spin_lock_irqsave 5.92% [kernel] [k] __rmqueue 5.86% [kernel] [k] format_decode 5.31% libc-2.11.3.so [.] __tzfile_compute 5.30% [kernel] [k] copy_user_generic_string 5.27% [kernel] [k] _raw_spin_lock 3.99% libc-2.11.3.so [.] _IO_vfscanf 2.45% [unknown] [.] 0x00007f84a7766b99 2.26% [kernel] [k] find_first_bit 1.68% [kernel] [k] page_fault 1.45% libc-2.11.3.so [.] _int_malloc 1.28% gzip [.] 0x0000000000007b96 1.13% libc-2.11.3.so [.] vfprintf 1.06% libc-2.11.3.so [.] __strchr_sse42 1.02% perl [.] 0x0000000000044505 0.79% libc-2.11.3.so [.] __strcmp_sse42 0.79% [kernel] [k] do_task_stat 0.77% [kernel] [k] zap_pte_range 0.72% libc-2.11.3.so [.] __gconv_transform_utf8_internal 0.70% libc-2.11.3.so [.] malloc 0.70% libc-2.11.3.so [.] __mbrtowc time: 1341306585 PerfTop: 19 irqs/sec kernel:78.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 7.97% [kernel] [k] _raw_spin_lock_irqsave 5.48% [kernel] [k] __rmqueue 5.43% [kernel] [k] format_decode 5.24% [kernel] [k] copy_user_generic_string 5.18% [kernel] [k] _raw_spin_lock 4.92% libc-2.11.3.so [.] __tzfile_compute 4.25% libc-2.11.3.so [.] _IO_vfscanf 2.33% [unknown] [.] 0x00007f84a7766b99 2.12% [kernel] [k] page_fault 2.09% [kernel] [k] find_first_bit 1.34% libc-2.11.3.so [.] 
_int_malloc 1.19% gzip [.] 0x0000000000007b96 1.05% libc-2.11.3.so [.] vfprintf 0.98% libc-2.11.3.so [.] __strchr_sse42 0.94% perl [.] 0x0000000000044505 0.94% libc-2.11.3.so [.] _dl_addr 0.91% [kernel] [k] zap_pte_range 0.74% [kernel] [k] s_show 0.73% libc-2.11.3.so [.] __strcmp_sse42 0.73% [kernel] [k] do_task_stat 0.67% libc-2.11.3.so [.] __gconv_transform_utf8_internal time: 1341306585 PerfTop: 38 irqs/sec kernel:68.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 7.64% [kernel] [k] _raw_spin_lock_irqsave 4.89% [kernel] [k] _raw_spin_lock 4.77% [kernel] [k] __rmqueue 4.72% [kernel] [k] format_decode 4.56% [kernel] [k] copy_user_generic_string 4.53% libc-2.11.3.so [.] _IO_vfscanf 4.28% libc-2.11.3.so [.] __tzfile_compute 2.52% [unknown] [.] 0x00007f84a7766b99 no symbols found in /bin/sort, maybe install a debug package? 2.10% [kernel] [k] page_fault 1.82% [kernel] [k] find_first_bit 1.31% libc-2.11.3.so [.] _int_malloc 1.14% libc-2.11.3.so [.] vfprintf 1.08% libc-2.11.3.so [.] _dl_addr 1.07% [kernel] [k] s_show 1.05% libc-2.11.3.so [.] __strchr_sse42 1.03% gzip [.] 0x0000000000007b96 0.82% [kernel] [k] do_task_stat 0.82% perl [.] 0x0000000000044505 0.79% [kernel] [k] zap_pte_range 0.70% [kernel] [k] seq_put_decimal_ull 0.69% [kernel] [k] find_busiest_group time: 1341306591 PerfTop: 66 irqs/sec kernel:59.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 6.52% [kernel] [k] _raw_spin_lock_irqsave 4.11% libc-2.11.3.so [.] _IO_vfscanf 3.91% [kernel] [k] _raw_spin_lock 3.50% [kernel] [k] copy_user_generic_string 3.41% [kernel] [k] __rmqueue 3.38% [kernel] [k] format_decode 3.06% libc-2.11.3.so [.] __tzfile_compute 2.90% [unknown] [.] 0x00007f84a7766b99 2.30% [kernel] [k] page_fault 2.20% perl [.] 0x0000000000044505 1.83% libc-2.11.3.so [.] vfprintf 1.61% libc-2.11.3.so [.] 
_int_malloc 1.30% [kernel] [k] find_first_bit 1.22% libc-2.11.3.so [.] _dl_addr 1.19% libc-2.11.3.so [.] __gconv_transform_utf8_internal 1.10% libc-2.11.3.so [.] __strchr_sse42 1.01% [kernel] [k] zap_pte_range 0.99% [kernel] [k] s_show 0.98% [kernel] [k] __percpu_counter_add 0.86% [kernel] [k] __strnlen_user 0.75% ld-2.11.3.so [.] do_lookup_x time: 1341306591 PerfTop: 39 irqs/sec kernel:69.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 6.26% [kernel] [k] _raw_spin_lock_irqsave 4.05% [kernel] [k] _raw_spin_lock 3.86% libc-2.11.3.so [.] _IO_vfscanf 3.21% [kernel] [k] copy_user_generic_string 3.03% [kernel] [k] __rmqueue 3.00% [kernel] [k] format_decode 2.93% [unknown] [.] 0x00007f84a7766b99 2.72% libc-2.11.3.so [.] __tzfile_compute 2.20% [kernel] [k] page_fault 1.96% perl [.] 0x0000000000044505 1.77% libc-2.11.3.so [.] vfprintf 1.43% libc-2.11.3.so [.] _int_malloc 1.16% [kernel] [k] find_first_bit 1.09% libc-2.11.3.so [.] _dl_addr 1.06% libc-2.11.3.so [.] __gconv_transform_utf8_internal 1.02% [kernel] [k] s_show 0.98% libc-2.11.3.so [.] __strchr_sse42 0.93% gzip [.] 0x0000000000007b96 0.90% [kernel] [k] zap_pte_range 0.87% [kernel] [k] __percpu_counter_add 0.76% [kernel] [k] __strnlen_user time: 1341306591 PerfTop: 185 irqs/sec kernel:70.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 4.81% [kernel] [k] _raw_spin_lock_irqsave 3.60% [unknown] [.] 0x00007f84a7766b99 3.10% [kernel] [k] _raw_spin_lock 3.04% [kernel] [k] page_fault 2.66% libc-2.11.3.so [.] _IO_vfscanf 2.14% [kernel] [k] copy_user_generic_string 2.11% [kernel] [k] format_decode 1.96% [kernel] [k] __rmqueue 1.86% libc-2.11.3.so [.] _dl_addr 1.76% libc-2.11.3.so [.] __tzfile_compute 1.26% perl [.] 0x0000000000044505 1.19% libc-2.11.3.so [.] __mbrtowc 1.14% libc-2.11.3.so [.] vfprintf 1.12% libc-2.11.3.so [.] _int_malloc 1.09% gzip [.] 
0x0000000000007b96 0.95% libc-2.11.3.so [.] __gconv_transform_utf8_internal 0.88% [kernel] [k] _raw_spin_unlock_irqrestore 0.87% [kernel] [k] __strnlen_user 0.82% [kernel] [k] clear_page_c 0.77% [kernel] [k] __schedule 0.76% [kernel] [k] find_get_page time: 1341306595 PerfTop: 385 irqs/sec kernel:48.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 27.20% cc1 [.] 0x0000000000210978 3.01% [unknown] [.] 0x00007f84a7766b99 2.18% [kernel] [k] page_fault 1.96% libbfd-2.21.so [.] 0x00000000000b9cdd 1.95% [kernel] [k] _raw_spin_lock_irqsave 1.91% ld.bfd [.] 0x000000000000e3b9 1.85% [kernel] [k] _raw_spin_lock 1.31% [kernel] [k] copy_user_generic_string 1.20% libbfd-2.21.so [.] bfd_hash_lookup 1.10% libc-2.11.3.so [.] __strcmp_sse42 0.93% libc-2.11.3.so [.] _IO_vfscanf 0.85% [kernel] [k] _raw_spin_unlock_irqrestore 0.82% libc-2.11.3.so [.] _int_malloc 0.80% [kernel] [k] __rmqueue 0.79% [kernel] [k] kmem_cache_alloc 0.74% [kernel] [k] format_decode 0.71% libc-2.11.3.so [.] _dl_addr 0.62% libbfd-2.21.so [.] _bfd_final_link_relocate 0.61% libc-2.11.3.so [.] __tzfile_compute 0.61% libc-2.11.3.so [.] vfprintf 0.59% [kernel] [k] find_busiest_group time: 1341306595 PerfTop: 1451 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 9.75% cc1 [.] 0x0000000000210978 8.81% [unknown] [.] 
0x00007f84a7766b99 4.62% [kernel] [k] page_fault 3.61% [kernel] [k] _raw_spin_lock 2.67% [kernel] [k] memcpy 2.03% [kernel] [k] _raw_spin_lock_irqsave 2.00% [kernel] [k] kmem_cache_alloc 1.64% [xfs] [k] _xfs_buf_find 1.31% [kernel] [k] _raw_spin_unlock_irqrestore 1.16% [kernel] [k] kmem_cache_free 1.15% [xfs] [k] xfs_next_bit 0.98% [kernel] [k] __d_lookup 0.89% [xfs] [k] xfs_da_do_buf 0.83% [kernel] [k] memset 0.80% [xfs] [k] xfs_dir2_node_addname_int 0.79% [kernel] [k] link_path_walk 0.76% [xfs] [k] xfs_buf_item_size no symbols found in /usr/bin/tee, maybe install a debug package? no symbols found in /bin/date, maybe install a debug package? 0.73% [xfs] [k] xfs_buf_offset 0.71% [kernel] [k] __kmalloc 0.70% [kernel] [k] kfree 0.70% libbfd-2.21.so [.] 0x00000000000b9cdd time: 1341306601 PerfTop: 1267 irqs/sec kernel:85.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.81% [unknown] [.] 0x00007f84a7766b99 5.98% cc1 [.] 0x0000000000210978 5.20% [kernel] [k] page_fault 3.54% [kernel] [k] _raw_spin_lock 3.37% [kernel] [k] memcpy 2.03% [kernel] [k] kmem_cache_alloc 1.91% [kernel] [k] _raw_spin_lock_irqsave 1.75% [xfs] [k] _xfs_buf_find 1.35% [kernel] [k] _raw_spin_unlock_irqrestore 1.28% [xfs] [k] xfs_next_bit 1.14% [kernel] [k] kmem_cache_free 1.13% [kernel] [k] __kmalloc 1.12% [kernel] [k] __d_lookup 0.97% [xfs] [k] xfs_dir2_node_addname_int 0.96% [xfs] [k] xfs_buf_offset 0.95% [kernel] [k] memset 0.91% [kernel] [k] link_path_walk 0.88% [xfs] [k] xfs_da_do_buf 0.85% [kernel] [k] kfree 0.84% [xfs] [k] xfs_buf_item_size 0.74% [xfs] [k] xfs_btree_lookup time: 1341306601 PerfTop: 1487 irqs/sec kernel:85.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.84% [unknown] [.] 0x00007f84a7766b99 5.15% [kernel] [k] page_fault 3.93% cc1 [.] 
0x0000000000210978 3.76% [kernel] [k] _raw_spin_lock 3.50% [kernel] [k] memcpy 2.13% [kernel] [k] kmem_cache_alloc 1.91% [xfs] [k] _xfs_buf_find 1.79% [kernel] [k] _raw_spin_lock_irqsave 1.52% [kernel] [k] __kmalloc 1.33% [kernel] [k] kmem_cache_free 1.32% [kernel] [k] _raw_spin_unlock_irqrestore 1.29% [kernel] [k] __d_lookup 1.27% [xfs] [k] xfs_next_bit 1.11% [kernel] [k] link_path_walk 1.01% [xfs] [k] xfs_buf_offset 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_da_do_buf 0.97% [kernel] [k] kfree 0.96% [kernel] [k] memset 0.84% [xfs] [k] xfs_btree_lookup 0.82% [xfs] [k] xfs_buf_item_format time: 1341306601 PerfTop: 1291 irqs/sec kernel:85.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.21% [unknown] [.] 0x00007f84a7766b99 5.18% [kernel] [k] page_fault 3.83% [kernel] [k] _raw_spin_lock 3.67% [kernel] [k] memcpy 2.92% cc1 [.] 0x0000000000210978 2.28% [kernel] [k] kmem_cache_alloc 2.18% [xfs] [k] _xfs_buf_find 1.66% [kernel] [k] _raw_spin_lock_irqsave 1.56% [kernel] [k] __kmalloc 1.43% [kernel] [k] __d_lookup 1.43% [kernel] [k] _raw_spin_unlock_irqrestore 1.40% [kernel] [k] kmem_cache_free 1.29% [xfs] [k] xfs_next_bit 1.13% [xfs] [k] xfs_buf_offset 1.07% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_da_do_buf 1.01% [kernel] [k] memset 1.01% [kernel] [k] kfree 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.87% [xfs] [k] xfs_buf_item_size 0.84% [xfs] [k] xfs_btree_lookup time: 1341306607 PerfTop: 1435 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.06% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] page_fault 3.88% [kernel] [k] _raw_spin_lock 3.83% [kernel] [k] memcpy 2.41% [xfs] [k] _xfs_buf_find 2.35% [kernel] [k] kmem_cache_alloc 2.19% cc1 [.] 
0x0000000000210978 1.68% [kernel] [k] _raw_spin_lock_irqsave 1.55% [kernel] [k] __kmalloc 1.48% [kernel] [k] __d_lookup 1.43% [kernel] [k] kmem_cache_free 1.42% [kernel] [k] _raw_spin_unlock_irqrestore 1.37% [xfs] [k] xfs_next_bit 1.27% [xfs] [k] xfs_buf_offset 1.12% [kernel] [k] link_path_walk 1.09% [kernel] [k] kfree 1.08% [kernel] [k] memset 1.04% [xfs] [k] xfs_da_do_buf 0.99% [xfs] [k] xfs_dir2_node_addname_int 0.92% [xfs] [k] xfs_buf_item_size 0.89% [xfs] [k] xfs_btree_lookup time: 1341306607 PerfTop: 1281 irqs/sec kernel:87.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.00% [unknown] [.] 0x00007f84a7766b99 5.44% [kernel] [k] page_fault 4.04% [kernel] [k] _raw_spin_lock 3.94% [kernel] [k] memcpy 2.51% [xfs] [k] _xfs_buf_find 2.32% [kernel] [k] kmem_cache_alloc 1.75% cc1 [.] 0x0000000000210978 1.66% [kernel] [k] _raw_spin_lock_irqsave 1.58% [kernel] [k] __d_lookup 1.56% [kernel] [k] __kmalloc 1.46% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] kmem_cache_free 1.41% [kernel] [k] _raw_spin_unlock_irqrestore 1.34% [xfs] [k] xfs_buf_offset 1.20% [kernel] [k] link_path_walk 1.16% [kernel] [k] kfree 1.11% [kernel] [k] memset 1.04% [xfs] [k] xfs_dir2_node_addname_int 0.94% [xfs] [k] xfs_da_do_buf 0.92% [xfs] [k] xfs_btree_lookup 0.89% [xfs] [k] xfs_buf_item_size time: 1341306607 PerfTop: 1455 irqs/sec kernel:86.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.14% [unknown] [.] 0x00007f84a7766b99 5.36% [kernel] [k] page_fault 4.12% [kernel] [k] _raw_spin_lock 4.02% [kernel] [k] memcpy 2.54% [xfs] [k] _xfs_buf_find 2.41% [kernel] [k] kmem_cache_alloc 1.69% [kernel] [k] _raw_spin_lock_irqsave 1.56% [kernel] [k] __kmalloc 1.49% [xfs] [k] xfs_next_bit 1.47% [kernel] [k] __d_lookup 1.42% [kernel] [k] _raw_spin_unlock_irqrestore 1.39% [xfs] [k] xfs_buf_offset 1.39% cc1 [.] 
0x0000000000210978 1.37% [kernel] [k] kmem_cache_free 1.24% [kernel] [k] link_path_walk 1.17% [kernel] [k] memset 1.16% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.92% [xfs] [k] xfs_da_do_buf 0.91% [xfs] [k] xfs_btree_lookup time: 1341306613 PerfTop: 1245 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.05% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] page_fault 4.10% [kernel] [k] _raw_spin_lock 4.06% [kernel] [k] memcpy 2.74% [xfs] [k] _xfs_buf_find 2.40% [kernel] [k] kmem_cache_alloc 1.64% [kernel] [k] _raw_spin_lock_irqsave 1.58% [xfs] [k] xfs_next_bit 1.54% [kernel] [k] __kmalloc 1.49% [kernel] [k] __d_lookup 1.45% [kernel] [k] _raw_spin_unlock_irqrestore 1.41% [kernel] [k] kmem_cache_free 1.35% [xfs] [k] xfs_buf_offset 1.25% [kernel] [k] link_path_walk 1.22% [kernel] [k] kfree 1.16% [kernel] [k] memset 1.15% cc1 [.] 0x0000000000210978 1.02% [xfs] [k] xfs_buf_item_size 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.92% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_da_do_buf time: 1341306613 PerfTop: 1433 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.04% [unknown] [.] 0x00007f84a7766b99 5.30% [kernel] [k] page_fault 4.08% [kernel] [k] memcpy 4.07% [kernel] [k] _raw_spin_lock 2.88% [xfs] [k] _xfs_buf_find 2.50% [kernel] [k] kmem_cache_alloc 1.72% [kernel] [k] _raw_spin_lock_irqsave 1.68% [xfs] [k] xfs_next_bit 1.56% [kernel] [k] __d_lookup 1.54% [kernel] [k] __kmalloc 1.48% [kernel] [k] _raw_spin_unlock_irqrestore 1.46% [kernel] [k] kmem_cache_free 1.40% [xfs] [k] xfs_buf_offset 1.25% [kernel] [k] link_path_walk 1.21% [kernel] [k] memset 1.18% [kernel] [k] kfree 1.04% [xfs] [k] xfs_buf_item_size 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.95% [xfs] [k] xfs_btree_lookup 0.94% cc1 [.] 
0x0000000000210978 0.90% [xfs] [k] xfs_da_do_buf time: 1341306613 PerfTop: 1118 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- no symbols found in /usr/bin/vmstat, maybe install a debug package? 12.03% [unknown] [.] 0x00007f84a7766b99 5.48% [kernel] [k] page_fault 4.21% [kernel] [k] memcpy 4.11% [kernel] [k] _raw_spin_lock 2.98% [xfs] [k] _xfs_buf_find 2.47% [kernel] [k] kmem_cache_alloc 1.81% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.48% [kernel] [k] kmem_cache_free 1.48% [kernel] [k] __d_lookup 1.47% [kernel] [k] _raw_spin_unlock_irqrestore 1.39% [xfs] [k] xfs_buf_offset 1.23% [kernel] [k] link_path_walk 1.19% [kernel] [k] memset 1.19% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.98% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_buf_item_format 0.91% [xfs] [k] xfs_da_do_buf time: 1341306617 PerfTop: 1454 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.93% [unknown] [.] 
0x00007f84a7766b99 5.42% [kernel] [k] page_fault 4.28% [kernel] [k] memcpy 4.20% [kernel] [k] _raw_spin_lock 3.15% [xfs] [k] _xfs_buf_find 2.52% [kernel] [k] kmem_cache_alloc 1.76% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.51% [kernel] [k] __kmalloc 1.49% [kernel] [k] _raw_spin_unlock_irqrestore 1.48% [kernel] [k] kmem_cache_free 1.40% [xfs] [k] xfs_buf_offset 1.29% [kernel] [k] memset 1.20% [kernel] [k] link_path_walk 1.17% [kernel] [k] kfree 1.09% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.95% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf 0.91% [xfs] [k] xfs_buf_item_format time: 1341306617 PerfTop: 1758 irqs/sec kernel:90.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.99% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.04% [kernel] [k] memcpy 3.86% [xfs] [k] _xfs_buf_find 2.31% [kernel] [k] kmem_cache_alloc 2.03% [kernel] [k] _raw_spin_lock_irqsave 1.67% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [kernel] [k] __d_lookup 1.60% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.36% [xfs] [k] xfs_buf_offset 1.35% [kernel] [k] kmem_cache_free 1.17% [kernel] [k] kfree 1.16% [kernel] [k] memset 1.08% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.97% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_perag_put 0.90% [xfs] [k] xfs_buf_item_size 0.84% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1022 irqs/sec kernel:88.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.93% [unknown] [.] 
0x00007f84a7766b99 5.34% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.01% [kernel] [k] memcpy 4.01% [xfs] [k] _xfs_buf_find 2.28% [kernel] [k] kmem_cache_alloc 2.00% [kernel] [k] _raw_spin_lock_irqsave 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.41% [kernel] [k] __kmalloc 1.39% [kernel] [k] kmem_cache_free 1.35% [xfs] [k] xfs_buf_offset 1.15% [kernel] [k] kfree 1.15% [kernel] [k] memset 1.09% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_perag_put 0.88% [xfs] [k] xfs_buf_item_size 0.86% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1430 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.05% [unknown] [.] 0x00007f84a7766b99 5.24% [kernel] [k] _raw_spin_lock 4.89% [kernel] [k] page_fault 4.13% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.35% [kernel] [k] kmem_cache_alloc 1.95% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.54% [xfs] [k] xfs_next_bit 1.42% [kernel] [k] __kmalloc 1.41% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.10% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_buf_item_size 0.87% [xfs] [k] xfs_da_do_buf 0.87% [xfs] [k] xfs_perag_put time: 1341306623 PerfTop: 1267 irqs/sec kernel:87.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
0x00007f84a7766b99 5.08% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.12% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.41% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.61% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.31% [xfs] [k] xfs_buf_offset no symbols found in /usr/bin/iostat, maybe install a debug package? 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_btree_lookup 0.95% [xfs] [k] xfs_buf_item_size 0.90% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_perag_put time: 1341306629 PerfTop: 1399 irqs/sec kernel:88.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.12% [unknown] [.] 0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.23% [kernel] [k] memcpy 4.03% [xfs] [k] _xfs_buf_find 2.37% [kernel] [k] kmem_cache_alloc 1.96% [kernel] [k] _raw_spin_lock_irqsave 1.69% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.45% [kernel] [k] __kmalloc 1.35% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.17% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.07% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_btree_lookup 1.02% [xfs] [k] xfs_buf_item_size 0.93% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_buf_item_format time: 1341306629 PerfTop: 1225 irqs/sec kernel:87.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.15% [unknown] [.] 
0x00007f84a7766b99 5.02% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.22% [kernel] [k] memcpy 4.19% [xfs] [k] _xfs_buf_find 2.32% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.71% [kernel] [k] __d_lookup 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.54% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.36% [kernel] [k] kmem_cache_free 1.28% [xfs] [k] xfs_buf_offset 1.14% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.99% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_da_do_buf 0.86% [kernel] [k] s_show time: 1341306629 PerfTop: 1400 irqs/sec kernel:87.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.23% [unknown] [.] 0x00007f84a7766b99 5.07% [kernel] [k] _raw_spin_lock 4.87% [kernel] [k] page_fault 4.27% [xfs] [k] _xfs_buf_find 4.18% [kernel] [k] memcpy 2.31% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.73% [kernel] [k] __d_lookup 1.66% [kernel] [k] _raw_spin_unlock_irqrestore 1.49% [xfs] [k] xfs_next_bit 1.49% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.29% [xfs] [k] xfs_buf_offset 1.11% [kernel] [k] kfree 1.07% [kernel] [k] memset 1.07% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.97% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_da_do_buf 0.89% [kernel] [k] s_show time: 1341306635 PerfTop: 1251 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
0x00007f84a7766b99 5.10% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.29% [xfs] [k] _xfs_buf_find 4.19% [kernel] [k] memcpy 2.26% [kernel] [k] kmem_cache_alloc 1.87% [kernel] [k] _raw_spin_lock_irqsave 1.83% [kernel] [k] __d_lookup 1.64% [kernel] [k] _raw_spin_unlock_irqrestore 1.53% [kernel] [k] __kmalloc 1.49% [xfs] [k] xfs_next_bit 1.41% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.10% [kernel] [k] link_path_walk 1.09% [kernel] [k] memset 1.08% [kernel] [k] kfree 1.03% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.98% [xfs] [k] xfs_btree_lookup 0.96% [kernel] [k] s_show 0.93% [xfs] [k] xfs_da_do_buf time: 1341306635 PerfTop: 1429 irqs/sec kernel:88.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.18% [unknown] [.] 0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.28% [xfs] [k] _xfs_buf_find 4.21% [kernel] [k] memcpy 2.23% [kernel] [k] kmem_cache_alloc 1.90% [kernel] [k] _raw_spin_lock_irqsave 1.83% [kernel] [k] __d_lookup 1.67% [kernel] [k] _raw_spin_unlock_irqrestore 1.52% [kernel] [k] __kmalloc 1.52% [xfs] [k] xfs_next_bit 1.36% [kernel] [k] kmem_cache_free 1.34% [xfs] [k] xfs_buf_offset 1.11% [kernel] [k] link_path_walk 1.11% [kernel] [k] memset 1.08% [kernel] [k] kfree 1.03% [xfs] [k] xfs_buf_item_size 1.03% [xfs] [k] xfs_dir2_node_addname_int 1.01% [kernel] [k] s_show 0.98% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf time: 1341306635 PerfTop: 1232 irqs/sec kernel:88.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.11% [unknown] [.] 
0x00007f84a7766b99
 5.13% [kernel] [k] _raw_spin_lock
 4.87% [kernel] [k] page_fault
 4.33% [xfs] [k] _xfs_buf_find
 4.16% [kernel] [k] memcpy
 2.24% [kernel] [k] kmem_cache_alloc
 1.84% [kernel] [k] _raw_spin_lock_irqsave
 1.82% [kernel] [k] __d_lookup
 1.65% [kernel] [k] _raw_spin_unlock_irqrestore
 1.50% [xfs] [k] xfs_next_bit
 1.49% [kernel] [k] __kmalloc
 1.34% [kernel] [k] kmem_cache_free
 1.32% [xfs] [k] xfs_buf_offset
 1.13% [kernel] [k] link_path_walk
 1.13% [kernel] [k] kfree
 1.11% [kernel] [k] memset
 1.06% [kernel] [k] s_show
 1.03% [xfs] [k] xfs_buf_item_size
 1.02% [xfs] [k] xfs_dir2_node_addname_int
 0.98% [xfs] [k] xfs_btree_lookup
 0.94% [xfs] [k] xfs_da_do_buf

time: 1341306639
PerfTop: 1444 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
11.19% [unknown] [.] 0x00007f84a7766b99
 5.10% [kernel] [k] _raw_spin_lock
 4.95% [kernel] [k] page_fault
 4.40% [xfs] [k] _xfs_buf_find
 4.10% [kernel] [k] memcpy
 2.20% [kernel] [k] kmem_cache_alloc
 1.93% [kernel] [k] _raw_spin_lock_irqsave
 1.81% [kernel] [k] __d_lookup
 1.59% [kernel] [k] _raw_spin_unlock_irqrestore
 1.50% [xfs] [k] xfs_next_bit
 1.48% [kernel] [k] __kmalloc
 1.37% [kernel] [k] kmem_cache_free
 1.36% [xfs] [k] xfs_buf_offset
 1.15% [kernel] [k] memset
 1.12% [kernel] [k] s_show
 1.12% [kernel] [k] link_path_walk
 1.10% [kernel] [k] kfree
 1.02% [xfs] [k] xfs_buf_item_size
 0.99% [xfs] [k] xfs_btree_lookup
 0.97% [xfs] [k] xfs_dir2_node_addname_int
 0.94% [xfs] [k] xfs_da_do_buf

time: 1341306639
PerfTop: 1195 irqs/sec kernel:90.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
10.00% [unknown] [.] 0x00007f84a7766b99
 5.17% [kernel] [k] _raw_spin_lock
 4.44% [xfs] [k] _xfs_buf_find
 4.37% [kernel] [k] page_fault
 4.37% [kernel] [k] memcpy
 2.30% [kernel] [k] kmem_cache_alloc
 1.90% [kernel] [k] _raw_spin_lock_irqsave
 1.63% [kernel] [k] __d_lookup
 1.62% [kernel] [k] _raw_spin_unlock_irqrestore
 1.59% [xfs] [k] xfs_buf_offset
 1.50% [kernel] [k] kmem_cache_free
 1.50% [kernel] [k] __kmalloc
 1.49% [xfs] [k] xfs_next_bit
 1.33% [kernel] [k] memset
 1.28% [kernel] [k] kfree
 1.11% [xfs] [k] xfs_buf_item_size
 1.07% [kernel] [k] s_show
 1.07% [kernel] [k] link_path_walk
 0.93% [xfs] [k] xfs_btree_lookup
 0.90% [xfs] [k] xfs_da_do_buf
 0.84% [xfs] [k] xfs_dir2_node_addname_int

time: 1341306645
PerfTop: 1097 irqs/sec kernel:95.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 7.51% [unknown] [.] 0x00007f84a7766b99
 5.02% [kernel] [k] _raw_spin_lock
 4.63% [kernel] [k] memcpy
 4.51% [xfs] [k] _xfs_buf_find
 3.32% [kernel] [k] page_fault
 2.37% [kernel] [k] kmem_cache_alloc
 1.87% [kernel] [k] _raw_spin_lock_irqsave
 1.77% [xfs] [k] xfs_buf_offset
 1.75% [kernel] [k] __kmalloc
 1.73% [xfs] [k] xfs_next_bit
 1.65% [kernel] [k] memset
 1.60% [kernel] [k] _raw_spin_unlock_irqrestore
 1.53% [kernel] [k] kfree
 1.52% [kernel] [k] kmem_cache_free
 1.44% [xfs] [k] xfs_trans_ail_cursor_first
 1.26% [kernel] [k] __d_lookup
 1.26% [xfs] [k] xfs_buf_item_size
 1.05% [kernel] [k] s_show
 1.03% [xfs] [k] xfs_buf_item_format
 0.92% [kernel] [k] __d_lookup_rcu
 0.87% [kernel] [k] link_path_walk

time: 1341306645
PerfTop: 1038 irqs/sec kernel:95.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.89% [unknown] [.] 0x00007f84a7766b99
 5.18% [kernel] [k] memcpy
 4.60% [xfs] [k] _xfs_buf_find
 4.52% [kernel] [k] _raw_spin_lock
 2.60% [kernel] [k] page_fault
 2.42% [kernel] [k] kmem_cache_alloc
 1.99% [kernel] [k] __kmalloc
 1.96% [xfs] [k] xfs_next_bit
 1.93% [xfs] [k] xfs_buf_offset
 1.84% [kernel] [k] _raw_spin_lock_irqsave
 1.83% [kernel] [k] memset
 1.80% [kernel] [k] kmem_cache_free
 1.68% [kernel] [k] kfree
 1.47% [xfs] [k] xfs_buf_item_size
 1.45% [kernel] [k] _raw_spin_unlock_irqrestore
 1.18% [kernel] [k] __d_lookup_rcu
 1.13% [xfs] [k] xfs_trans_ail_cursor_first
 1.12% [xfs] [k] xfs_buf_item_format
 1.04% [kernel] [k] s_show
 1.01% [kernel] [k] __d_lookup
 0.93% [xfs] [k] xfs_da_do_buf

time: 1341306645
PerfTop: 1087 irqs/sec kernel:96.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.27% [kernel] [k] memcpy
 4.77% [unknown] [.] 0x00007f84a7766b99
 4.69% [xfs] [k] _xfs_buf_find
 4.56% [kernel] [k] _raw_spin_lock
 2.47% [kernel] [k] kmem_cache_alloc
 2.18% [xfs] [k] xfs_next_bit
 2.11% [kernel] [k] page_fault
 2.00% [xfs] [k] xfs_buf_offset
 1.99% [kernel] [k] __kmalloc
 1.96% [kernel] [k] kmem_cache_free
 1.85% [kernel] [k] kfree
 1.82% [kernel] [k] memset
 1.75% [kernel] [k] _raw_spin_lock_irqsave
 1.58% [xfs] [k] xfs_buf_item_size
 1.41% [kernel] [k] _raw_spin_unlock_irqrestore
 1.23% [kernel] [k] __d_lookup_rcu
 1.21% [xfs] [k] xfs_buf_item_format
 0.99% [kernel] [k] s_show
 0.97% [xfs] [k] xfs_perag_put
 0.92% [xfs] [k] xfs_da_do_buf
 0.92% [xfs] [k] xfs_trans_ail_cursor_first

time: 1341306651
PerfTop: 1157 irqs/sec kernel:96.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.53% [kernel] [k] memcpy
 4.61% [xfs] [k] _xfs_buf_find
 4.40% [kernel] [k] _raw_spin_lock
 3.83% [unknown] [.] 0x00007f84a7766b99
 2.67% [kernel] [k] kmem_cache_alloc
 2.32% [xfs] [k] xfs_next_bit
 2.21% [kernel] [k] __kmalloc
 2.21% [xfs] [k] xfs_buf_offset
 2.19% [kernel] [k] kmem_cache_free
 1.92% [kernel] [k] memset
 1.89% [kernel] [k] kfree
 1.80% [xfs] [k] xfs_buf_item_size
 1.70% [kernel] [k] page_fault
 1.62% [kernel] [k] _raw_spin_lock_irqsave
 1.50% [kernel] [k] _raw_spin_unlock_irqrestore
 1.30% [xfs] [k] xfs_buf_item_format
 1.27% [kernel] [k] __d_lookup_rcu
 0.97% [xfs] [k] xfs_da_do_buf
 0.96% [xfs] [k] xlog_cil_prepare_log_vecs
 0.93% [kernel] [k] s_show
 0.93% [xfs] [k] xfs_perag_put

time: 1341306651
PerfTop: 1073 irqs/sec kernel:95.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.76% [kernel] [k] memcpy
 4.54% [xfs] [k] _xfs_buf_find
 4.32% [kernel] [k] _raw_spin_lock
 3.15% [unknown] [.] 0x00007f84a7766b99
 2.77% [kernel] [k] kmem_cache_alloc
 2.49% [xfs] [k] xfs_next_bit
 2.36% [kernel] [k] __kmalloc
 2.27% [kernel] [k] kmem_cache_free
 2.20% [xfs] [k] xfs_buf_offset
 1.88% [kernel] [k] memset
 1.88% [kernel] [k] kfree
 1.77% [xfs] [k] xfs_buf_item_size
 1.62% [kernel] [k] _raw_spin_lock_irqsave
 1.48% [kernel] [k] _raw_spin_unlock_irqrestore
 1.42% [xfs] [k] xfs_buf_item_format
 1.40% [kernel] [k] page_fault
 1.39% [kernel] [k] __d_lookup_rcu
 0.99% [xfs] [k] xfs_da_do_buf
 0.96% [xfs] [k] xlog_cil_prepare_log_vecs
 0.88% [kernel] [k] s_show
 0.87% [xfs] [k] xfs_perag_put

time: 1341306651
PerfTop: 492 irqs/sec kernel:85.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.74% [kernel] [k] memcpy
 4.48% [xfs] [k] _xfs_buf_find
 4.27% [kernel] [k] _raw_spin_lock
 3.00% [unknown] [.] 0x00007f84a7766b99
 2.76% [kernel] [k] kmem_cache_alloc
 2.54% [xfs] [k] xfs_next_bit
 2.39% [kernel] [k] __kmalloc
 2.30% [kernel] [k] kmem_cache_free
 2.20% [xfs] [k] xfs_buf_offset
no symbols found in /bin/ps, maybe install a debug package?
 1.96% [kernel] [k] kfree
 1.92% [kernel] [k] memset
 1.75% [xfs] [k] xfs_buf_item_size
 1.56% [kernel] [k] _raw_spin_lock_irqsave
 1.48% [kernel] [k] _raw_spin_unlock_irqrestore
 1.44% [xfs] [k] xfs_buf_item_format
 1.39% [kernel] [k] __d_lookup_rcu
 1.36% [kernel] [k] page_fault
 0.99% [xfs] [k] xlog_cil_prepare_log_vecs
 0.96% [xfs] [k] xfs_da_do_buf
 0.86% [kernel] [k] s_show
 0.85% [xfs] [k] xfs_perag_put

time: 1341306657
PerfTop: 70 irqs/sec kernel:72.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.73% [kernel] [k] memcpy
 4.47% [xfs] [k] _xfs_buf_find
 4.27% [kernel] [k] _raw_spin_lock
 2.99% [unknown] [.] 0x00007f84a7766b99
 2.75% [kernel] [k] kmem_cache_alloc
 2.53% [xfs] [k] xfs_next_bit
 2.39% [kernel] [k] __kmalloc
 2.30% [kernel] [k] kmem_cache_free
 2.20% [xfs] [k] xfs_buf_offset
 1.96% [kernel] [k] kfree
 1.92% [kernel] [k] memset
 1.75% [xfs] [k] xfs_buf_item_size
 1.56% [kernel] [k] _raw_spin_lock_irqsave
 1.49% [kernel] [k] _raw_spin_unlock_irqrestore
 1.43% [xfs] [k] xfs_buf_item_format
 1.38% [kernel] [k] __d_lookup_rcu
 1.37% [kernel] [k] page_fault
 0.98% [xfs] [k] xlog_cil_prepare_log_vecs
 0.96% [xfs] [k] xfs_da_do_buf
 0.89% [kernel] [k] s_show
 0.85% [xfs] [k] xfs_perag_put

time: 1341306657
PerfTop: 87 irqs/sec kernel:71.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.72% [kernel] [k] memcpy
 4.45% [xfs] [k] _xfs_buf_find
 4.25% [kernel] [k] _raw_spin_lock
 2.99% [unknown] [.] 0x00007f84a7766b99
 2.74% [kernel] [k] kmem_cache_alloc
 2.52% [xfs] [k] xfs_next_bit
 2.38% [kernel] [k] __kmalloc
 2.29% [kernel] [k] kmem_cache_free
 2.19% [xfs] [k] xfs_buf_offset
 1.95% [kernel] [k] kfree
 1.91% [kernel] [k] memset
 1.74% [xfs] [k] xfs_buf_item_size
 1.56% [kernel] [k] _raw_spin_lock_irqsave
 1.48% [kernel] [k] _raw_spin_unlock_irqrestore
 1.43% [xfs] [k] xfs_buf_item_format
 1.38% [kernel] [k] page_fault
 1.38% [kernel] [k] __d_lookup_rcu
 0.98% [xfs] [k] xlog_cil_prepare_log_vecs
 0.96% [xfs] [k] xfs_da_do_buf
 0.93% [kernel] [k] s_show
 0.84% [xfs] [k] xfs_perag_put

time: 1341306657
PerfTop: 88 irqs/sec kernel:68.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 5.69% [kernel] [k] memcpy
 4.42% [xfs] [k] _xfs_buf_find
 4.25% [kernel] [k] _raw_spin_lock
 2.98% [unknown] [.] 0x00007f84a7766b99

--
Mel Gorman
SUSE Labs
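Profiles like the per-snapshot perf top output above are easier to compare between kernels if each symbol's share is averaged across snapshots. This is not part of mmtests itself; it is a minimal sketch, with `perf-top.sample` standing in (with made-up round numbers) for a real capture of lines in the "5.13% [kernel] [k] _raw_spin_lock" format shown above:

```shell
# Average each symbol's cycle share across perf top snapshots.
# perf-top.sample is a stand-in for a real capture; the symbol is the
# last field and the percentage is the first.
cat > perf-top.sample <<'EOF'
6.00% [kernel] [k] _raw_spin_lock
2.00% [xfs] [k] _xfs_buf_find
4.00% [kernel] [k] _raw_spin_lock
2.00% [xfs] [k] _xfs_buf_find
EOF
awk '$1 ~ /%$/ && NF >= 4 {
        pct = $1; sub(/%/, "", pct)
        sum[$NF] += pct; n[$NF]++
     }
     END { for (s in sum) printf "%6.2f%% %s\n", sum[s] / n[s], s }' perf-top.sample |
sort -rn
```

Sorting the averages makes the consistently hot symbols (memcpy, _xfs_buf_find, _raw_spin_lock in the capture above) stand out without eyeballing every snapshot.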
Re: [MMTests] IO metadata on XFS
From: Mel Gorman @ 2012-07-03 10:59 UTC
To: Dave Chinner
Cc: Keith Packard, Chris Wilson, Daniel Vetter, linux-kernel, dri-devel, xfs, Christoph Hellwig, linux-mm, linux-fsdevel, Eugeni Dodonov

On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > Adding dri-devel and a few others because an i915 patch contributed to
> > the regression.
> >
> > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > you have lots of CPU power, there will be little difference in
> > > > > performance...
> > > >
> > > > When I checked, it could only be called twice and we'd already
> > > > optimize away the second call. I'd definitely like to track down where
> > > > the performance changes happened, at least to a major version but even
> > > > better to a -rc or git commit.
> > > >
> > >
> > > By all means feel free to run the test yourself and run the bisection :)
> > >
> > > It's rare, but on this occasion the test machine is idle so I started an
> > > automated git bisection. As you know, the mileage with an automated bisect
> > > varies so it may or may not find the right commit. The test machine is sandy, so
> > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > is the report of interest. The script is doing a full search between v3.3 and
> > > v3.4 for a point where the average files/sec for fsmark-single drops below 25000.
> > > I did not limit the search to fs/xfs on the off-chance that it was an
> > > apparently unrelated patch that caused the problem.
> > >
> >
> > It was obvious very quickly that there were two distinct regressions, so I
> > ran two bisections. One led to an XFS patch and the other led to an i915 patch
> > that enables RC6 to reduce power usage.
> >
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
>
> Doesn't seem to be the major cause of the regression. By itself, it
> has impact, but the majority comes from the XFS change...
>

The fact that it has an impact at all is weird, but let's see what the DRI
folks think about it.

> > [c999a223: xfs: introduce an allocation workqueue]
>
> Which indicates that there are workqueue scheduling issues, I think.
> The same amount of work is being done, but half of it is being
> pushed off into a workqueue to avoid stack overflow issues (*). I
> tested the above patch in anger on an 8p machine, similar to the
> machine you saw no regressions on, but the workload didn't drive it
> to being completely CPU bound (only about 90%) so the allocation
> work was probably always scheduled quickly.
>

What test were you using?

> How many worker threads have been spawned on these machines
> that are showing the regression?

20 or 21 generally. An example list as spotted by top looks like

kworker/0:0 kworker/0:1 kworker/0:2
kworker/1:0 kworker/1:1 kworker/1:2
kworker/2:0 kworker/2:1 kworker/2:2
kworker/3:0 kworker/3:1 kworker/3:2
kworker/4:0 kworker/4:1
kworker/5:0 kworker/5:1
kworker/6:0 kworker/6:1 kworker/6:2
kworker/7:0 kworker/7:1

There were 8 unbound workers.

> What is the context switch rate on the machines when the test is running?

This is vmstat from a vanilla kernel. The actual vmstat output is after
the --. The information before that is recorded by mmtests to try and detect
whether there was jitter in the vmstat output. It shows that there is little
or no jitter in this test.
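The average rates quoted after the logs can be recomputed from a capture like this with a one-liner. A minimal sketch, using two stand-in records in the same layout (the three mmtests timestamp fields plus the "--" marker push vmstat's "cs" column out to field 16):

```shell
# Recompute the mean context switch rate from an mmtests vmstat capture.
# vmstat.sample stands in for the real log; after the "timestamp elapsed
# interval --" prefix, the vmstat "cs" column lands in field 16.
cat > vmstat.sample <<'EOF'
1341306582.6713 1.8109 1.8109 -- 0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0
1341306584.6715 3.8112 2.0003 -- 1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0
EOF
awk '$4 == "--" { sum += $16; n++ }
     END { if (n) printf "Average context switch rate: %.2f\n", sum / n }' vmstat.sample
# prints: Average context switch rate: 567.00
```

The same pass over the full VANILLA and REVERT-XFS captures is how averages like those quoted below can be derived.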
VANILLA
1341306582.6713 1.8109 1.8109 -- 0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0
1341306584.6715 3.8112 2.0003 -- 1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0
1341306586.6718 5.8114 2.0003 -- 0 0 0 16047432 11460 104288 0 0 102 45 227 999 0 0 99 1 0
1341306588.6721 7.8117 2.0003 -- 1 0 0 16046944 11460 104292 0 0 0 0 120 663 0 0 100 0 0
1341306590.6723 9.8119 2.0002 -- 0 2 0 16045788 11476 104296 0 0 12 40 190 754 0 0 99 0 0
1341306592.6725 11.8121 2.0002 -- 0 1 0 15990236 12600 141724 0 0 19054 30 1400 2937 2 1 88 9 0
1341306594.6727 13.8124 2.0002 -- 1 0 0 15907628 12600 186360 0 0 1653 0 3117 6406 2 9 88 1 0
1341306596.6730 15.8127 2.0003 -- 0 0 0 15825964 12608 226636 0 0 15 11024 3073 6350 2 9 89 0 0
1341306598.6733 17.8130 2.0003 -- 1 0 0 15730420 12608 271632 0 0 0 3072 3461 7179 2 10 88 0 0
1341306600.6736 19.8132 2.0003 -- 1 0 0 15686200 12608 310816 0 0 0 12416 3093 6198 2 9 89 0 0
1341306602.6738 21.8135 2.0003 -- 2 0 0 15593588 12616 354928 0 0 0 32 3482 7146 2 11 87 0 0
1341306604.6741 23.8138 2.0003 -- 2 0 0 15562032 12616 393772 0 0 0 12288 3129 6330 2 10 89 0 0
1341306606.6744 25.8140 2.0002 -- 1 0 0 15458316 12624 438004 0 0 0 26 3471 7107 2 11 87 0 0
1341306608.6746 27.8142 2.0002 -- 1 0 0 15432024 12624 474244 0 0 0 12416 3011 6017 1 10 89 0 0
1341306610.6749 29.8145 2.0003 -- 2 0 0 15343280 12624 517696 0 0 0 24 3393 6826 2 11 87 0 0
1341306612.6751 31.8148 2.0002 -- 1 0 0 15311136 12632 551816 0 0 0 16502 2818 5653 2 9 88 1 0
1341306614.6754 33.8151 2.0003 -- 1 0 0 15220648 12632 594936 0 0 0 3584 3451 6779 2 11 87 0 0
1341306616.6755 35.8152 2.0001 -- 4 0 0 15221252 12632 649296 0 0 0 38559 4846 8709 2 15 78 6 0
1341306618.6758 37.8155 2.0003 -- 1 0 0 15177724 12640 668476 0 0 20 40679 2204 4067 1 5 89 5 0
1341306620.6761 39.8158 2.0003 -- 1 0 0 15090204 12640 711752 0 0 0 0 3316 6788 2 11 88 0 0
1341306622.6764 41.8160 2.0003 -- 1 0 0 15005356 12640 748532 0 0 0 12288 3073 6132 2 10 89 0 0
1341306624.6766 43.8163 2.0002 -- 2 0 0 14913088 12648 791952 0 0 0 28 3408 6806 2 11 87 0 0
1341306626.6769 45.8166 2.0003 -- 1 0 0 14891512 12648 826328 0 0 0 12420 2906 5710 1 9 90 0 0
1341306628.6772 47.8168 2.0003 -- 1 0 0 14794316 12656 868936 0 0 0 26 3367 6798 2 11 87 0 0
1341306630.6774 49.8171 2.0003 -- 1 0 0 14769188 12656 905016 0 0 30 12324 3029 5876 2 10 89 0 0
1341306632.6777 51.8173 2.0002 -- 1 0 0 14679544 12656 947712 0 0 0 0 3399 6868 2 11 87 0 0
1341306634.6780 53.8176 2.0003 -- 1 0 0 14646156 12664 982032 0 0 0 14658 2987 5761 1 10 89 0 0
1341306636.6782 55.8179 2.0003 -- 1 0 0 14560504 12664 1023816 0 0 0 4404 3454 6876 2 11 87 0 0
1341306638.6783 57.8180 2.0001 -- 2 0 0 14533384 12664 1056812 0 0 0 15810 3002 5581 1 10 89 0 0
1341306640.6785 59.8182 2.0002 -- 1 0 0 14593332 12672 1027392 0 0 0 31790 3504 1811 1 13 78 8 0
1341306642.6787 61.8183 2.0001 -- 1 0 0 14686968 12672 1007604 0 0 0 14621 2434 1248 1 10 89 0 0
1341306644.6789 63.8185 2.0002 -- 1 1 0 15042476 12680 788104 0 0 0 36564 2809 1484 1 12 86 1 0
1341306646.6790 65.8187 2.0002 -- 1 0 0 15128292 12680 757948 0 0 0 26395 3050 1313 1 13 86 1 0
1341306648.6792 67.8189 2.0002 -- 1 0 0 15160036 12680 727964 0 0 0 5463 2752 910 1 12 87 0 0
1341306650.6795 69.8192 2.0003 -- 0 0 0 15633256 12688 332572 0 0 1156 12308 2117 2346 1 7 91 1 0
1341306652.6797 71.8194 2.0002 -- 0 0 0 15633892 12688 332652 0 0 0 0 224 758 0 0 100 0 0
1341306654.6800 73.8197 2.0003 -- 0 0 0 15633900 12688 332524 0 0 0 0 231 1009 0 0 100 0 0
1341306656.6803 75.8199 2.0003 -- 0 0 0 15637436 12696 332504 0 0 0 38 266 713 0 0 99 0 0
1341306658.6805 77.8202 2.0003 -- 0 0 0 15654180 12696 332352 0 0 0 0 270 821 0 0 100 0 0

REVERT-XFS
1341307733.8702 1.7941 1.7941 -- 0 0 0 16050640 12036 103996 0 0 372 0 216 752 0 0 99 1 0
1341307735.8704 3.7944 2.0002 -- 0 0 0 16050864 12036 104028 0 0 0 0 132 857 0 0 100 0 0
1341307737.8707 5.7946 2.0002 -- 0 0 0 16047492 12048 104252 0 0 102 37 255 938 0 0 99 1 0
1341307739.8709 7.7949 2.0003 -- 0 0 0 16047600 12072 104324 0 0 32 2 129 658 0 0 100 0 0
1341307741.8712 9.7951 2.0002 -- 1 1 0 16046676 12080 104328 0 0 0 32 165 729 0 0 100 0 0
1341307743.8714 11.7954 2.0003 -- 0 1 0 15990840 13216 142612 0 0 19422 30 1467 3015 2 1 89 8 0
1341307745.8717 13.7956 2.0002 -- 0 0 0 15825496 13216 226396 0 0 1310 11214 2217 1348 2 8 89 1 0
1341307747.8717 15.7957 2.0001 -- 1 0 0 15677816 13224 314672 0 0 4 15294 2307 1173 2 9 89 0 0
1341307749.8719 17.7959 2.0002 -- 1 0 0 15524372 13224 409728 0 0 0 12288 2466 888 1 10 89 0 0
1341307751.8721 19.7960 2.0002 -- 1 0 0 15368424 13224 502552 0 0 0 12416 2312 878 1 10 89 0 0
1341307753.8722 21.7962 2.0002 -- 1 0 0 15225216 13232 593092 0 0 0 12448 2539 1380 1 10 88 0 0
1341307755.8724 23.7963 2.0002 -- 2 0 0 15163712 13232 664768 0 0 0 32160 2184 1177 1 8 90 0 0
1341307757.8727 25.7967 2.0003 -- 1 0 0 14973888 13240 755080 0 0 0 12316 2482 1219 1 10 89 0 0
1341307759.8728 27.7968 2.0001 -- 1 0 0 14883580 13240 840036 0 0 0 44471 2711 1234 2 10 88 0 0
1341307761.8730 29.7970 2.0002 -- 1 0 0 14800304 13240 920504 0 0 0 42554 2571 1050 1 10 89 0 0
1341307763.8734 31.7973 2.0003 -- 0 0 0 14642504 13248 995004 0 0 0 3232 2276 1081 1 8 90 0 0
1341307765.8737 33.7976 2.0003 -- 1 0 0 14545072 13248 1052536 0 0 0 18688 2628 1114 1 9 89 0 0
1341307767.8739 35.7979 2.0003 -- 1 0 0 14783848 13248 926824 0 0 0 59559 2409 1308 0 10 89 1 0
1341307769.8740 37.7980 2.0001 -- 2 0 0 14854800 13256 896832 0 0 0 9172 2419 1004 1 10 89 1 0
1341307771.8742 39.7981 2.0002 -- 2 0 0 14835084 13256 875612 0 0 0 12288 2524 812 0 11 89 0 0
1341307773.8743 41.7983 2.0002 -- 2 0 0 15126252 13256 745844 0 0 0 10297 2714 1163 1 12 88 0 0
1341307775.8745 43.7985 2.0002 -- 1 0 0 15108800 13264 724544 0 0 0 12316 2499 931 1 11 88 0 0
1341307777.8746 45.7986 2.0001 -- 2 0 0 15226236 13264 694580 0 0 0 12416 2700 1194 1 12 88 0 0
1341307779.8750 47.7989 2.0003 -- 1 0 0 15697632 13264 300716 0 0 1156 0 934 1701 0 2 96 1 0
1341307781.8752 49.7992 2.0003 -- 0 0 0 15697508 13272 300720 0 0 0 66 166 641 0 0 100 0 0
1341307783.8755 51.7995 2.0003 -- 0 0 0 15699008 13272 300524 0 0 0 0 248 865 0 0 100 0 0
1341307785.8758 53.7997 2.0003 -- 0 0 0 15702452 13272 300520 0 0 0 0 285 960 0 0 99 0 0
1341307787.8760 55.7999 2.0002 -- 0 0 0 15719404 13280 300436 0 0 0 26 136 590 0 0 99 0 0

Vanilla average context switch rate: 4278.53
Revert average context switch rate:  1095

> Can you run latencytop to see
> if there is excessive starvation/wait times for allocation
> completion?

I'm not sure what format you are looking for. latencytop is a poor fit for
capturing information throughout a test and it does not easily allow a
snapshot of a test to be recorded. All of the console output can be recorded
of course, but that's a complete mess. I tried capturing /proc/latency_stats
over time instead because that can be trivially sorted on a system-wide
basis, but as I write this I find that latency_stats was bust. It was just
spitting out

Latency Top version : v0.1

and nothing else. Either latency_stats is broken or my config is. I'm not
sure which it is right now and won't get enough time on this today to
pinpoint it.

> A perf top profile comparison might be informative,
> too...

I'm not sure if this is what you really wanted. I thought an oprofile or
perf report would have made more sense, but I recorded perf top over time
anyway and it's at the end of the mail. The timestamp information is poor
because the perf top output was buffered, so a bunch of updates would arrive
at once. Each sample should be roughly 2 seconds apart. This buffering can
be dealt with; I just failed to do it in advance and I do not think it's
necessary to rerun the tests for it.

> (*) The stack usage below submit_bio() can be more than 5k (DM, MD,
> SCSI, driver, memory allocation), so it's really not safe to do
> allocation anywhere below about 3k of kernel stack being used. e.g.
> on a relatively trivial storage setup without the above commit:
>
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
>
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.
>

I understand the patch's motivation. For these tests I'm deliberately being
a bit of a dummy and just capturing information. This might allow me to
actually get through all the results, identify some of the problems and
spread them around a bit. Either that or I need to clone myself a few times
to tackle each of the problems in a reasonable timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to

http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results with no monitor can be found somewhere like

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are in run-monitor. The iostat logs, for
example, can be read from

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.

This is perf top over time for the vanilla kernel

time: 1341306570
time: 1341306579
PerfTop: 1 irqs/sec kernel: 0.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
61.85% [kernel] [k] __rmqueue
38.15% libc-2.11.3.so [.] _IO_vfscanf

time: 1341306579
PerfTop: 3 irqs/sec kernel:66.7% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
19.88% [kernel] [k] _raw_spin_lock_irqsave
17.14% [kernel] [k] __rmqueue
16.96% [kernel] [k] format_decode
15.37% libc-2.11.3.so [.] __tzfile_compute
13.55% [kernel] [k] copy_user_generic_string
10.57% libc-2.11.3.so [.] _IO_vfscanf
 6.53% [kernel] [k] find_first_bit

time: 1341306579
PerfTop: 0 irqs/sec kernel:-nan% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
17.51% [kernel] [k] _raw_spin_lock_irqsave
15.10% [kernel] [k] __rmqueue
14.94% [kernel] [k] format_decode
13.54% libc-2.11.3.so [.] __tzfile_compute
11.94% [kernel] [k] copy_user_generic_string
11.90% [kernel] [k] _raw_spin_lock
 9.31% libc-2.11.3.so [.] _IO_vfscanf
 5.75% [kernel] [k] find_first_bit

time: 1341306579
PerfTop: 41 irqs/sec kernel:58.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
13.62% [kernel] [k] _raw_spin_lock_irqsave
11.02% [kernel] [k] __rmqueue
10.91% [kernel] [k] format_decode
 9.89% libc-2.11.3.so [.] __tzfile_compute
 8.72% [kernel] [k] copy_user_generic_string
 8.69% [kernel] [k] _raw_spin_lock
 7.15% libc-2.11.3.so [.] _IO_vfscanf
 4.20% [kernel] [k] find_first_bit
 1.47% libc-2.11.3.so [.] __strcmp_sse42
 1.37% libc-2.11.3.so [.] __strchr_sse42
 1.19% sed [.] 0x0000000000009f7d
 0.90% libc-2.11.3.so [.] vfprintf
 0.84% [kernel] [k] hrtimer_interrupt
 0.84% libc-2.11.3.so [.] re_string_realloc_buffers
 0.76% [kernel] [k] enqueue_entity
 0.66% [kernel] [k] __switch_to
 0.65% libc-2.11.3.so [.] _IO_default_xsputn
 0.62% [kernel] [k] do_vfs_ioctl
 0.59% [kernel] [k] perf_event_mmap_event
 0.56% gzip [.] 0x0000000000007b96
 0.55% libc-2.11.3.so [.] bsearch

time: 1341306579
PerfTop: 35 irqs/sec kernel:62.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
11.50% [kernel] [k] _raw_spin_lock_irqsave
 9.22% [kernel] [k] __rmqueue
 9.13% [kernel] [k] format_decode
 8.27% libc-2.11.3.so [.] __tzfile_compute
 7.92% [kernel] [k] copy_user_generic_string
 7.74% [kernel] [k] _raw_spin_lock
 6.21% libc-2.11.3.so [.] _IO_vfscanf
 3.51% [kernel] [k] find_first_bit
 1.44% gzip [.] 0x0000000000007b96
 1.23% libc-2.11.3.so [.] __strcmp_sse42
 1.15% libc-2.11.3.so [.] __strchr_sse42
 1.06% libc-2.11.3.so [.] vfprintf
 0.99% sed [.] 0x0000000000009f7d
 0.92% [unknown] [.] 0x00007f84a7766b99
 0.70% [kernel] [k] hrtimer_interrupt
 0.70% libc-2.11.3.so [.] re_string_realloc_buffers
 0.64% [kernel] [k] enqueue_entity
 0.58% libtcl8.5.so [.] 0x000000000006fe86
 0.55% [kernel] [k] __switch_to
 0.54% libc-2.11.3.so [.] _IO_default_xsputn
 0.53% [kernel] [k] __d_lookup_rcu

time: 1341306585
PerfTop: 100 irqs/sec kernel:59.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 8.61% [kernel] [k] _raw_spin_lock_irqsave
 5.92% [kernel] [k] __rmqueue
 5.86% [kernel] [k] format_decode
 5.31% libc-2.11.3.so [.] __tzfile_compute
 5.30% [kernel] [k] copy_user_generic_string
 5.27% [kernel] [k] _raw_spin_lock
 3.99% libc-2.11.3.so [.] _IO_vfscanf
 2.45% [unknown] [.] 0x00007f84a7766b99
 2.26% [kernel] [k] find_first_bit
 1.68% [kernel] [k] page_fault
 1.45% libc-2.11.3.so [.] _int_malloc
 1.28% gzip [.] 0x0000000000007b96
 1.13% libc-2.11.3.so [.] vfprintf
 1.06% libc-2.11.3.so [.] __strchr_sse42
 1.02% perl [.] 0x0000000000044505
 0.79% libc-2.11.3.so [.] __strcmp_sse42
 0.79% [kernel] [k] do_task_stat
 0.77% [kernel] [k] zap_pte_range
 0.72% libc-2.11.3.so [.] __gconv_transform_utf8_internal
 0.70% libc-2.11.3.so [.] malloc
 0.70% libc-2.11.3.so [.] __mbrtowc

time: 1341306585
PerfTop: 19 irqs/sec kernel:78.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 7.97% [kernel] [k] _raw_spin_lock_irqsave
 5.48% [kernel] [k] __rmqueue
 5.43% [kernel] [k] format_decode
 5.24% [kernel] [k] copy_user_generic_string
 5.18% [kernel] [k] _raw_spin_lock
 4.92% libc-2.11.3.so [.] __tzfile_compute
 4.25% libc-2.11.3.so [.] _IO_vfscanf
 2.33% [unknown] [.] 0x00007f84a7766b99
 2.12% [kernel] [k] page_fault
 2.09% [kernel] [k] find_first_bit
 1.34% libc-2.11.3.so [.] _int_malloc
 1.19% gzip [.] 0x0000000000007b96
 1.05% libc-2.11.3.so [.] vfprintf
 0.98% libc-2.11.3.so [.] __strchr_sse42
 0.94% perl [.] 0x0000000000044505
 0.94% libc-2.11.3.so [.] _dl_addr
 0.91% [kernel] [k] zap_pte_range
 0.74% [kernel] [k] s_show
 0.73% libc-2.11.3.so [.] __strcmp_sse42
 0.73% [kernel] [k] do_task_stat
 0.67% libc-2.11.3.so [.] __gconv_transform_utf8_internal

time: 1341306585
PerfTop: 38 irqs/sec kernel:68.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 7.64% [kernel] [k] _raw_spin_lock_irqsave
 4.89% [kernel] [k] _raw_spin_lock
 4.77% [kernel] [k] __rmqueue
 4.72% [kernel] [k] format_decode
 4.56% [kernel] [k] copy_user_generic_string
 4.53% libc-2.11.3.so [.] _IO_vfscanf
 4.28% libc-2.11.3.so [.] __tzfile_compute
 2.52% [unknown] [.] 0x00007f84a7766b99
no symbols found in /bin/sort, maybe install a debug package?
 2.10% [kernel] [k] page_fault
 1.82% [kernel] [k] find_first_bit
 1.31% libc-2.11.3.so [.] _int_malloc
 1.14% libc-2.11.3.so [.] vfprintf
 1.08% libc-2.11.3.so [.] _dl_addr
 1.07% [kernel] [k] s_show
 1.05% libc-2.11.3.so [.] __strchr_sse42
 1.03% gzip [.] 0x0000000000007b96
 0.82% [kernel] [k] do_task_stat
 0.82% perl [.] 0x0000000000044505
 0.79% [kernel] [k] zap_pte_range
 0.70% [kernel] [k] seq_put_decimal_ull
 0.69% [kernel] [k] find_busiest_group

time: 1341306591
PerfTop: 66 irqs/sec kernel:59.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 6.52% [kernel] [k] _raw_spin_lock_irqsave
 4.11% libc-2.11.3.so [.] _IO_vfscanf
 3.91% [kernel] [k] _raw_spin_lock
 3.50% [kernel] [k] copy_user_generic_string
 3.41% [kernel] [k] __rmqueue
 3.38% [kernel] [k] format_decode
 3.06% libc-2.11.3.so [.] __tzfile_compute
 2.90% [unknown] [.] 0x00007f84a7766b99
 2.30% [kernel] [k] page_fault
 2.20% perl [.] 0x0000000000044505
 1.83% libc-2.11.3.so [.] vfprintf
 1.61% libc-2.11.3.so [.] _int_malloc
 1.30% [kernel] [k] find_first_bit
 1.22% libc-2.11.3.so [.] _dl_addr
 1.19% libc-2.11.3.so [.] __gconv_transform_utf8_internal
 1.10% libc-2.11.3.so [.] __strchr_sse42
 1.01% [kernel] [k] zap_pte_range
 0.99% [kernel] [k] s_show
 0.98% [kernel] [k] __percpu_counter_add
 0.86% [kernel] [k] __strnlen_user
 0.75% ld-2.11.3.so [.] do_lookup_x

time: 1341306591
PerfTop: 39 irqs/sec kernel:69.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 6.26% [kernel] [k] _raw_spin_lock_irqsave
 4.05% [kernel] [k] _raw_spin_lock
 3.86% libc-2.11.3.so [.] _IO_vfscanf
 3.21% [kernel] [k] copy_user_generic_string
 3.03% [kernel] [k] __rmqueue
 3.00% [kernel] [k] format_decode
 2.93% [unknown] [.] 0x00007f84a7766b99
 2.72% libc-2.11.3.so [.] __tzfile_compute
 2.20% [kernel] [k] page_fault
 1.96% perl [.] 0x0000000000044505
 1.77% libc-2.11.3.so [.] vfprintf
 1.43% libc-2.11.3.so [.] _int_malloc
 1.16% [kernel] [k] find_first_bit
 1.09% libc-2.11.3.so [.] _dl_addr
 1.06% libc-2.11.3.so [.] __gconv_transform_utf8_internal
 1.02% [kernel] [k] s_show
 0.98% libc-2.11.3.so [.] __strchr_sse42
 0.93% gzip [.] 0x0000000000007b96
 0.90% [kernel] [k] zap_pte_range
 0.87% [kernel] [k] __percpu_counter_add
 0.76% [kernel] [k] __strnlen_user

time: 1341306591
PerfTop: 185 irqs/sec kernel:70.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 4.81% [kernel] [k] _raw_spin_lock_irqsave
 3.60% [unknown] [.] 0x00007f84a7766b99
 3.10% [kernel] [k] _raw_spin_lock
 3.04% [kernel] [k] page_fault
 2.66% libc-2.11.3.so [.] _IO_vfscanf
 2.14% [kernel] [k] copy_user_generic_string
 2.11% [kernel] [k] format_decode
 1.96% [kernel] [k] __rmqueue
 1.86% libc-2.11.3.so [.] _dl_addr
 1.76% libc-2.11.3.so [.] __tzfile_compute
 1.26% perl [.] 0x0000000000044505
 1.19% libc-2.11.3.so [.] __mbrtowc
 1.14% libc-2.11.3.so [.] vfprintf
 1.12% libc-2.11.3.so [.] _int_malloc
 1.09% gzip [.] 0x0000000000007b96
 0.95% libc-2.11.3.so [.] __gconv_transform_utf8_internal
 0.88% [kernel] [k] _raw_spin_unlock_irqrestore
 0.87% [kernel] [k] __strnlen_user
 0.82% [kernel] [k] clear_page_c
 0.77% [kernel] [k] __schedule
 0.76% [kernel] [k] find_get_page

time: 1341306595
PerfTop: 385 irqs/sec kernel:48.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
27.20% cc1 [.] 0x0000000000210978
 3.01% [unknown] [.] 0x00007f84a7766b99
 2.18% [kernel] [k] page_fault
 1.96% libbfd-2.21.so [.] 0x00000000000b9cdd
 1.95% [kernel] [k] _raw_spin_lock_irqsave
 1.91% ld.bfd [.] 0x000000000000e3b9
 1.85% [kernel] [k] _raw_spin_lock
 1.31% [kernel] [k] copy_user_generic_string
 1.20% libbfd-2.21.so [.] bfd_hash_lookup
 1.10% libc-2.11.3.so [.] __strcmp_sse42
 0.93% libc-2.11.3.so [.] _IO_vfscanf
 0.85% [kernel] [k] _raw_spin_unlock_irqrestore
 0.82% libc-2.11.3.so [.] _int_malloc
 0.80% [kernel] [k] __rmqueue
 0.79% [kernel] [k] kmem_cache_alloc
 0.74% [kernel] [k] format_decode
 0.71% libc-2.11.3.so [.] _dl_addr
 0.62% libbfd-2.21.so [.] _bfd_final_link_relocate
 0.61% libc-2.11.3.so [.] __tzfile_compute
 0.61% libc-2.11.3.so [.] vfprintf
 0.59% [kernel] [k] find_busiest_group

time: 1341306595
PerfTop: 1451 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
 9.75% cc1 [.] 0x0000000000210978
 8.81% [unknown] [.] 0x00007f84a7766b99
 4.62% [kernel] [k] page_fault
 3.61% [kernel] [k] _raw_spin_lock
 2.67% [kernel] [k] memcpy
 2.03% [kernel] [k] _raw_spin_lock_irqsave
 2.00% [kernel] [k] kmem_cache_alloc
 1.64% [xfs] [k] _xfs_buf_find
 1.31% [kernel] [k] _raw_spin_unlock_irqrestore
 1.16% [kernel] [k] kmem_cache_free
 1.15% [xfs] [k] xfs_next_bit
 0.98% [kernel] [k] __d_lookup
 0.89% [xfs] [k] xfs_da_do_buf
 0.83% [kernel] [k] memset
 0.80% [xfs] [k] xfs_dir2_node_addname_int
 0.79% [kernel] [k] link_path_walk
 0.76% [xfs] [k] xfs_buf_item_size
no symbols found in /usr/bin/tee, maybe install a debug package?
no symbols found in /bin/date, maybe install a debug package?
 0.73% [xfs] [k] xfs_buf_offset
 0.71% [kernel] [k] __kmalloc
 0.70% [kernel] [k] kfree
 0.70% libbfd-2.21.so [.] 0x00000000000b9cdd

time: 1341306601
PerfTop: 1267 irqs/sec kernel:85.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
10.81% [unknown] [.] 0x00007f84a7766b99
 5.98% cc1 [.] 0x0000000000210978
 5.20% [kernel] [k] page_fault
 3.54% [kernel] [k] _raw_spin_lock
 3.37% [kernel] [k] memcpy
 2.03% [kernel] [k] kmem_cache_alloc
 1.91% [kernel] [k] _raw_spin_lock_irqsave
 1.75% [xfs] [k] _xfs_buf_find
 1.35% [kernel] [k] _raw_spin_unlock_irqrestore
 1.28% [xfs] [k] xfs_next_bit
 1.14% [kernel] [k] kmem_cache_free
 1.13% [kernel] [k] __kmalloc
 1.12% [kernel] [k] __d_lookup
 0.97% [xfs] [k] xfs_dir2_node_addname_int
 0.96% [xfs] [k] xfs_buf_offset
 0.95% [kernel] [k] memset
 0.91% [kernel] [k] link_path_walk
 0.88% [xfs] [k] xfs_da_do_buf
 0.85% [kernel] [k] kfree
 0.84% [xfs] [k] xfs_buf_item_size
 0.74% [xfs] [k] xfs_btree_lookup

time: 1341306601
PerfTop: 1487 irqs/sec kernel:85.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
11.84% [unknown] [.] 0x00007f84a7766b99
 5.15% [kernel] [k] page_fault
 3.93% cc1 [.] 0x0000000000210978
 3.76% [kernel] [k] _raw_spin_lock
 3.50% [kernel] [k] memcpy
 2.13% [kernel] [k] kmem_cache_alloc
 1.91% [xfs] [k] _xfs_buf_find
 1.79% [kernel] [k] _raw_spin_lock_irqsave
 1.52% [kernel] [k] __kmalloc
 1.33% [kernel] [k] kmem_cache_free
 1.32% [kernel] [k] _raw_spin_unlock_irqrestore
 1.29% [kernel] [k] __d_lookup
 1.27% [xfs] [k] xfs_next_bit
 1.11% [kernel] [k] link_path_walk
 1.01% [xfs] [k] xfs_buf_offset
 1.00% [xfs] [k] xfs_dir2_node_addname_int
 0.98% [xfs] [k] xfs_da_do_buf
 0.97% [kernel] [k] kfree
 0.96% [kernel] [k] memset
 0.84% [xfs] [k] xfs_btree_lookup
 0.82% [xfs] [k] xfs_buf_item_format

time: 1341306601
PerfTop: 1291 irqs/sec kernel:85.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.21% [unknown] [.] 0x00007f84a7766b99
 5.18% [kernel] [k] page_fault
 3.83% [kernel] [k] _raw_spin_lock
 3.67% [kernel] [k] memcpy
 2.92% cc1 [.] 0x0000000000210978
 2.28% [kernel] [k] kmem_cache_alloc
 2.18% [xfs] [k] _xfs_buf_find
 1.66% [kernel] [k] _raw_spin_lock_irqsave
 1.56% [kernel] [k] __kmalloc
 1.43% [kernel] [k] __d_lookup
 1.43% [kernel] [k] _raw_spin_unlock_irqrestore
 1.40% [kernel] [k] kmem_cache_free
 1.29% [xfs] [k] xfs_next_bit
 1.13% [xfs] [k] xfs_buf_offset
 1.07% [kernel] [k] link_path_walk
 1.04% [xfs] [k] xfs_da_do_buf
 1.01% [kernel] [k] memset
 1.01% [kernel] [k] kfree
 1.00% [xfs] [k] xfs_dir2_node_addname_int
 0.87% [xfs] [k] xfs_buf_item_size
 0.84% [xfs] [k] xfs_btree_lookup

time: 1341306607
PerfTop: 1435 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.06% [unknown] [.] 0x00007f84a7766b99
 5.40% [kernel] [k] page_fault
 3.88% [kernel] [k] _raw_spin_lock
 3.83% [kernel] [k] memcpy
 2.41% [xfs] [k] _xfs_buf_find
 2.35% [kernel] [k] kmem_cache_alloc
 2.19% cc1 [.] 0x0000000000210978
 1.68% [kernel] [k] _raw_spin_lock_irqsave
 1.55% [kernel] [k] __kmalloc
 1.48% [kernel] [k] __d_lookup
 1.43% [kernel] [k] kmem_cache_free
 1.42% [kernel] [k] _raw_spin_unlock_irqrestore
 1.37% [xfs] [k] xfs_next_bit
 1.27% [xfs] [k] xfs_buf_offset
 1.12% [kernel] [k] link_path_walk
 1.09% [kernel] [k] kfree
 1.08% [kernel] [k] memset
 1.04% [xfs] [k] xfs_da_do_buf
 0.99% [xfs] [k] xfs_dir2_node_addname_int
 0.92% [xfs] [k] xfs_buf_item_size
 0.89% [xfs] [k] xfs_btree_lookup

time: 1341306607
PerfTop: 1281 irqs/sec kernel:87.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.00% [unknown] [.] 0x00007f84a7766b99
 5.44% [kernel] [k] page_fault
 4.04% [kernel] [k] _raw_spin_lock
 3.94% [kernel] [k] memcpy
 2.51% [xfs] [k] _xfs_buf_find
 2.32% [kernel] [k] kmem_cache_alloc
 1.75% cc1 [.] 0x0000000000210978
 1.66% [kernel] [k] _raw_spin_lock_irqsave
 1.58% [kernel] [k] __d_lookup
 1.56% [kernel] [k] __kmalloc
 1.46% [xfs] [k] xfs_next_bit
 1.44% [kernel] [k] kmem_cache_free
 1.41% [kernel] [k] _raw_spin_unlock_irqrestore
 1.34% [xfs] [k] xfs_buf_offset
 1.20% [kernel] [k] link_path_walk
 1.16% [kernel] [k] kfree
 1.11% [kernel] [k] memset
 1.04% [xfs] [k] xfs_dir2_node_addname_int
 0.94% [xfs] [k] xfs_da_do_buf
 0.92% [xfs] [k] xfs_btree_lookup
 0.89% [xfs] [k] xfs_buf_item_size

time: 1341306607
PerfTop: 1455 irqs/sec kernel:86.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.14% [unknown] [.] 0x00007f84a7766b99
 5.36% [kernel] [k] page_fault
 4.12% [kernel] [k] _raw_spin_lock
 4.02% [kernel] [k] memcpy
 2.54% [xfs] [k] _xfs_buf_find
 2.41% [kernel] [k] kmem_cache_alloc
 1.69% [kernel] [k] _raw_spin_lock_irqsave
 1.56% [kernel] [k] __kmalloc
 1.49% [xfs] [k] xfs_next_bit
 1.47% [kernel] [k] __d_lookup
 1.42% [kernel] [k] _raw_spin_unlock_irqrestore
 1.39% [xfs] [k] xfs_buf_offset
 1.39% cc1 [.] 0x0000000000210978
 1.37% [kernel] [k] kmem_cache_free
 1.24% [kernel] [k] link_path_walk
 1.17% [kernel] [k] memset
 1.16% [kernel] [k] kfree
 1.07% [xfs] [k] xfs_dir2_node_addname_int
 0.99% [xfs] [k] xfs_buf_item_size
 0.92% [xfs] [k] xfs_da_do_buf
 0.91% [xfs] [k] xfs_btree_lookup

time: 1341306613
PerfTop: 1245 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.05% [unknown] [.] 0x00007f84a7766b99
 5.40% [kernel] [k] page_fault
 4.10% [kernel] [k] _raw_spin_lock
 4.06% [kernel] [k] memcpy
 2.74% [xfs] [k] _xfs_buf_find
 2.40% [kernel] [k] kmem_cache_alloc
 1.64% [kernel] [k] _raw_spin_lock_irqsave
 1.58% [xfs] [k] xfs_next_bit
 1.54% [kernel] [k] __kmalloc
 1.49% [kernel] [k] __d_lookup
 1.45% [kernel] [k] _raw_spin_unlock_irqrestore
 1.41% [kernel] [k] kmem_cache_free
 1.35% [xfs] [k] xfs_buf_offset
 1.25% [kernel] [k] link_path_walk
 1.22% [kernel] [k] kfree
 1.16% [kernel] [k] memset
 1.15% cc1 [.] 0x0000000000210978
 1.02% [xfs] [k] xfs_buf_item_size
 1.00% [xfs] [k] xfs_dir2_node_addname_int
 0.92% [xfs] [k] xfs_btree_lookup
 0.91% [xfs] [k] xfs_da_do_buf

time: 1341306613
PerfTop: 1433 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
12.04% [unknown] [.] 0x00007f84a7766b99
 5.30% [kernel] [k] page_fault
 4.08% [kernel] [k] memcpy
 4.07% [kernel] [k] _raw_spin_lock
 2.88% [xfs] [k] _xfs_buf_find
 2.50% [kernel] [k] kmem_cache_alloc
 1.72% [kernel] [k] _raw_spin_lock_irqsave
 1.68% [xfs] [k] xfs_next_bit
 1.56% [kernel] [k] __d_lookup
 1.54% [kernel] [k] __kmalloc
 1.48% [kernel] [k] _raw_spin_unlock_irqrestore
 1.46% [kernel] [k] kmem_cache_free
 1.40% [xfs] [k] xfs_buf_offset
 1.25% [kernel] [k] link_path_walk
 1.21% [kernel] [k] memset
 1.18% [kernel] [k] kfree
 1.04% [xfs] [k] xfs_buf_item_size
 1.02% [xfs] [k] xfs_dir2_node_addname_int
 0.95% [xfs] [k] xfs_btree_lookup
 0.94% cc1 [.]
0x0000000000210978 0.90% [xfs] [k] xfs_da_do_buf time: 1341306613 PerfTop: 1118 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- no symbols found in /usr/bin/vmstat, maybe install a debug package? 12.03% [unknown] [.] 0x00007f84a7766b99 5.48% [kernel] [k] page_fault 4.21% [kernel] [k] memcpy 4.11% [kernel] [k] _raw_spin_lock 2.98% [xfs] [k] _xfs_buf_find 2.47% [kernel] [k] kmem_cache_alloc 1.81% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.48% [kernel] [k] kmem_cache_free 1.48% [kernel] [k] __d_lookup 1.47% [kernel] [k] _raw_spin_unlock_irqrestore 1.39% [xfs] [k] xfs_buf_offset 1.23% [kernel] [k] link_path_walk 1.19% [kernel] [k] memset 1.19% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.98% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_buf_item_format 0.91% [xfs] [k] xfs_da_do_buf time: 1341306617 PerfTop: 1454 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.93% [unknown] [.] 
0x00007f84a7766b99 5.42% [kernel] [k] page_fault 4.28% [kernel] [k] memcpy 4.20% [kernel] [k] _raw_spin_lock 3.15% [xfs] [k] _xfs_buf_find 2.52% [kernel] [k] kmem_cache_alloc 1.76% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.51% [kernel] [k] __kmalloc 1.49% [kernel] [k] _raw_spin_unlock_irqrestore 1.48% [kernel] [k] kmem_cache_free 1.40% [xfs] [k] xfs_buf_offset 1.29% [kernel] [k] memset 1.20% [kernel] [k] link_path_walk 1.17% [kernel] [k] kfree 1.09% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.95% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf 0.91% [xfs] [k] xfs_buf_item_format time: 1341306617 PerfTop: 1758 irqs/sec kernel:90.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.99% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.04% [kernel] [k] memcpy 3.86% [xfs] [k] _xfs_buf_find 2.31% [kernel] [k] kmem_cache_alloc 2.03% [kernel] [k] _raw_spin_lock_irqsave 1.67% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [kernel] [k] __d_lookup 1.60% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.36% [xfs] [k] xfs_buf_offset 1.35% [kernel] [k] kmem_cache_free 1.17% [kernel] [k] kfree 1.16% [kernel] [k] memset 1.08% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.97% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_perag_put 0.90% [xfs] [k] xfs_buf_item_size 0.84% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1022 irqs/sec kernel:88.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.93% [unknown] [.] 
0x00007f84a7766b99 5.34% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.01% [kernel] [k] memcpy 4.01% [xfs] [k] _xfs_buf_find 2.28% [kernel] [k] kmem_cache_alloc 2.00% [kernel] [k] _raw_spin_lock_irqsave 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.41% [kernel] [k] __kmalloc 1.39% [kernel] [k] kmem_cache_free 1.35% [xfs] [k] xfs_buf_offset 1.15% [kernel] [k] kfree 1.15% [kernel] [k] memset 1.09% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_perag_put 0.88% [xfs] [k] xfs_buf_item_size 0.86% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1430 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.05% [unknown] [.] 0x00007f84a7766b99 5.24% [kernel] [k] _raw_spin_lock 4.89% [kernel] [k] page_fault 4.13% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.35% [kernel] [k] kmem_cache_alloc 1.95% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.54% [xfs] [k] xfs_next_bit 1.42% [kernel] [k] __kmalloc 1.41% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.10% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_buf_item_size 0.87% [xfs] [k] xfs_da_do_buf 0.87% [xfs] [k] xfs_perag_put time: 1341306623 PerfTop: 1267 irqs/sec kernel:87.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
0x00007f84a7766b99 5.08% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.12% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.41% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.61% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.31% [xfs] [k] xfs_buf_offset no symbols found in /usr/bin/iostat, maybe install a debug package? 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_btree_lookup 0.95% [xfs] [k] xfs_buf_item_size 0.90% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_perag_put time: 1341306629 PerfTop: 1399 irqs/sec kernel:88.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.12% [unknown] [.] 0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.23% [kernel] [k] memcpy 4.03% [xfs] [k] _xfs_buf_find 2.37% [kernel] [k] kmem_cache_alloc 1.96% [kernel] [k] _raw_spin_lock_irqsave 1.69% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.45% [kernel] [k] __kmalloc 1.35% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.17% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.07% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_btree_lookup 1.02% [xfs] [k] xfs_buf_item_size 0.93% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_buf_item_format time: 1341306629 PerfTop: 1225 irqs/sec kernel:87.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.15% [unknown] [.] 
0x00007f84a7766b99 5.02% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.22% [kernel] [k] memcpy 4.19% [xfs] [k] _xfs_buf_find 2.32% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.71% [kernel] [k] __d_lookup 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.54% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.36% [kernel] [k] kmem_cache_free 1.28% [xfs] [k] xfs_buf_offset 1.14% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.99% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_da_do_buf 0.86% [kernel] [k] s_show time: 1341306629 PerfTop: 1400 irqs/sec kernel:87.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.23% [unknown] [.] 0x00007f84a7766b99 5.07% [kernel] [k] _raw_spin_lock 4.87% [kernel] [k] page_fault 4.27% [xfs] [k] _xfs_buf_find 4.18% [kernel] [k] memcpy 2.31% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.73% [kernel] [k] __d_lookup 1.66% [kernel] [k] _raw_spin_unlock_irqrestore 1.49% [xfs] [k] xfs_next_bit 1.49% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.29% [xfs] [k] xfs_buf_offset 1.11% [kernel] [k] kfree 1.07% [kernel] [k] memset 1.07% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.97% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_da_do_buf 0.89% [kernel] [k] s_show time: 1341306635 PerfTop: 1251 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
0x00007f84a7766b99 5.10% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.29% [xfs] [k] _xfs_buf_find 4.19% [kernel] [k] memcpy 2.26% [kernel] [k] kmem_cache_alloc 1.87% [kernel] [k] _raw_spin_lock_irqsave 1.83% [kernel] [k] __d_lookup 1.64% [kernel] [k] _raw_spin_unlock_irqrestore 1.53% [kernel] [k] __kmalloc 1.49% [xfs] [k] xfs_next_bit 1.41% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.10% [kernel] [k] link_path_walk 1.09% [kernel] [k] memset 1.08% [kernel] [k] kfree 1.03% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.98% [xfs] [k] xfs_btree_lookup 0.96% [kernel] [k] s_show 0.93% [xfs] [k] xfs_da_do_buf time: 1341306635 PerfTop: 1429 irqs/sec kernel:88.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.18% [unknown] [.] 0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.28% [xfs] [k] _xfs_buf_find 4.21% [kernel] [k] memcpy 2.23% [kernel] [k] kmem_cache_alloc 1.90% [kernel] [k] _raw_spin_lock_irqsave 1.83% [kernel] [k] __d_lookup 1.67% [kernel] [k] _raw_spin_unlock_irqrestore 1.52% [kernel] [k] __kmalloc 1.52% [xfs] [k] xfs_next_bit 1.36% [kernel] [k] kmem_cache_free 1.34% [xfs] [k] xfs_buf_offset 1.11% [kernel] [k] link_path_walk 1.11% [kernel] [k] memset 1.08% [kernel] [k] kfree 1.03% [xfs] [k] xfs_buf_item_size 1.03% [xfs] [k] xfs_dir2_node_addname_int 1.01% [kernel] [k] s_show 0.98% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf time: 1341306635 PerfTop: 1232 irqs/sec kernel:88.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.11% [unknown] [.] 
0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.87% [kernel] [k] page_fault 4.33% [xfs] [k] _xfs_buf_find 4.16% [kernel] [k] memcpy 2.24% [kernel] [k] kmem_cache_alloc 1.84% [kernel] [k] _raw_spin_lock_irqsave 1.82% [kernel] [k] __d_lookup 1.65% [kernel] [k] _raw_spin_unlock_irqrestore 1.50% [xfs] [k] xfs_next_bit 1.49% [kernel] [k] __kmalloc 1.34% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.13% [kernel] [k] link_path_walk 1.13% [kernel] [k] kfree 1.11% [kernel] [k] memset 1.06% [kernel] [k] s_show 1.03% [xfs] [k] xfs_buf_item_size 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf time: 1341306639 PerfTop: 1444 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.19% [unknown] [.] 0x00007f84a7766b99 5.10% [kernel] [k] _raw_spin_lock 4.95% [kernel] [k] page_fault 4.40% [xfs] [k] _xfs_buf_find 4.10% [kernel] [k] memcpy 2.20% [kernel] [k] kmem_cache_alloc 1.93% [kernel] [k] _raw_spin_lock_irqsave 1.81% [kernel] [k] __d_lookup 1.59% [kernel] [k] _raw_spin_unlock_irqrestore 1.50% [xfs] [k] xfs_next_bit 1.48% [kernel] [k] __kmalloc 1.37% [kernel] [k] kmem_cache_free 1.36% [xfs] [k] xfs_buf_offset 1.15% [kernel] [k] memset 1.12% [kernel] [k] s_show 1.12% [kernel] [k] link_path_walk 1.10% [kernel] [k] kfree 1.02% [xfs] [k] xfs_buf_item_size 0.99% [xfs] [k] xfs_btree_lookup 0.97% [xfs] [k] xfs_dir2_node_addname_int 0.94% [xfs] [k] xfs_da_do_buf time: 1341306639 PerfTop: 1195 irqs/sec kernel:90.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.00% [unknown] [.] 
0x00007f84a7766b99 5.17% [kernel] [k] _raw_spin_lock 4.44% [xfs] [k] _xfs_buf_find 4.37% [kernel] [k] page_fault 4.37% [kernel] [k] memcpy 2.30% [kernel] [k] kmem_cache_alloc 1.90% [kernel] [k] _raw_spin_lock_irqsave 1.63% [kernel] [k] __d_lookup 1.62% [kernel] [k] _raw_spin_unlock_irqrestore 1.59% [xfs] [k] xfs_buf_offset 1.50% [kernel] [k] kmem_cache_free 1.50% [kernel] [k] __kmalloc 1.49% [xfs] [k] xfs_next_bit 1.33% [kernel] [k] memset 1.28% [kernel] [k] kfree 1.11% [xfs] [k] xfs_buf_item_size 1.07% [kernel] [k] s_show 1.07% [kernel] [k] link_path_walk 0.93% [xfs] [k] xfs_btree_lookup 0.90% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_dir2_node_addname_int time: 1341306645 PerfTop: 1097 irqs/sec kernel:95.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 7.51% [unknown] [.] 0x00007f84a7766b99 5.02% [kernel] [k] _raw_spin_lock 4.63% [kernel] [k] memcpy 4.51% [xfs] [k] _xfs_buf_find 3.32% [kernel] [k] page_fault 2.37% [kernel] [k] kmem_cache_alloc 1.87% [kernel] [k] _raw_spin_lock_irqsave 1.77% [xfs] [k] xfs_buf_offset 1.75% [kernel] [k] __kmalloc 1.73% [xfs] [k] xfs_next_bit 1.65% [kernel] [k] memset 1.60% [kernel] [k] _raw_spin_unlock_irqrestore 1.53% [kernel] [k] kfree 1.52% [kernel] [k] kmem_cache_free 1.44% [xfs] [k] xfs_trans_ail_cursor_first 1.26% [kernel] [k] __d_lookup 1.26% [xfs] [k] xfs_buf_item_size 1.05% [kernel] [k] s_show 1.03% [xfs] [k] xfs_buf_item_format 0.92% [kernel] [k] __d_lookup_rcu 0.87% [kernel] [k] link_path_walk time: 1341306645 PerfTop: 1038 irqs/sec kernel:95.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.89% [unknown] [.] 
0x00007f84a7766b99 5.18% [kernel] [k] memcpy 4.60% [xfs] [k] _xfs_buf_find 4.52% [kernel] [k] _raw_spin_lock 2.60% [kernel] [k] page_fault 2.42% [kernel] [k] kmem_cache_alloc 1.99% [kernel] [k] __kmalloc 1.96% [xfs] [k] xfs_next_bit 1.93% [xfs] [k] xfs_buf_offset 1.84% [kernel] [k] _raw_spin_lock_irqsave 1.83% [kernel] [k] memset 1.80% [kernel] [k] kmem_cache_free 1.68% [kernel] [k] kfree 1.47% [xfs] [k] xfs_buf_item_size 1.45% [kernel] [k] _raw_spin_unlock_irqrestore 1.18% [kernel] [k] __d_lookup_rcu 1.13% [xfs] [k] xfs_trans_ail_cursor_first 1.12% [xfs] [k] xfs_buf_item_format 1.04% [kernel] [k] s_show 1.01% [kernel] [k] __d_lookup 0.93% [xfs] [k] xfs_da_do_buf time: 1341306645 PerfTop: 1087 irqs/sec kernel:96.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.27% [kernel] [k] memcpy 4.77% [unknown] [.] 0x00007f84a7766b99 4.69% [xfs] [k] _xfs_buf_find 4.56% [kernel] [k] _raw_spin_lock 2.47% [kernel] [k] kmem_cache_alloc 2.18% [xfs] [k] xfs_next_bit 2.11% [kernel] [k] page_fault 2.00% [xfs] [k] xfs_buf_offset 1.99% [kernel] [k] __kmalloc 1.96% [kernel] [k] kmem_cache_free 1.85% [kernel] [k] kfree 1.82% [kernel] [k] memset 1.75% [kernel] [k] _raw_spin_lock_irqsave 1.58% [xfs] [k] xfs_buf_item_size 1.41% [kernel] [k] _raw_spin_unlock_irqrestore 1.23% [kernel] [k] __d_lookup_rcu 1.21% [xfs] [k] xfs_buf_item_format 0.99% [kernel] [k] s_show 0.97% [xfs] [k] xfs_perag_put 0.92% [xfs] [k] xfs_da_do_buf 0.92% [xfs] [k] xfs_trans_ail_cursor_first time: 1341306651 PerfTop: 1157 irqs/sec kernel:96.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.53% [kernel] [k] memcpy 4.61% [xfs] [k] _xfs_buf_find 4.40% [kernel] [k] _raw_spin_lock 3.83% [unknown] [.] 
0x00007f84a7766b99 2.67% [kernel] [k] kmem_cache_alloc 2.32% [xfs] [k] xfs_next_bit 2.21% [kernel] [k] __kmalloc 2.21% [xfs] [k] xfs_buf_offset 2.19% [kernel] [k] kmem_cache_free 1.92% [kernel] [k] memset 1.89% [kernel] [k] kfree 1.80% [xfs] [k] xfs_buf_item_size 1.70% [kernel] [k] page_fault 1.62% [kernel] [k] _raw_spin_lock_irqsave 1.50% [kernel] [k] _raw_spin_unlock_irqrestore 1.30% [xfs] [k] xfs_buf_item_format 1.27% [kernel] [k] __d_lookup_rcu 0.97% [xfs] [k] xfs_da_do_buf 0.96% [xfs] [k] xlog_cil_prepare_log_vecs 0.93% [kernel] [k] s_show 0.93% [xfs] [k] xfs_perag_put time: 1341306651 PerfTop: 1073 irqs/sec kernel:95.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.76% [kernel] [k] memcpy 4.54% [xfs] [k] _xfs_buf_find 4.32% [kernel] [k] _raw_spin_lock 3.15% [unknown] [.] 0x00007f84a7766b99 2.77% [kernel] [k] kmem_cache_alloc 2.49% [xfs] [k] xfs_next_bit 2.36% [kernel] [k] __kmalloc 2.27% [kernel] [k] kmem_cache_free 2.20% [xfs] [k] xfs_buf_offset 1.88% [kernel] [k] memset 1.88% [kernel] [k] kfree 1.77% [xfs] [k] xfs_buf_item_size 1.62% [kernel] [k] _raw_spin_lock_irqsave 1.48% [kernel] [k] _raw_spin_unlock_irqrestore 1.42% [xfs] [k] xfs_buf_item_format 1.40% [kernel] [k] page_fault 1.39% [kernel] [k] __d_lookup_rcu 0.99% [xfs] [k] xfs_da_do_buf 0.96% [xfs] [k] xlog_cil_prepare_log_vecs 0.88% [kernel] [k] s_show 0.87% [xfs] [k] xfs_perag_put time: 1341306651 PerfTop: 492 irqs/sec kernel:85.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.74% [kernel] [k] memcpy 4.48% [xfs] [k] _xfs_buf_find 4.27% [kernel] [k] _raw_spin_lock 3.00% [unknown] [.] 0x00007f84a7766b99 2.76% [kernel] [k] kmem_cache_alloc 2.54% [xfs] [k] xfs_next_bit 2.39% [kernel] [k] __kmalloc 2.30% [kernel] [k] kmem_cache_free 2.20% [xfs] [k] xfs_buf_offset no symbols found in /bin/ps, maybe install a debug package? 
1.96% [kernel] [k] kfree 1.92% [kernel] [k] memset 1.75% [xfs] [k] xfs_buf_item_size 1.56% [kernel] [k] _raw_spin_lock_irqsave 1.48% [kernel] [k] _raw_spin_unlock_irqrestore 1.44% [xfs] [k] xfs_buf_item_format 1.39% [kernel] [k] __d_lookup_rcu 1.36% [kernel] [k] page_fault 0.99% [xfs] [k] xlog_cil_prepare_log_vecs 0.96% [xfs] [k] xfs_da_do_buf 0.86% [kernel] [k] s_show 0.85% [xfs] [k] xfs_perag_put time: 1341306657 PerfTop: 70 irqs/sec kernel:72.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.73% [kernel] [k] memcpy 4.47% [xfs] [k] _xfs_buf_find 4.27% [kernel] [k] _raw_spin_lock 2.99% [unknown] [.] 0x00007f84a7766b99 2.75% [kernel] [k] kmem_cache_alloc 2.53% [xfs] [k] xfs_next_bit 2.39% [kernel] [k] __kmalloc 2.30% [kernel] [k] kmem_cache_free 2.20% [xfs] [k] xfs_buf_offset 1.96% [kernel] [k] kfree 1.92% [kernel] [k] memset 1.75% [xfs] [k] xfs_buf_item_size 1.56% [kernel] [k] _raw_spin_lock_irqsave 1.49% [kernel] [k] _raw_spin_unlock_irqrestore 1.43% [xfs] [k] xfs_buf_item_format 1.38% [kernel] [k] __d_lookup_rcu 1.37% [kernel] [k] page_fault 0.98% [xfs] [k] xlog_cil_prepare_log_vecs 0.96% [xfs] [k] xfs_da_do_buf 0.89% [kernel] [k] s_show 0.85% [xfs] [k] xfs_perag_put time: 1341306657 PerfTop: 87 irqs/sec kernel:71.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.72% [kernel] [k] memcpy 4.45% [xfs] [k] _xfs_buf_find 4.25% [kernel] [k] _raw_spin_lock 2.99% [unknown] [.] 
0x00007f84a7766b99 2.74% [kernel] [k] kmem_cache_alloc 2.52% [xfs] [k] xfs_next_bit 2.38% [kernel] [k] __kmalloc 2.29% [kernel] [k] kmem_cache_free 2.19% [xfs] [k] xfs_buf_offset 1.95% [kernel] [k] kfree 1.91% [kernel] [k] memset 1.74% [xfs] [k] xfs_buf_item_size 1.56% [kernel] [k] _raw_spin_lock_irqsave 1.48% [kernel] [k] _raw_spin_unlock_irqrestore 1.43% [xfs] [k] xfs_buf_item_format 1.38% [kernel] [k] page_fault 1.38% [kernel] [k] __d_lookup_rcu 0.98% [xfs] [k] xlog_cil_prepare_log_vecs 0.96% [xfs] [k] xfs_da_do_buf 0.93% [kernel] [k] s_show 0.84% [xfs] [k] xfs_perag_put time: 1341306657 PerfTop: 88 irqs/sec kernel:68.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 5.69% [kernel] [k] memcpy 4.42% [xfs] [k] _xfs_buf_find 4.25% [kernel] [k] _raw_spin_lock 2.98% [unknown] [.] 0x00007f84a7766b99

--
Mel Gorman
SUSE Labs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
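Reading trends out of the flattened perf top dumps above by eye is tedious. Below is a throwaway sketch (not part of mmtests; the parser shape and helper names are mine) that diffs two snapshots. The `EARLY`/`LATE` excerpts are hand-copied from the first and a later snapshot in this mail, trimmed to two symbols to keep the example short.

```python
# Throwaway differ for perf top tables: each row is "pct module [k] symbol".
# EARLY/LATE are two-symbol excerpts hand-copied from the snapshots above.
EARLY = """\
4.62% [kernel] [k] page_fault
1.64% [xfs] [k] _xfs_buf_find
"""
LATE = """\
4.37% [kernel] [k] page_fault
4.44% [xfs] [k] _xfs_buf_find
"""

def parse_top(text):
    table = {}
    for line in text.strip().splitlines():
        pct, _module, _dso, symbol = line.split()
        table[symbol] = float(pct.rstrip("%"))
    return table

def deltas(early, late):
    # Positive delta: the symbol got more expensive as the run progressed
    syms = set(early) | set(late)
    return {s: round(late.get(s, 0.0) - early.get(s, 0.0), 2) for s in syms}

# _xfs_buf_find gains roughly 2.8 percentage points across the run while
# page_fault drifts slightly down, matching the eyeball impression above
for sym, change in sorted(deltas(parse_top(EARLY), parse_top(LATE)).items()):
    print(f"{sym} {change:+.2f}")
```

This only compares symbol percentages; irqs/sec and the kernel/user split would need the `PerfTop:` header lines as well.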
Subject: Re: [MMTests] IO metadata on XFS
From: Mel Gorman
Date: 2012-07-03 10:59 UTC
To: Dave Chinner
Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel,
    Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > Adding dri-devel and a few others because an i915 patch contributed to
> > the regression.
> >
> > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > you have lots of CPU power, there will be little difference in
> > > > > performance...
> > > >
> > > > When I checked it it could only be called twice, and we'd already
> > > > optimize away the second call. I'd definitely like to track down where
> > > > the performance changes happened, at least to a major version but even
> > > > better to a -rc or git commit.
> > > >
> > >
> > > By all means feel free to run the test yourself and run the bisection :)
> > >
> > > It's rare but on this occasion the test machine is idle so I started an
> > > automated git bisection. As you know the mileage with an automated bisect
> > > varies so it may or may not find the right commit. Test machine is sandy so
> > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > is the report of interest. The script is doing a full search between v3.3 and
> > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > apparently unrelated patch that caused the problem.
> > >
> >
> > It was obvious very quickly that there were two distinct regressions so I
> > ran two bisections. One led to an XFS patch and the other led to an i915
> > patch that enables RC6 to reduce power usage.
> >
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
>
> Doesn't seem to be the major cause of the regression. By itself, it
> has impact, but the majority comes from the XFS change...
>

The fact it has an impact at all is weird but let's see what the DRI folks
think about it.

> > [c999a223: xfs: introduce an allocation workqueue]
>
> Which indicates that there are workqueue scheduling issues, I think.
> The same amount of work is being done, but half of it is being
> pushed off into a workqueue to avoid stack overflow issues (*). I
> tested the above patch in anger on an 8p machine, similar to the
> machine you saw no regressions on, but the workload didn't drive it
> to being completely CPU bound (only about 90%) so the allocation
> work was probably always scheduled quickly.
>

What test were you using?

> How many worker threads have been spawned on these machines
> that are showing the regression?

20 or 21 generally. An example list as spotted by top looks like

kworker/0:0 kworker/0:1 kworker/0:2
kworker/1:0 kworker/1:1 kworker/1:2
kworker/2:0 kworker/2:1 kworker/2:2
kworker/3:0 kworker/3:1 kworker/3:2
kworker/4:0 kworker/4:1
kworker/5:0 kworker/5:1
kworker/6:0 kworker/6:1 kworker/6:2
kworker/7:0 kworker/7:1

There were 8 unbound workers.

> What is the context switch rate on the machines when the test is running?

This is vmstat from a vanilla kernel. The actual vmstat is after the --.
The information before that is recorded by mmtests to try and detect
whether there was jitter in the vmstat output. It's showing that there is
little or no jitter in this test.
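The jitter check and the average context-switch rates quoted after the logs can be reproduced from these annotated samples. A minimal sketch, assuming the "timestamp elapsed interval -- vmstat-fields" layout shown below (the helper names are mine, not mmtests'); the published averages were presumably computed over the full logs, so this only demonstrates the method on a three-sample excerpt:

```python
# Minimal parser for mmtests-annotated vmstat lines: the three fields before
# "--" are timestamp, elapsed time and sampling interval; the fields after it
# follow vmstat's usual column order, so "cs" is the 12th field (index 11).
SAMPLES = """\
1341306582.6713 1.8109 1.8109 -- 0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0
1341306584.6715 3.8112 2.0003 -- 1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0
1341306586.6718 5.8114 2.0003 -- 0 0 0 16047432 11460 104288 0 0 102 45 227 999 0 0 99 1 0
"""

def parse(text):
    rows = []
    for line in text.splitlines():
        prefix, _, vmstat = line.partition(" -- ")
        timestamp, elapsed, interval = (float(f) for f in prefix.split())
        fields = [int(f) for f in vmstat.split()]
        rows.append({"interval": interval, "cs": fields[11]})
    return rows

def avg_cs(rows):
    return sum(r["cs"] for r in rows) / len(rows)

def jittery(rows, expected=2.0, tol=0.1):
    # Skip the first sample: its interval is the start-up gap, not the cadence
    return any(abs(r["interval"] - expected) > tol for r in rows[1:])

rows = parse(SAMPLES)
print(round(avg_cs(rows), 2), jittery(rows))  # -> 711.0 False
```

Run over the complete VANILLA and REVERT-XFS logs, the same `avg_cs` computation is what the 4278.53 vs 1095 comparison below amounts to.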
VANILLA 1341306582.6713 1.8109 1.8109 -- 0 0 0 16050784 11448 104056 0 0 376 0 209 526 0 0 99 1 0 1341306584.6715 3.8112 2.0003 -- 1 0 0 16050628 11448 104064 0 0 0 0 121 608 0 0 100 0 0 1341306586.6718 5.8114 2.0003 -- 0 0 0 16047432 11460 104288 0 0 102 45 227 999 0 0 99 1 0 1341306588.6721 7.8117 2.0003 -- 1 0 0 16046944 11460 104292 0 0 0 0 120 663 0 0 100 0 0 1341306590.6723 9.8119 2.0002 -- 0 2 0 16045788 11476 104296 0 0 12 40 190 754 0 0 99 0 0 1341306592.6725 11.8121 2.0002 -- 0 1 0 15990236 12600 141724 0 0 19054 30 1400 2937 2 1 88 9 0 1341306594.6727 13.8124 2.0002 -- 1 0 0 15907628 12600 186360 0 0 1653 0 3117 6406 2 9 88 1 0 1341306596.6730 15.8127 2.0003 -- 0 0 0 15825964 12608 226636 0 0 15 11024 3073 6350 2 9 89 0 0 1341306598.6733 17.8130 2.0003 -- 1 0 0 15730420 12608 271632 0 0 0 3072 3461 7179 2 10 88 0 0 1341306600.6736 19.8132 2.0003 -- 1 0 0 15686200 12608 310816 0 0 0 12416 3093 6198 2 9 89 0 0 1341306602.6738 21.8135 2.0003 -- 2 0 0 15593588 12616 354928 0 0 0 32 3482 7146 2 11 87 0 0 1341306604.6741 23.8138 2.0003 -- 2 0 0 15562032 12616 393772 0 0 0 12288 3129 6330 2 10 89 0 0 1341306606.6744 25.8140 2.0002 -- 1 0 0 15458316 12624 438004 0 0 0 26 3471 7107 2 11 87 0 0 1341306608.6746 27.8142 2.0002 -- 1 0 0 15432024 12624 474244 0 0 0 12416 3011 6017 1 10 89 0 0 1341306610.6749 29.8145 2.0003 -- 2 0 0 15343280 12624 517696 0 0 0 24 3393 6826 2 11 87 0 0 1341306612.6751 31.8148 2.0002 -- 1 0 0 15311136 12632 551816 0 0 0 16502 2818 5653 2 9 88 1 0 1341306614.6754 33.8151 2.0003 -- 1 0 0 15220648 12632 594936 0 0 0 3584 3451 6779 2 11 87 0 0 1341306616.6755 35.8152 2.0001 -- 4 0 0 15221252 12632 649296 0 0 0 38559 4846 8709 2 15 78 6 0 1341306618.6758 37.8155 2.0003 -- 1 0 0 15177724 12640 668476 0 0 20 40679 2204 4067 1 5 89 5 0 1341306620.6761 39.8158 2.0003 -- 1 0 0 15090204 12640 711752 0 0 0 0 3316 6788 2 11 88 0 0 1341306622.6764 41.8160 2.0003 -- 1 0 0 15005356 12640 748532 0 0 0 12288 3073 6132 2 10 89 0 0 1341306624.6766 43.8163 
2.0002 -- 2 0 0 14913088 12648 791952 0 0 0 28 3408 6806 2 11 87 0 0 1341306626.6769 45.8166 2.0003 -- 1 0 0 14891512 12648 826328 0 0 0 12420 2906 5710 1 9 90 0 0 1341306628.6772 47.8168 2.0003 -- 1 0 0 14794316 12656 868936 0 0 0 26 3367 6798 2 11 87 0 0 1341306630.6774 49.8171 2.0003 -- 1 0 0 14769188 12656 905016 0 0 30 12324 3029 5876 2 10 89 0 0 1341306632.6777 51.8173 2.0002 -- 1 0 0 14679544 12656 947712 0 0 0 0 3399 6868 2 11 87 0 0 1341306634.6780 53.8176 2.0003 -- 1 0 0 14646156 12664 982032 0 0 0 14658 2987 5761 1 10 89 0 0 1341306636.6782 55.8179 2.0003 -- 1 0 0 14560504 12664 1023816 0 0 0 4404 3454 6876 2 11 87 0 0 1341306638.6783 57.8180 2.0001 -- 2 0 0 14533384 12664 1056812 0 0 0 15810 3002 5581 1 10 89 0 0 1341306640.6785 59.8182 2.0002 -- 1 0 0 14593332 12672 1027392 0 0 0 31790 3504 1811 1 13 78 8 0 1341306642.6787 61.8183 2.0001 -- 1 0 0 14686968 12672 1007604 0 0 0 14621 2434 1248 1 10 89 0 0 1341306644.6789 63.8185 2.0002 -- 1 1 0 15042476 12680 788104 0 0 0 36564 2809 1484 1 12 86 1 0 1341306646.6790 65.8187 2.0002 -- 1 0 0 15128292 12680 757948 0 0 0 26395 3050 1313 1 13 86 1 0 1341306648.6792 67.8189 2.0002 -- 1 0 0 15160036 12680 727964 0 0 0 5463 2752 910 1 12 87 0 0 1341306650.6795 69.8192 2.0003 -- 0 0 0 15633256 12688 332572 0 0 1156 12308 2117 2346 1 7 91 1 0 1341306652.6797 71.8194 2.0002 -- 0 0 0 15633892 12688 332652 0 0 0 0 224 758 0 0 100 0 0 1341306654.6800 73.8197 2.0003 -- 0 0 0 15633900 12688 332524 0 0 0 0 231 1009 0 0 100 0 0 1341306656.6803 75.8199 2.0003 -- 0 0 0 15637436 12696 332504 0 0 0 38 266 713 0 0 99 0 0 1341306658.6805 77.8202 2.0003 -- 0 0 0 15654180 12696 332352 0 0 0 0 270 821 0 0 100 0 0 REVERT-XFS 1341307733.8702 1.7941 1.7941 -- 0 0 0 16050640 12036 103996 0 0 372 0 216 752 0 0 99 1 0 1341307735.8704 3.7944 2.0002 -- 0 0 0 16050864 12036 104028 0 0 0 0 132 857 0 0 100 0 0 1341307737.8707 5.7946 2.0002 -- 0 0 0 16047492 12048 104252 0 0 102 37 255 938 0 0 99 1 0 1341307739.8709 7.7949 2.0003 -- 0 0 0 
16047600 12072 104324 0 0 32 2 129 658 0 0 100 0 0 1341307741.8712 9.7951 2.0002 -- 1 1 0 16046676 12080 104328 0 0 0 32 165 729 0 0 100 0 0 1341307743.8714 11.7954 2.0003 -- 0 1 0 15990840 13216 142612 0 0 19422 30 1467 3015 2 1 89 8 0 1341307745.8717 13.7956 2.0002 -- 0 0 0 15825496 13216 226396 0 0 1310 11214 2217 1348 2 8 89 1 0 1341307747.8717 15.7957 2.0001 -- 1 0 0 15677816 13224 314672 0 0 4 15294 2307 1173 2 9 89 0 0 1341307749.8719 17.7959 2.0002 -- 1 0 0 15524372 13224 409728 0 0 0 12288 2466 888 1 10 89 0 0 1341307751.8721 19.7960 2.0002 -- 1 0 0 15368424 13224 502552 0 0 0 12416 2312 878 1 10 89 0 0 1341307753.8722 21.7962 2.0002 -- 1 0 0 15225216 13232 593092 0 0 0 12448 2539 1380 1 10 88 0 0 1341307755.8724 23.7963 2.0002 -- 2 0 0 15163712 13232 664768 0 0 0 32160 2184 1177 1 8 90 0 0 1341307757.8727 25.7967 2.0003 -- 1 0 0 14973888 13240 755080 0 0 0 12316 2482 1219 1 10 89 0 0 1341307759.8728 27.7968 2.0001 -- 1 0 0 14883580 13240 840036 0 0 0 44471 2711 1234 2 10 88 0 0 1341307761.8730 29.7970 2.0002 -- 1 0 0 14800304 13240 920504 0 0 0 42554 2571 1050 1 10 89 0 0 1341307763.8734 31.7973 2.0003 -- 0 0 0 14642504 13248 995004 0 0 0 3232 2276 1081 1 8 90 0 0 1341307765.8737 33.7976 2.0003 -- 1 0 0 14545072 13248 1052536 0 0 0 18688 2628 1114 1 9 89 0 0 1341307767.8739 35.7979 2.0003 -- 1 0 0 14783848 13248 926824 0 0 0 59559 2409 1308 0 10 89 1 0 1341307769.8740 37.7980 2.0001 -- 2 0 0 14854800 13256 896832 0 0 0 9172 2419 1004 1 10 89 1 0 1341307771.8742 39.7981 2.0002 -- 2 0 0 14835084 13256 875612 0 0 0 12288 2524 812 0 11 89 0 0 1341307773.8743 41.7983 2.0002 -- 2 0 0 15126252 13256 745844 0 0 0 10297 2714 1163 1 12 88 0 0 1341307775.8745 43.7985 2.0002 -- 1 0 0 15108800 13264 724544 0 0 0 12316 2499 931 1 11 88 0 0 1341307777.8746 45.7986 2.0001 -- 2 0 0 15226236 13264 694580 0 0 0 12416 2700 1194 1 12 88 0 0 1341307779.8750 47.7989 2.0003 -- 1 0 0 15697632 13264 300716 0 0 1156 0 934 1701 0 2 96 1 0 1341307781.8752 49.7992 2.0003 -- 0 0 0 
15697508 13272 300720 0 0 0 66 166 641 0 0 100 0 0
1341307783.8755 51.7995 2.0003 -- 0 0 0 15699008 13272 300524 0 0 0 0 248 865 0 0 100 0 0
1341307785.8758 53.7997 2.0003 -- 0 0 0 15702452 13272 300520 0 0 0 0 285 960 0 0 99 0 0
1341307787.8760 55.7999 2.0002 -- 0 0 0 15719404 13280 300436 0 0 0 26 136 590 0 0 99 0 0

Vanilla average context switch rate 4278.53
Revert  average context switch rate 1095

> Can you run latencytop to see
> if there is excessive starvation/wait times for allocation
> completion?

I'm not sure what format you are looking for. latencytop is poorly
suited to capturing information throughout a test and it does not
easily allow you to record a snapshot of a test. You can record all the
console output of course, but that's a complete mess. I tried capturing
/proc/latency_stats over time instead because that can be trivially
sorted on a system-wide basis, but as I write this I find that
latency_stats was bust. It was just spitting out "Latency Top version :
v0.1" and nothing else. Either latency_stats is broken or my config is.
I'm not sure which it is right now and won't get enough time on this
today to pinpoint it.

> A perf top profile comparison might be informative,
> too...

I'm not sure if this is what you really wanted. I thought an oprofile
or perf report would have made more sense, but I recorded perf top over
time anyway and it's at the end of the mail. The timestamp information
is poor because the perf top output was buffered, so it would receive a
bunch of updates at once. Each sample should be roughly 2 seconds
apart. This buffering can be dealt with; I just failed to do it in
advance and I do not think it's necessary to rerun the tests for it.

> (*) The stack usage below submit_bio() can be more than 5k (DM, MD,
> SCSI, driver, memory allocation), so it's really not safe to do
> allocation anywhere below about 3k of kernel stack being used. e.g.
> on a relatively trivial storage setup without the above commit:
>
> [142296.384921] flush-253:4 used greatest stack depth: 360 bytes left
>
> Fundamentally, 8k stacks on x86-64 are too small for our
> increasingly complex storage layers and the 100+ function deep call
> chains that occur.

I understand the patch's motivation. For these tests I'm deliberately
being a bit of a dummy and just capturing information. This might allow
me to actually get through all the results, identify some of the
problems and spread them around a bit. Either that or I need to clone
myself a few times to tackle each of the problems in a reasonable
timeframe :)

For just these XFS tests I've uploaded a tarball of the logs to

http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Results with no monitor can be found in paths like

default/no-monitor/sandy/fsmark-single-3.4.0-vanilla/noprofile/fsmark.log

Results with monitors attached are under run-monitor. For example, the
iostat logs can be read from

default/run-monitor/sandy/iostat-3.4.0-vanilla-fsmark-single

Some of the monitor logs are gzipped.

This is perf top over time for the vanilla kernel

time: 1341306570 time: 1341306579 PerfTop: 1 irqs/sec kernel: 0.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 61.85% [kernel] [k] __rmqueue 38.15% libc-2.11.3.so [.] _IO_vfscanf time: 1341306579 PerfTop: 3 irqs/sec kernel:66.7% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 19.88% [kernel] [k] _raw_spin_lock_irqsave 17.14% [kernel] [k] __rmqueue 16.96% [kernel] [k] format_decode 15.37% libc-2.11.3.so [.] __tzfile_compute 13.55% [kernel] [k] copy_user_generic_string 10.57% libc-2.11.3.so [.] 
_IO_vfscanf 6.53% [kernel] [k] find_first_bit time: 1341306579 PerfTop: 0 irqs/sec kernel:-nan% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 17.51% [kernel] [k] _raw_spin_lock_irqsave 15.10% [kernel] [k] __rmqueue 14.94% [kernel] [k] format_decode 13.54% libc-2.11.3.so [.] __tzfile_compute 11.94% [kernel] [k] copy_user_generic_string 11.90% [kernel] [k] _raw_spin_lock 9.31% libc-2.11.3.so [.] _IO_vfscanf 5.75% [kernel] [k] find_first_bit time: 1341306579 PerfTop: 41 irqs/sec kernel:58.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 13.62% [kernel] [k] _raw_spin_lock_irqsave 11.02% [kernel] [k] __rmqueue 10.91% [kernel] [k] format_decode 9.89% libc-2.11.3.so [.] __tzfile_compute 8.72% [kernel] [k] copy_user_generic_string 8.69% [kernel] [k] _raw_spin_lock 7.15% libc-2.11.3.so [.] _IO_vfscanf 4.20% [kernel] [k] find_first_bit 1.47% libc-2.11.3.so [.] __strcmp_sse42 1.37% libc-2.11.3.so [.] __strchr_sse42 1.19% sed [.] 0x0000000000009f7d 0.90% libc-2.11.3.so [.] vfprintf 0.84% [kernel] [k] hrtimer_interrupt 0.84% libc-2.11.3.so [.] re_string_realloc_buffers 0.76% [kernel] [k] enqueue_entity 0.66% [kernel] [k] __switch_to 0.65% libc-2.11.3.so [.] _IO_default_xsputn 0.62% [kernel] [k] do_vfs_ioctl 0.59% [kernel] [k] perf_event_mmap_event 0.56% gzip [.] 0x0000000000007b96 0.55% libc-2.11.3.so [.] bsearch time: 1341306579 PerfTop: 35 irqs/sec kernel:62.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.50% [kernel] [k] _raw_spin_lock_irqsave 9.22% [kernel] [k] __rmqueue 9.13% [kernel] [k] format_decode 8.27% libc-2.11.3.so [.] __tzfile_compute 7.92% [kernel] [k] copy_user_generic_string 7.74% [kernel] [k] _raw_spin_lock 6.21% libc-2.11.3.so [.] _IO_vfscanf 3.51% [kernel] [k] find_first_bit 1.44% gzip [.] 
0x0000000000007b96 1.23% libc-2.11.3.so [.] __strcmp_sse42 1.15% libc-2.11.3.so [.] __strchr_sse42 1.06% libc-2.11.3.so [.] vfprintf 0.99% sed [.] 0x0000000000009f7d 0.92% [unknown] [.] 0x00007f84a7766b99 0.70% [kernel] [k] hrtimer_interrupt 0.70% libc-2.11.3.so [.] re_string_realloc_buffers 0.64% [kernel] [k] enqueue_entity 0.58% libtcl8.5.so [.] 0x000000000006fe86 0.55% [kernel] [k] __switch_to 0.54% libc-2.11.3.so [.] _IO_default_xsputn 0.53% [kernel] [k] __d_lookup_rcu time: 1341306585 PerfTop: 100 irqs/sec kernel:59.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 8.61% [kernel] [k] _raw_spin_lock_irqsave 5.92% [kernel] [k] __rmqueue 5.86% [kernel] [k] format_decode 5.31% libc-2.11.3.so [.] __tzfile_compute 5.30% [kernel] [k] copy_user_generic_string 5.27% [kernel] [k] _raw_spin_lock 3.99% libc-2.11.3.so [.] _IO_vfscanf 2.45% [unknown] [.] 0x00007f84a7766b99 2.26% [kernel] [k] find_first_bit 1.68% [kernel] [k] page_fault 1.45% libc-2.11.3.so [.] _int_malloc 1.28% gzip [.] 0x0000000000007b96 1.13% libc-2.11.3.so [.] vfprintf 1.06% libc-2.11.3.so [.] __strchr_sse42 1.02% perl [.] 0x0000000000044505 0.79% libc-2.11.3.so [.] __strcmp_sse42 0.79% [kernel] [k] do_task_stat 0.77% [kernel] [k] zap_pte_range 0.72% libc-2.11.3.so [.] __gconv_transform_utf8_internal 0.70% libc-2.11.3.so [.] malloc 0.70% libc-2.11.3.so [.] __mbrtowc time: 1341306585 PerfTop: 19 irqs/sec kernel:78.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 7.97% [kernel] [k] _raw_spin_lock_irqsave 5.48% [kernel] [k] __rmqueue 5.43% [kernel] [k] format_decode 5.24% [kernel] [k] copy_user_generic_string 5.18% [kernel] [k] _raw_spin_lock 4.92% libc-2.11.3.so [.] __tzfile_compute 4.25% libc-2.11.3.so [.] _IO_vfscanf 2.33% [unknown] [.] 0x00007f84a7766b99 2.12% [kernel] [k] page_fault 2.09% [kernel] [k] find_first_bit 1.34% libc-2.11.3.so [.] 
_int_malloc 1.19% gzip [.] 0x0000000000007b96 1.05% libc-2.11.3.so [.] vfprintf 0.98% libc-2.11.3.so [.] __strchr_sse42 0.94% perl [.] 0x0000000000044505 0.94% libc-2.11.3.so [.] _dl_addr 0.91% [kernel] [k] zap_pte_range 0.74% [kernel] [k] s_show 0.73% libc-2.11.3.so [.] __strcmp_sse42 0.73% [kernel] [k] do_task_stat 0.67% libc-2.11.3.so [.] __gconv_transform_utf8_internal time: 1341306585 PerfTop: 38 irqs/sec kernel:68.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 7.64% [kernel] [k] _raw_spin_lock_irqsave 4.89% [kernel] [k] _raw_spin_lock 4.77% [kernel] [k] __rmqueue 4.72% [kernel] [k] format_decode 4.56% [kernel] [k] copy_user_generic_string 4.53% libc-2.11.3.so [.] _IO_vfscanf 4.28% libc-2.11.3.so [.] __tzfile_compute 2.52% [unknown] [.] 0x00007f84a7766b99 no symbols found in /bin/sort, maybe install a debug package? 2.10% [kernel] [k] page_fault 1.82% [kernel] [k] find_first_bit 1.31% libc-2.11.3.so [.] _int_malloc 1.14% libc-2.11.3.so [.] vfprintf 1.08% libc-2.11.3.so [.] _dl_addr 1.07% [kernel] [k] s_show 1.05% libc-2.11.3.so [.] __strchr_sse42 1.03% gzip [.] 0x0000000000007b96 0.82% [kernel] [k] do_task_stat 0.82% perl [.] 0x0000000000044505 0.79% [kernel] [k] zap_pte_range 0.70% [kernel] [k] seq_put_decimal_ull 0.69% [kernel] [k] find_busiest_group time: 1341306591 PerfTop: 66 irqs/sec kernel:59.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 6.52% [kernel] [k] _raw_spin_lock_irqsave 4.11% libc-2.11.3.so [.] _IO_vfscanf 3.91% [kernel] [k] _raw_spin_lock 3.50% [kernel] [k] copy_user_generic_string 3.41% [kernel] [k] __rmqueue 3.38% [kernel] [k] format_decode 3.06% libc-2.11.3.so [.] __tzfile_compute 2.90% [unknown] [.] 0x00007f84a7766b99 2.30% [kernel] [k] page_fault 2.20% perl [.] 0x0000000000044505 1.83% libc-2.11.3.so [.] vfprintf 1.61% libc-2.11.3.so [.] 
_int_malloc 1.30% [kernel] [k] find_first_bit 1.22% libc-2.11.3.so [.] _dl_addr 1.19% libc-2.11.3.so [.] __gconv_transform_utf8_internal 1.10% libc-2.11.3.so [.] __strchr_sse42 1.01% [kernel] [k] zap_pte_range 0.99% [kernel] [k] s_show 0.98% [kernel] [k] __percpu_counter_add 0.86% [kernel] [k] __strnlen_user 0.75% ld-2.11.3.so [.] do_lookup_x time: 1341306591 PerfTop: 39 irqs/sec kernel:69.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 6.26% [kernel] [k] _raw_spin_lock_irqsave 4.05% [kernel] [k] _raw_spin_lock 3.86% libc-2.11.3.so [.] _IO_vfscanf 3.21% [kernel] [k] copy_user_generic_string 3.03% [kernel] [k] __rmqueue 3.00% [kernel] [k] format_decode 2.93% [unknown] [.] 0x00007f84a7766b99 2.72% libc-2.11.3.so [.] __tzfile_compute 2.20% [kernel] [k] page_fault 1.96% perl [.] 0x0000000000044505 1.77% libc-2.11.3.so [.] vfprintf 1.43% libc-2.11.3.so [.] _int_malloc 1.16% [kernel] [k] find_first_bit 1.09% libc-2.11.3.so [.] _dl_addr 1.06% libc-2.11.3.so [.] __gconv_transform_utf8_internal 1.02% [kernel] [k] s_show 0.98% libc-2.11.3.so [.] __strchr_sse42 0.93% gzip [.] 0x0000000000007b96 0.90% [kernel] [k] zap_pte_range 0.87% [kernel] [k] __percpu_counter_add 0.76% [kernel] [k] __strnlen_user time: 1341306591 PerfTop: 185 irqs/sec kernel:70.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 4.81% [kernel] [k] _raw_spin_lock_irqsave 3.60% [unknown] [.] 0x00007f84a7766b99 3.10% [kernel] [k] _raw_spin_lock 3.04% [kernel] [k] page_fault 2.66% libc-2.11.3.so [.] _IO_vfscanf 2.14% [kernel] [k] copy_user_generic_string 2.11% [kernel] [k] format_decode 1.96% [kernel] [k] __rmqueue 1.86% libc-2.11.3.so [.] _dl_addr 1.76% libc-2.11.3.so [.] __tzfile_compute 1.26% perl [.] 0x0000000000044505 1.19% libc-2.11.3.so [.] __mbrtowc 1.14% libc-2.11.3.so [.] vfprintf 1.12% libc-2.11.3.so [.] _int_malloc 1.09% gzip [.] 
0x0000000000007b96 0.95% libc-2.11.3.so [.] __gconv_transform_utf8_internal 0.88% [kernel] [k] _raw_spin_unlock_irqrestore 0.87% [kernel] [k] __strnlen_user 0.82% [kernel] [k] clear_page_c 0.77% [kernel] [k] __schedule 0.76% [kernel] [k] find_get_page time: 1341306595 PerfTop: 385 irqs/sec kernel:48.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 27.20% cc1 [.] 0x0000000000210978 3.01% [unknown] [.] 0x00007f84a7766b99 2.18% [kernel] [k] page_fault 1.96% libbfd-2.21.so [.] 0x00000000000b9cdd 1.95% [kernel] [k] _raw_spin_lock_irqsave 1.91% ld.bfd [.] 0x000000000000e3b9 1.85% [kernel] [k] _raw_spin_lock 1.31% [kernel] [k] copy_user_generic_string 1.20% libbfd-2.21.so [.] bfd_hash_lookup 1.10% libc-2.11.3.so [.] __strcmp_sse42 0.93% libc-2.11.3.so [.] _IO_vfscanf 0.85% [kernel] [k] _raw_spin_unlock_irqrestore 0.82% libc-2.11.3.so [.] _int_malloc 0.80% [kernel] [k] __rmqueue 0.79% [kernel] [k] kmem_cache_alloc 0.74% [kernel] [k] format_decode 0.71% libc-2.11.3.so [.] _dl_addr 0.62% libbfd-2.21.so [.] _bfd_final_link_relocate 0.61% libc-2.11.3.so [.] __tzfile_compute 0.61% libc-2.11.3.so [.] vfprintf 0.59% [kernel] [k] find_busiest_group time: 1341306595 PerfTop: 1451 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 9.75% cc1 [.] 0x0000000000210978 8.81% [unknown] [.] 
0x00007f84a7766b99 4.62% [kernel] [k] page_fault 3.61% [kernel] [k] _raw_spin_lock 2.67% [kernel] [k] memcpy 2.03% [kernel] [k] _raw_spin_lock_irqsave 2.00% [kernel] [k] kmem_cache_alloc 1.64% [xfs] [k] _xfs_buf_find 1.31% [kernel] [k] _raw_spin_unlock_irqrestore 1.16% [kernel] [k] kmem_cache_free 1.15% [xfs] [k] xfs_next_bit 0.98% [kernel] [k] __d_lookup 0.89% [xfs] [k] xfs_da_do_buf 0.83% [kernel] [k] memset 0.80% [xfs] [k] xfs_dir2_node_addname_int 0.79% [kernel] [k] link_path_walk 0.76% [xfs] [k] xfs_buf_item_size no symbols found in /usr/bin/tee, maybe install a debug package? no symbols found in /bin/date, maybe install a debug package? 0.73% [xfs] [k] xfs_buf_offset 0.71% [kernel] [k] __kmalloc 0.70% [kernel] [k] kfree 0.70% libbfd-2.21.so [.] 0x00000000000b9cdd time: 1341306601 PerfTop: 1267 irqs/sec kernel:85.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.81% [unknown] [.] 0x00007f84a7766b99 5.98% cc1 [.] 0x0000000000210978 5.20% [kernel] [k] page_fault 3.54% [kernel] [k] _raw_spin_lock 3.37% [kernel] [k] memcpy 2.03% [kernel] [k] kmem_cache_alloc 1.91% [kernel] [k] _raw_spin_lock_irqsave 1.75% [xfs] [k] _xfs_buf_find 1.35% [kernel] [k] _raw_spin_unlock_irqrestore 1.28% [xfs] [k] xfs_next_bit 1.14% [kernel] [k] kmem_cache_free 1.13% [kernel] [k] __kmalloc 1.12% [kernel] [k] __d_lookup 0.97% [xfs] [k] xfs_dir2_node_addname_int 0.96% [xfs] [k] xfs_buf_offset 0.95% [kernel] [k] memset 0.91% [kernel] [k] link_path_walk 0.88% [xfs] [k] xfs_da_do_buf 0.85% [kernel] [k] kfree 0.84% [xfs] [k] xfs_buf_item_size 0.74% [xfs] [k] xfs_btree_lookup time: 1341306601 PerfTop: 1487 irqs/sec kernel:85.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.84% [unknown] [.] 0x00007f84a7766b99 5.15% [kernel] [k] page_fault 3.93% cc1 [.] 
0x0000000000210978 3.76% [kernel] [k] _raw_spin_lock 3.50% [kernel] [k] memcpy 2.13% [kernel] [k] kmem_cache_alloc 1.91% [xfs] [k] _xfs_buf_find 1.79% [kernel] [k] _raw_spin_lock_irqsave 1.52% [kernel] [k] __kmalloc 1.33% [kernel] [k] kmem_cache_free 1.32% [kernel] [k] _raw_spin_unlock_irqrestore 1.29% [kernel] [k] __d_lookup 1.27% [xfs] [k] xfs_next_bit 1.11% [kernel] [k] link_path_walk 1.01% [xfs] [k] xfs_buf_offset 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_da_do_buf 0.97% [kernel] [k] kfree 0.96% [kernel] [k] memset 0.84% [xfs] [k] xfs_btree_lookup 0.82% [xfs] [k] xfs_buf_item_format time: 1341306601 PerfTop: 1291 irqs/sec kernel:85.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.21% [unknown] [.] 0x00007f84a7766b99 5.18% [kernel] [k] page_fault 3.83% [kernel] [k] _raw_spin_lock 3.67% [kernel] [k] memcpy 2.92% cc1 [.] 0x0000000000210978 2.28% [kernel] [k] kmem_cache_alloc 2.18% [xfs] [k] _xfs_buf_find 1.66% [kernel] [k] _raw_spin_lock_irqsave 1.56% [kernel] [k] __kmalloc 1.43% [kernel] [k] __d_lookup 1.43% [kernel] [k] _raw_spin_unlock_irqrestore 1.40% [kernel] [k] kmem_cache_free 1.29% [xfs] [k] xfs_next_bit 1.13% [xfs] [k] xfs_buf_offset 1.07% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_da_do_buf 1.01% [kernel] [k] memset 1.01% [kernel] [k] kfree 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.87% [xfs] [k] xfs_buf_item_size 0.84% [xfs] [k] xfs_btree_lookup time: 1341306607 PerfTop: 1435 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.06% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] page_fault 3.88% [kernel] [k] _raw_spin_lock 3.83% [kernel] [k] memcpy 2.41% [xfs] [k] _xfs_buf_find 2.35% [kernel] [k] kmem_cache_alloc 2.19% cc1 [.] 
0x0000000000210978 1.68% [kernel] [k] _raw_spin_lock_irqsave 1.55% [kernel] [k] __kmalloc 1.48% [kernel] [k] __d_lookup 1.43% [kernel] [k] kmem_cache_free 1.42% [kernel] [k] _raw_spin_unlock_irqrestore 1.37% [xfs] [k] xfs_next_bit 1.27% [xfs] [k] xfs_buf_offset 1.12% [kernel] [k] link_path_walk 1.09% [kernel] [k] kfree 1.08% [kernel] [k] memset 1.04% [xfs] [k] xfs_da_do_buf 0.99% [xfs] [k] xfs_dir2_node_addname_int 0.92% [xfs] [k] xfs_buf_item_size 0.89% [xfs] [k] xfs_btree_lookup time: 1341306607 PerfTop: 1281 irqs/sec kernel:87.0% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.00% [unknown] [.] 0x00007f84a7766b99 5.44% [kernel] [k] page_fault 4.04% [kernel] [k] _raw_spin_lock 3.94% [kernel] [k] memcpy 2.51% [xfs] [k] _xfs_buf_find 2.32% [kernel] [k] kmem_cache_alloc 1.75% cc1 [.] 0x0000000000210978 1.66% [kernel] [k] _raw_spin_lock_irqsave 1.58% [kernel] [k] __d_lookup 1.56% [kernel] [k] __kmalloc 1.46% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] kmem_cache_free 1.41% [kernel] [k] _raw_spin_unlock_irqrestore 1.34% [xfs] [k] xfs_buf_offset 1.20% [kernel] [k] link_path_walk 1.16% [kernel] [k] kfree 1.11% [kernel] [k] memset 1.04% [xfs] [k] xfs_dir2_node_addname_int 0.94% [xfs] [k] xfs_da_do_buf 0.92% [xfs] [k] xfs_btree_lookup 0.89% [xfs] [k] xfs_buf_item_size time: 1341306607 PerfTop: 1455 irqs/sec kernel:86.8% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.14% [unknown] [.] 0x00007f84a7766b99 5.36% [kernel] [k] page_fault 4.12% [kernel] [k] _raw_spin_lock 4.02% [kernel] [k] memcpy 2.54% [xfs] [k] _xfs_buf_find 2.41% [kernel] [k] kmem_cache_alloc 1.69% [kernel] [k] _raw_spin_lock_irqsave 1.56% [kernel] [k] __kmalloc 1.49% [xfs] [k] xfs_next_bit 1.47% [kernel] [k] __d_lookup 1.42% [kernel] [k] _raw_spin_unlock_irqrestore 1.39% [xfs] [k] xfs_buf_offset 1.39% cc1 [.] 
0x0000000000210978 1.37% [kernel] [k] kmem_cache_free 1.24% [kernel] [k] link_path_walk 1.17% [kernel] [k] memset 1.16% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.92% [xfs] [k] xfs_da_do_buf 0.91% [xfs] [k] xfs_btree_lookup time: 1341306613 PerfTop: 1245 irqs/sec kernel:87.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.05% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] page_fault 4.10% [kernel] [k] _raw_spin_lock 4.06% [kernel] [k] memcpy 2.74% [xfs] [k] _xfs_buf_find 2.40% [kernel] [k] kmem_cache_alloc 1.64% [kernel] [k] _raw_spin_lock_irqsave 1.58% [xfs] [k] xfs_next_bit 1.54% [kernel] [k] __kmalloc 1.49% [kernel] [k] __d_lookup 1.45% [kernel] [k] _raw_spin_unlock_irqrestore 1.41% [kernel] [k] kmem_cache_free 1.35% [xfs] [k] xfs_buf_offset 1.25% [kernel] [k] link_path_walk 1.22% [kernel] [k] kfree 1.16% [kernel] [k] memset 1.15% cc1 [.] 0x0000000000210978 1.02% [xfs] [k] xfs_buf_item_size 1.00% [xfs] [k] xfs_dir2_node_addname_int 0.92% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_da_do_buf time: 1341306613 PerfTop: 1433 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 12.04% [unknown] [.] 0x00007f84a7766b99 5.30% [kernel] [k] page_fault 4.08% [kernel] [k] memcpy 4.07% [kernel] [k] _raw_spin_lock 2.88% [xfs] [k] _xfs_buf_find 2.50% [kernel] [k] kmem_cache_alloc 1.72% [kernel] [k] _raw_spin_lock_irqsave 1.68% [xfs] [k] xfs_next_bit 1.56% [kernel] [k] __d_lookup 1.54% [kernel] [k] __kmalloc 1.48% [kernel] [k] _raw_spin_unlock_irqrestore 1.46% [kernel] [k] kmem_cache_free 1.40% [xfs] [k] xfs_buf_offset 1.25% [kernel] [k] link_path_walk 1.21% [kernel] [k] memset 1.18% [kernel] [k] kfree 1.04% [xfs] [k] xfs_buf_item_size 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.95% [xfs] [k] xfs_btree_lookup 0.94% cc1 [.] 
0x0000000000210978 0.90% [xfs] [k] xfs_da_do_buf time: 1341306613 PerfTop: 1118 irqs/sec kernel:87.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- no symbols found in /usr/bin/vmstat, maybe install a debug package? 12.03% [unknown] [.] 0x00007f84a7766b99 5.48% [kernel] [k] page_fault 4.21% [kernel] [k] memcpy 4.11% [kernel] [k] _raw_spin_lock 2.98% [xfs] [k] _xfs_buf_find 2.47% [kernel] [k] kmem_cache_alloc 1.81% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.48% [kernel] [k] kmem_cache_free 1.48% [kernel] [k] __d_lookup 1.47% [kernel] [k] _raw_spin_unlock_irqrestore 1.39% [xfs] [k] xfs_buf_offset 1.23% [kernel] [k] link_path_walk 1.19% [kernel] [k] memset 1.19% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.98% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_buf_item_format 0.91% [xfs] [k] xfs_da_do_buf time: 1341306617 PerfTop: 1454 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.93% [unknown] [.] 
0x00007f84a7766b99 5.42% [kernel] [k] page_fault 4.28% [kernel] [k] memcpy 4.20% [kernel] [k] _raw_spin_lock 3.15% [xfs] [k] _xfs_buf_find 2.52% [kernel] [k] kmem_cache_alloc 1.76% [kernel] [k] _raw_spin_lock_irqsave 1.72% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.51% [kernel] [k] __kmalloc 1.49% [kernel] [k] _raw_spin_unlock_irqrestore 1.48% [kernel] [k] kmem_cache_free 1.40% [xfs] [k] xfs_buf_offset 1.29% [kernel] [k] memset 1.20% [kernel] [k] link_path_walk 1.17% [kernel] [k] kfree 1.09% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_buf_item_size 0.95% [xfs] [k] xfs_btree_lookup 0.94% [xfs] [k] xfs_da_do_buf 0.91% [xfs] [k] xfs_buf_item_format time: 1341306617 PerfTop: 1758 irqs/sec kernel:90.2% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.99% [unknown] [.] 0x00007f84a7766b99 5.40% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.04% [kernel] [k] memcpy 3.86% [xfs] [k] _xfs_buf_find 2.31% [kernel] [k] kmem_cache_alloc 2.03% [kernel] [k] _raw_spin_lock_irqsave 1.67% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [kernel] [k] __d_lookup 1.60% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.36% [xfs] [k] xfs_buf_offset 1.35% [kernel] [k] kmem_cache_free 1.17% [kernel] [k] kfree 1.16% [kernel] [k] memset 1.08% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.97% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_perag_put 0.90% [xfs] [k] xfs_buf_item_size 0.84% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1022 irqs/sec kernel:88.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 10.93% [unknown] [.] 
0x00007f84a7766b99 5.34% [kernel] [k] _raw_spin_lock 4.82% [kernel] [k] page_fault 4.01% [kernel] [k] memcpy 4.01% [xfs] [k] _xfs_buf_find 2.28% [kernel] [k] kmem_cache_alloc 2.00% [kernel] [k] _raw_spin_lock_irqsave 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.60% [xfs] [k] xfs_next_bit 1.59% [kernel] [k] __d_lookup 1.41% [kernel] [k] __kmalloc 1.39% [kernel] [k] kmem_cache_free 1.35% [xfs] [k] xfs_buf_offset 1.15% [kernel] [k] kfree 1.15% [kernel] [k] memset 1.09% [kernel] [k] link_path_walk 0.98% [xfs] [k] xfs_dir2_node_addname_int 0.98% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_perag_put 0.88% [xfs] [k] xfs_buf_item_size 0.86% [xfs] [k] xfs_da_do_buf time: 1341306623 PerfTop: 1430 irqs/sec kernel:87.6% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.05% [unknown] [.] 0x00007f84a7766b99 5.24% [kernel] [k] _raw_spin_lock 4.89% [kernel] [k] page_fault 4.13% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.35% [kernel] [k] kmem_cache_alloc 1.95% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.54% [xfs] [k] xfs_next_bit 1.42% [kernel] [k] __kmalloc 1.41% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.10% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_btree_lookup 0.91% [xfs] [k] xfs_buf_item_size 0.87% [xfs] [k] xfs_da_do_buf 0.87% [xfs] [k] xfs_perag_put time: 1341306623 PerfTop: 1267 irqs/sec kernel:87.1% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
0x00007f84a7766b99 5.08% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.12% [kernel] [k] memcpy 3.96% [xfs] [k] _xfs_buf_find 2.41% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] _raw_spin_unlock_irqrestore 1.61% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.44% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.31% [xfs] [k] xfs_buf_offset no symbols found in /usr/bin/iostat, maybe install a debug package? 1.16% [kernel] [k] memset 1.11% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_dir2_node_addname_int 1.01% [xfs] [k] xfs_btree_lookup 0.95% [xfs] [k] xfs_buf_item_size 0.90% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_perag_put time: 1341306629 PerfTop: 1399 irqs/sec kernel:88.3% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.12% [unknown] [.] 0x00007f84a7766b99 5.13% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.23% [kernel] [k] memcpy 4.03% [xfs] [k] _xfs_buf_find 2.37% [kernel] [k] kmem_cache_alloc 1.96% [kernel] [k] _raw_spin_lock_irqsave 1.69% [kernel] [k] _raw_spin_unlock_irqrestore 1.63% [kernel] [k] __d_lookup 1.50% [xfs] [k] xfs_next_bit 1.45% [kernel] [k] __kmalloc 1.35% [kernel] [k] kmem_cache_free 1.32% [xfs] [k] xfs_buf_offset 1.17% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.07% [xfs] [k] xfs_dir2_node_addname_int 1.07% [kernel] [k] link_path_walk 1.04% [xfs] [k] xfs_btree_lookup 1.02% [xfs] [k] xfs_buf_item_size 0.93% [xfs] [k] xfs_da_do_buf 0.84% [xfs] [k] xfs_buf_item_format time: 1341306629 PerfTop: 1225 irqs/sec kernel:87.5% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.15% [unknown] [.] 
0x00007f84a7766b99 5.02% [kernel] [k] _raw_spin_lock 4.85% [kernel] [k] page_fault 4.22% [kernel] [k] memcpy 4.19% [xfs] [k] _xfs_buf_find 2.32% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.71% [kernel] [k] __d_lookup 1.68% [kernel] [k] _raw_spin_unlock_irqrestore 1.54% [xfs] [k] xfs_next_bit 1.51% [kernel] [k] __kmalloc 1.36% [kernel] [k] kmem_cache_free 1.28% [xfs] [k] xfs_buf_offset 1.14% [kernel] [k] memset 1.09% [kernel] [k] kfree 1.06% [kernel] [k] link_path_walk 1.02% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.99% [xfs] [k] xfs_btree_lookup 0.92% [xfs] [k] xfs_da_do_buf 0.86% [kernel] [k] s_show time: 1341306629 PerfTop: 1400 irqs/sec kernel:87.4% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.23% [unknown] [.] 0x00007f84a7766b99 5.07% [kernel] [k] _raw_spin_lock 4.87% [kernel] [k] page_fault 4.27% [xfs] [k] _xfs_buf_find 4.18% [kernel] [k] memcpy 2.31% [kernel] [k] kmem_cache_alloc 1.94% [kernel] [k] _raw_spin_lock_irqsave 1.73% [kernel] [k] __d_lookup 1.66% [kernel] [k] _raw_spin_unlock_irqrestore 1.49% [xfs] [k] xfs_next_bit 1.49% [kernel] [k] __kmalloc 1.40% [kernel] [k] kmem_cache_free 1.29% [xfs] [k] xfs_buf_offset 1.11% [kernel] [k] kfree 1.07% [kernel] [k] memset 1.07% [kernel] [k] link_path_walk 1.05% [xfs] [k] xfs_dir2_node_addname_int 0.99% [xfs] [k] xfs_buf_item_size 0.97% [xfs] [k] xfs_btree_lookup 0.93% [xfs] [k] xfs_da_do_buf 0.89% [kernel] [k] s_show time: 1341306635 PerfTop: 1251 irqs/sec kernel:87.9% exact: 0.0% [1000Hz cycles], (all, 8 CPUs) ------------------------------------------------------------------------------- 11.20% [unknown] [.] 
            0x00007f84a7766b99
     5.10%  [kernel]   [k] _raw_spin_lock
     4.82%  [kernel]   [k] page_fault
     4.29%  [xfs]      [k] _xfs_buf_find
     4.19%  [kernel]   [k] memcpy
     2.26%  [kernel]   [k] kmem_cache_alloc
     1.87%  [kernel]   [k] _raw_spin_lock_irqsave
     1.83%  [kernel]   [k] __d_lookup
     1.64%  [kernel]   [k] _raw_spin_unlock_irqrestore
     1.53%  [kernel]   [k] __kmalloc
     1.49%  [xfs]      [k] xfs_next_bit
     1.41%  [kernel]   [k] kmem_cache_free
     1.32%  [xfs]      [k] xfs_buf_offset
     1.10%  [kernel]   [k] link_path_walk
     1.09%  [kernel]   [k] memset
     1.08%  [kernel]   [k] kfree
     1.03%  [xfs]      [k] xfs_dir2_node_addname_int
     0.99%  [xfs]      [k] xfs_buf_item_size
     0.98%  [xfs]      [k] xfs_btree_lookup
     0.96%  [kernel]   [k] s_show
     0.93%  [xfs]      [k] xfs_da_do_buf

   time: 1341306635
   PerfTop:    1429 irqs/sec  kernel:88.3%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

    11.18%  [unknown]  [.] 0x00007f84a7766b99
     5.13%  [kernel]   [k] _raw_spin_lock
     4.82%  [kernel]   [k] page_fault
     4.28%  [xfs]      [k] _xfs_buf_find
     4.21%  [kernel]   [k] memcpy
     2.23%  [kernel]   [k] kmem_cache_alloc
     1.90%  [kernel]   [k] _raw_spin_lock_irqsave
     1.83%  [kernel]   [k] __d_lookup
     1.67%  [kernel]   [k] _raw_spin_unlock_irqrestore
     1.52%  [kernel]   [k] __kmalloc
     1.52%  [xfs]      [k] xfs_next_bit
     1.36%  [kernel]   [k] kmem_cache_free
     1.34%  [xfs]      [k] xfs_buf_offset
     1.11%  [kernel]   [k] link_path_walk
     1.11%  [kernel]   [k] memset
     1.08%  [kernel]   [k] kfree
     1.03%  [xfs]      [k] xfs_buf_item_size
     1.03%  [xfs]      [k] xfs_dir2_node_addname_int
     1.01%  [kernel]   [k] s_show
     0.98%  [xfs]      [k] xfs_btree_lookup
     0.94%  [xfs]      [k] xfs_da_do_buf

[ further PerfTop snapshots between time 1341306635 and 1341306657 omitted:
  over the run the profile shifts away from page_fault and __d_lookup toward
  memcpy, the slab allocators and the XFS buffer/log functions, while the
  sample rate falls from ~1400 to ~90 irqs/sec; perf also warned
  "no symbols found in /bin/ps, maybe install a debug package?" ]

   time: 1341306657
   PerfTop:      88 irqs/sec  kernel:68.2%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

     5.69%  [kernel]   [k] memcpy
     4.42%  [xfs]      [k] _xfs_buf_find
     4.25%  [kernel]   [k] _raw_spin_lock
     2.98%  [unknown]  [.] 0x00007f84a7766b99

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 108+ messages in thread
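A profile comparable to the snapshots above can be captured non-interactively with perf record/report; this is a hedged sketch, not the exact mmtests invocation (the 1000Hz frequency matches the snapshots, but the 10-second window and the `top_symbol` helper are illustrative assumptions):

```shell
#!/bin/sh
# Sketch: approximate the PerfTop snapshots above with a system-wide
# cycles profile. Flags mirror "1000Hz cycles, (all, 8 CPUs)"; the 10s
# duration is an assumption.
if command -v perf >/dev/null 2>&1; then
    perf record -F 1000 -a -e cycles -o perf.data -- sleep 10 2>/dev/null
    perf report -i perf.data --stdio 2>/dev/null | head -25
fi

# Hypothetical helper: pull the symbol name out of a report line such as
# "5.10%  [kernel]  [k] _raw_spin_lock" (the symbol is the last field).
top_symbol() {
    awk '{ print $NF }'
}
```

Piping a report through `top_symbol` gives a bare list of symbols that is easy to diff between kernel versions.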
* Re: [MMTests] IO metadata on XFS
  2012-07-03 10:59             ` Mel Gorman
  (?)
@ 2012-07-03 11:44             ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 11:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
      dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> > Can you run latencytop to see
> > if there is excessive starvation/wait times for allocation
> > completion?
>
> I'm not sure what format you are looking for. latencytop is poorly suited
> to capturing information throughout a test and it does not easily allow you
> to record a snapshot of a test. You can record all the console output of
> course but that's a complete mess. I tried capturing /proc/latency_stats
> over time instead because that can be trivially sorted on a system-wide
> basis but as I write this I find that latency_stats was bust. It was just
> spitting out
>
> Latency Top version : v0.1
>
> and nothing else. Either latency_stats is broken or my config is. Not sure
> which it is right now and won't get enough time on this today to pinpoint it.
>

PEBKAC. The script that monitored /proc/latency_stats was not enabling
latencytop via /proc/sys/kernel/latencytop.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread
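The fix described above is small: /proc/latency_stats only accumulates records while latencytop accounting is switched on via sysctl. A minimal sketch follows; the two /proc paths come from the mail, while the sorting helper is a hypothetical convenience (latency_stats records are roughly "hits total-latency max-latency backtrace"):

```shell
#!/bin/sh
# Enable latencytop accounting; without this, /proc/latency_stats prints
# only its version header ("Latency Top version : v0.1").
enable_latencytop() {
    [ -w /proc/sys/kernel/latencytop ] && echo 1 > /proc/sys/kernel/latencytop
}

# Hypothetical helper: sort latency_stats-style records by their maximum
# latency column (field 3), highest first.
sort_by_max() {
    sort -k3,3nr
}

enable_latencytop || true
if [ -r /proc/latency_stats ]; then
    tail -n +2 /proc/latency_stats | sort_by_max | head
fi
```

Sampling this periodically from a monitor script gives a system-wide latency snapshot that is trivial to sort, which was the original goal.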
* Re: [MMTests] IO metadata on XFS
  2012-07-03 10:59             ` Mel Gorman
  (?)
@ 2012-07-03 12:31             ` Daniel Vetter
  -1 siblings, 0 replies; 108+ messages in thread
From: Daniel Vetter @ 2012-07-03 12:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, Christoph Hellwig, linux-mm, linux-kernel,
      linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
      Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > >
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > you have lots of CPU power, there will be little difference in
> > > > > > performance...
> > > > >
> > > > > When I checked it, it could only be called twice, and we'd already
> > > > > optimize away the second call. I'd definitely like to track down where
> > > > > the performance changes happened, at least to a major version but even
> > > > > better to a -rc or git commit.
> > > > >
> > > >
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > >
> > > > It's rare but on this occasion the test machine is idle so I started an
> > > > automated git bisection. As you know the mileage with an automated bisect
> > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > apparently unrelated patch that caused the problem.
> > > >
> > >
> > > It was obvious very quickly that there were two distinct regressions so I
> > > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > > that enables RC6 to reduce power usage.
> > >
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> >
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
> >
>
> The fact it has an impact at all is weird but let's see what the DRI
> folks think about it.

Well, presuming I understand things correctly, the CPU die only goes into
the lowest sleep state (which IIRC switches off L3 caches and
interconnects) when both the CPU and GPU are in their lowest sleep states.
RC6 is that deep-sleep state for the GPU, so without it enabled your
system won't go into these deep-sleep states.

I guess the slight changes in wakeup latency, power consumption (RC6 cuts
about 10W on an idle desktop SNB, with a resulting big effect on what turbo
boost can sustain for short amounts of time) and all the follow-on effects
are enough to massively change timing-critical things. So this having an
effect isn't too weird.

Obviously, if you also have X running while doing these tests there's the
chance that the GPU dies because of an issue when waking up from RC6
(we've known a few of these), but if no DRM client is up, that shouldn't
be possible. So please retest without X running if that hasn't been done
already.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 108+ messages in thread
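Whether RC6 was actually active on the test machine can be probed from userspace. The sketch below is a hedged guess at the relevant interfaces on 3.4-era kernels: both the `i915_enable_rc6` module parameter path and the debugfs `i915_drpc_info` file are assumptions and may not exist on other kernel versions.

```shell
#!/bin/sh
# Hedged sketch: check whether RC6 is enabled on an i915 machine of this
# era. Both file names are assumptions based on 3.4-era kernels.
param=/sys/module/i915/parameters/i915_enable_rc6
drpc=/sys/kernel/debug/dri/0/i915_drpc_info

# Interpret the assumed i915_enable_rc6 semantics: -1 driver default,
# 0 disabled, positive values enable RC6 variants.
rc6_mode() {
    case "$1" in
        -1) echo "driver default" ;;
        0)  echo "disabled" ;;
        *)  echo "enabled" ;;
    esac
}

[ -r "$param" ] && rc6_mode "$(cat "$param")"
[ -r "$drpc" ] && grep -i rc6 "$drpc"
true
```

Booting with `i915.i915_enable_rc6=0` on the suspected commit would separate the RC6 timing effect from the XFS change during a retest.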
* Re: [MMTests] IO metadata on XFS
  2012-07-03 12:31             ` Daniel Vetter
  (?)
@ 2012-07-03 13:08             ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 13:08 UTC (permalink / raw)
  To: Dave Chinner, Christoph Hellwig, linux-mm, linux-kernel,
      linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
      Chris Wilson

On Tue, Jul 03, 2012 at 02:31:19PM +0200, Daniel Vetter wrote:
> On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> > On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > > Adding dri-devel and a few others because an i915 patch contributed to
> > > > the regression.
> > > >
> > > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > > you have lots of CPU power, there will be little difference in
> > > > > > > performance...
> > > > > >
> > > > > > When I checked it, it could only be called twice, and we'd already
> > > > > > optimize away the second call. I'd definitely like to track down where
> > > > > > the performance changes happened, at least to a major version but even
> > > > > > better to a -rc or git commit.
> > > > > >
> > > > >
> > > > > By all means feel free to run the test yourself and run the bisection :)
> > > > >
> > > > > It's rare but on this occasion the test machine is idle so I started an
> > > > > automated git bisection. As you know the mileage with an automated bisect
> > > > > varies so it may or may not find the right commit. Test machine is sandy so
> > > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > > is the report of interest. The script is doing a full search between v3.3 and
> > > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000.
> > > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > > apparently unrelated patch that caused the problem.
> > > > >
> > > >
> > > > It was obvious very quickly that there were two distinct regressions so I
> > > > ran two bisections. One led to an XFS patch and the other to an i915 patch
> > > > that enables RC6 to reduce power usage.
> > > >
> > > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > >
> > > Doesn't seem to be the major cause of the regression. By itself, it
> > > has impact, but the majority comes from the XFS change...
> > >
> >
> > The fact it has an impact at all is weird but let's see what the DRI
> > folks think about it.
>
> Well, presuming I understand things correctly the cpu die only goes into
> the lowest sleep state (which iirc switches off l3 caches and
> interconnects) when both the cpu and gpu are in the lowest sleep state.

I made a mistake in my previous mail. gdm and X were *not* running. Once
the screen blanked I would guess the GPU is in a low sleep state the
majority of the time.

> rc6 is that deep-sleep state for the gpu, so without that enabled your
> system won't go into these deep-sleep states.
>
> I guess the slight changes in wakeup latency, power consumption (cuts
> about 10W on an idle desktop snb with resulting big effect on what turbo
> boost can sustain for short amounts of time) and all the follow-on effects
> are good enough to massively change timing-critical things.
>

Maybe. How aggressively is the lowest sleep state entered and how long
does it take to exit?

> So this having an effect isn't too weird.
>
> Obviously, if you also have X running while doing these tests there's the
> chance that the gpu dies because of an issue when waking up from rc6
> (we've known a few of these), but if no drm client is up, that shouldn't
> be possible. So please retest without X running if that hasn't been done
> already.
>

Again, sorry for the confusion but the posted results are without X
running.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 108+ messages in thread
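The automated bisection discussed in the thread can be sketched with `git bisect run`, which treats a zero exit status as "good". Only the 25000 files/sec threshold and the v3.3..v3.4 range come from the mails; the build/boot and fsmark helpers named below are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of the automated bisection described above: mark a commit bad
# when average fsmark files/sec drops below 25000.
THRESHOLD=25000

# Read files/sec values (one per line) on stdin; exit 0 iff the average
# meets the threshold, which git bisect run maps to good/bad.
avg_ok() {
    awk -v t="$THRESHOLD" '
        { sum += $1; n++ }
        END { exit !(n > 0 && sum / n >= t) }'
}

# Intended usage (not run here; helper names are hypothetical):
#   git bisect start v3.4 v3.3
#   git bisect run sh -c 'build_and_boot_kernel && run_fsmark_single | avg_ok'
```

Because two distinct regressions were present, a single threshold like this can converge on either offender, which is why two separate bisections were needed.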
* Re: [MMTests] IO metadata on XFS
2012-07-03 12:31 ` Daniel Vetter
@ 2012-07-03 13:28 ` Eugeni Dodonov
  -1 siblings, 0 replies; 108+ messages in thread
From: Eugeni Dodonov @ 2012-07-03 13:28 UTC (permalink / raw)
To: Mel Gorman, Dave Chinner, Christoph Hellwig, linux-mm, linux-kernel,
    linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Chris Wilson
Cc: Daniel Vetter

On Tue, Jul 3, 2012 at 9:31 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> Well, presuming I understand things correctly, the cpu die only goes into
> the lowest sleep state (which iirc switches off l3 caches and
> interconnects) when both the cpu and gpu are in the lowest sleep state.
> rc6 is that deep-sleep state for the gpu, so without that enabled your
> system won't go into these deep-sleep states.
>
> I guess the slight changes in wakeup latency, power consumption (cuts
> about 10W on an idle desktop snb with resulting big effect on what turbo
> boost can sustain for short amounts of time) and all the follow-on effects
> are good enough to massively change timing-critical things.

The sad side effect is that the software has very little control over RC6
entry and exit; the hardware enters and leaves the RC6 state on its own
when it detects that the GPU has been idle beyond a threshold. Chances are
that if you are not running any GPU workload, the GPU simply enters the
RC6 state and stays there.

It is possible to observe the current state, and also the time spent in
RC6, by looking at the /sys/kernel/debug/dri/0/i915_drpc_info file.

One other effect of RC6 is that it also allows the CPU to go into higher
turbo modes, as it has more watts to spend while the GPU is idle. Perhaps
this is what causes the issue here?

--
Eugeni Dodonov <http://eugeni.dodonov.net/>
* Re: [MMTests] IO metadata on XFS
2012-07-03 10:59 ` Mel Gorman
@ 2012-07-04  0:47 ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2012-07-04 0:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
    dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > >
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance.
> > > > > > Where you have lots of CPU power, there will be little difference
> > > > > > in performance...
> > > > >
> > > > > When I checked, it could only be called twice, and we'd already
> > > > > optimize away the second call. I'd definitely like to track down
> > > > > where the performance changes happened, at least to a major version
> > > > > but even better to a -rc or git commit.
> > > >
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > >
> > > > It's rare, but on this occasion the test machine is idle, so I started
> > > > an automated git bisection. As you know, the mileage with an automated
> > > > bisect varies, so it may or may not find the right commit. The test
> > > > machine is sandy, so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between
> > > > v3.3 and v3.4 for a point where average files/sec for fsmark-single
> > > > drops below 25000. I did not limit the search to fs/xfs on the
> > > > off-chance that an apparently unrelated patch caused the problem.
> > >
> > > It was obvious very quickly that there were two distinct regressions,
> > > so I ran two bisections. One led to an XFS patch and the other to an
> > > i915 patch that enables RC6 to reduce power usage.
> > >
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> >
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
>
> The fact it has an impact at all is weird, but let's see what the DRI
> folks think about it.
>
> > [c999a223: xfs: introduce an allocation workqueue]
> >
> > Which indicates that there are workqueue scheduling issues, I think.
> > The same amount of work is being done, but half of it is being
> > pushed off into a workqueue to avoid stack overflow issues (*). I
> > tested the above patch in anger on an 8p machine, similar to the
> > machine you saw no regressions on, but the workload didn't drive it
> > to being completely CPU bound (only about 90%) so the allocation
> > work was probably always scheduled quickly.
>
> What test were you using?

fsmark, dbench, compilebench, and a few fio workloads. Also, xfstests
times each test and I keep track of overall runtime, and none of those
showed any performance differential, either...

Indeed, running on a current 3.5-rc5 tree, my usual fsmark benchmarks
are running at the same numbers I've been seeing since about 3.0 -
somewhere around 18k files/s for a single thread, and 110-115k files/s
for 8 threads. I just ran your variant, and I'm getting about 20k
files/s for a single thread, which is about right because you're using
smaller directories than I am (22500 files per dir vs 100k in my tests).

> > How many worker threads have been spawned on these machines
> > that are showing the regression?
>
> 20 or 21 generally. An example list as spotted by top looks like

Pretty normal.

> > What is the context switch rate on the machines when the test is running?
.....
> Vanilla average context switch rate 4278.53
> Revert average context switch rate 1095

That seems about right, too.

> > Can you run latencytop to see if there is excessive starvation/wait
> > times for allocation completion?
>
> I'm not sure what format you are looking for.

Where the context switches are coming from, and how long they are being
stalled for. Just to get the context switch locations, you can use perf
on the sched:sched_switch event, but that doesn't give you stall times.
Local testing tells me that about 40% of the switches are from
xfs_alloc_vextent, 55% are from the work threads, and the rest are CPU
idling events, which is exactly as I'd expect.

> > A perf top profile comparison might be informative, too...
>
> I'm not sure if this is what you really wanted. I thought an oprofile or
> perf report would have made more sense but I recorded perf top over time
> anyway and it's at the end of the mail.

perf report and oprofile give you CPU usage across the run; it's not
instantaneous, and that's where all the interesting information is. e.g.
a 5% sample in a 20s profile might be 5% per second for 20s, or it might
be 100% for 1s - that's the behaviour run profiles cannot give you
insight into.... As it is, the output you posted is nothing unusual.

> For just these XFS tests I've uploaded a tarball of the logs to
> http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz

Ok, so the main thing I missed when first looking at this is that you
are concerned about single thread regressions. Well, I can't reproduce
your results here.
Single threaded, with or without the workqueue based allocation, gives me
roughly 20k +/- 0.5k files/s on a single disk, a 12 disk RAID0 array and
a RAM disk on an 8p/4GB RAM machine. That's the same results I've been
seeing since I wrote this patch almost 12 months ago....

So, given that this is a metadata intensive workload, the only extent
allocation is going to be through inode and directory block allocation.
These paths do not consume a large amount of stack, so we can easily tell
the allocator not to switch to the workqueue stack for these allocations.
The patch below does this. It completely removes all the allocation based
context switches from the no-data fsmark workloads being used for this
testing. It makes no noticeable difference to performance here, so I'm
interested in whether it solves the regression you are seeing on your
machines.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

xfs: don't defer metadata allocation to the workqueue

From: Dave Chinner <dchinner@redhat.com>

Almost all metadata allocations come from shallow stack usage
situations. Avoid the overhead of switching the allocation to a
workqueue as we are not in danger of running out of stack when making
these allocations. Metadata allocations are already marked through the
args that are passed down, so this is trivial to do.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_alloc.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index f654f51..4f33c32 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2434,13 +2434,22 @@ xfs_alloc_vextent_worker(
 	current_restore_flags_nested(&pflags, PF_FSTRANS);
 }
 
-
-int				/* error */
+/*
+ * Data allocation requests often come in with little stack to work on. Push
+ * them off to a worker thread so there is lots of stack to use. Metadata
+ * requests, OTOH, are generally from low stack usage paths, so avoid the
+ * context switch overhead here.
+ */
+int
 xfs_alloc_vextent(
-	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 
+	if (!args->userdata)
+		return __xfs_alloc_vextent(args);
+
+	args->done = &done;
 	INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
 	queue_work(xfs_alloc_wq, &args->work);
* Re: [MMTests] IO metadata on XFS @ 2012-07-04 0:47 ` Dave Chinner 0 siblings, 0 replies; 108+ messages in thread From: Dave Chinner @ 2012-07-04 0:47 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote: > On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote: > > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: > > > Adding dri-devel and a few others because an i915 patch contributed to > > > the regression. > > > > > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > > > > times per write(2) call, IIRC), so with limited numbers of > > > > > > threads/limited CPU power it will result in lower performance. Where > > > > > > you have lots of CPU power, there will be little difference in > > > > > > performance... > > > > > > > > > > When I checked it it could only be called twice, and we'd already > > > > > optimize away the second call. I'd defintively like to track down where > > > > > the performance changes happend, at least to a major version but even > > > > > better to a -rc or git commit. > > > > > > > > > > > > > By all means feel free to run the test yourself and run the bisection :) > > > > > > > > It's rare but on this occasion the test machine is idle so I started an > > > > automated git bisection. As you know the milage with an automated bisect > > > > varies so it may or may not find the right commit. Test machine is sandy so > > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > > > > is the report of interest. 
The script is doing a full search between v3.3 and > > > > v3.4 for a point where average files/sec for fsmark-single drops below 25000. > > > > I did not limit the search to fs/xfs on the off-chance that it is an > > > > apparently unrelated patch that caused the problem. > > > > > > > > > > It was obvious very quickly that there were two distinct regression so I > > > ran two bisections. One led to a XFS and the other led to an i915 patch > > > that enables RC6 to reduce power usage. > > > > > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] > > > > Doesn't seem to be the major cause of the regression. By itself, it > > has impact, but the majority comes from the XFS change... > > > > The fact it has an impact at all is weird but lets see what the DRI > folks think about it. > > > > [c999a223: xfs: introduce an allocation workqueue] > > > > Which indicates that there is workqueue scheduling issues, I think. > > The same amount of work is being done, but half of it is being > > pushed off into a workqueue to avoid stack overflow issues (*). I > > tested the above patch in anger on an 8p machine, similar to the > > machine you saw no regressions on, but the workload didn't drive it > > to being completely CPU bound (only about 90%) so the allocation > > work was probably always scheduled quickly. > > What test were you using? fsmark, dbench, compilebench, and a few fio workloads. Also, xfstests times each test and I keep track of overall runtime, and none of those showed any performance differential, either... Indeed, running on a current 3.5-rc5 tree, my usual fsmark benchmarks are running at the same numbers I've been seeing since about 3.0 - somewhere around 18k files/s for a single thread, and 110-115k files/s for 8 threads. I just ran your variant, and I'm getting about 20kfile/s for a single thread, which is about right because you're using smaller directories than I am (22500 files per dir vs 100k in my tests). 
> > How many worker threads have been spawned on these machines > > that are showing the regression? > > 20 or 21 generally. An example list as spotted by top looks like Pretty normal. > > What is the context switch rate on the machines whenteh test is running? ..... > Vanilla average context switch rate 4278.53 > Revert average context switch rate 1095 That seems about right, too. > > Can you run latencytop to see > > if there is excessive starvation/wait times for allocation > > completion? > > I'm not sure what format you are looking for. Where the context switches are coming from, and how long they are abeing stalled for. Just to get the context switch locations, you can use perf on the sched:sched_switch event, but that doesn't give you stall times. Local testing tells me that about 40% of the switches are from xfs_alloc_vextent, 55% are from the work threads, and the rest are CPU idling events, which is exactly as I'd expect. > > A pert top profile comparison might be informative, > > too... > > > > I'm not sure if this is what you really wanted. I thought an oprofile or > perf report would have made more sense but I recorded perf top over time > anyway and it's at the end of the mail. perf report and oprofile give you CPU usage across the run, it's not instantaneous and that's where all the interesting information is. e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or it might be 100% for 1s - that's the behaviour run profiles cannot give you insight into.... As it is, the output you posted is nothing unusual. > For just these XFS tests I've uploaded a tarball of the logs to > http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz Ok, so the main thing I missed when first looking at this is that you are concerned about single thread regressions. Well, I can't reproduce your results here. 
Single threaded with or without the workqueue based allocation gives me roughly 20k +/-0.5k files/s one a single disk, a 12 disk RAID0 array and a RAM disk on a 8p/4GB RAM machine. That's the same results I've been seeing since I wrote this patch almost 12 months ago.... So, given that this is a metadata intensive workload, the only extent allocation is going to be through inode and directory block allocation. These paths do not consume a large amount of stack, so we can tell the allocator not to switch to workqueue stack for these allocations easily. The patch below does this. It completely removes all the allocation based context switches from the no-data fsmark workloads being used for this testing. It makes no noticable difference to performance here, so I'm interested if it solves the regression you are seeing on your machines. Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: don't defer metadata allocation to the workqueue From: Dave Chinner <dchinner@redhat.com> Almost all metadata allocations come from shallow stack usage situations. Avoid the overhead of switching the allocation to a workqueue as we are not in danger of running out of stack when making these allocations. Metadata allocations are already marked through the args that are passed down, so this is trivial to do. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_alloc.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c index f654f51..4f33c32 100644 --- a/fs/xfs/xfs_alloc.c +++ b/fs/xfs/xfs_alloc.c @@ -2434,13 +2434,22 @@ xfs_alloc_vextent_worker( current_restore_flags_nested(&pflags, PF_FSTRANS); } - -int /* error */ +/* + * Data allocation requests often come in with little stack to work on. Push + * them off to a worker thread so there is lots of stack to use. Metadata + * requests, OTOH, are generally from low stack usage paths, so avoid the + * context switch overhead here. 
+ */
+int
 xfs_alloc_vextent(
-	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+	struct xfs_alloc_arg	*args)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 
+	if (!args->userdata)
+		return __xfs_alloc_vextent(args);
+
 	args->done = &done;
 	INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
 	queue_work(xfs_alloc_wq, &args->work);
* Re: [MMTests] IO metadata on XFS
  2012-07-04  0:47 ` Dave Chinner
@ 2012-07-04  9:51 ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 9:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, linux-mm, linux-kernel, linux-fsdevel, xfs,
    dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

On Wed, Jul 04, 2012 at 10:47:06AM +1000, Dave Chinner wrote:
> > > > <SNIP>
> > > > [c999a223: xfs: introduce an allocation workqueue]
> > >
> > > Which indicates that there are workqueue scheduling issues, I think.
> > > The same amount of work is being done, but half of it is being
> > > pushed off into a workqueue to avoid stack overflow issues (*). I
> > > tested the above patch in anger on an 8p machine, similar to the
> > > machine you saw no regressions on, but the workload didn't drive it
> > > to being completely CPU bound (only about 90%) so the allocation
> > > work was probably always scheduled quickly.
> >
> > What test were you using?
>
> fsmark, dbench, compilebench, and a few fio workloads. Also,
> xfstests times each test and I keep track of overall runtime, and
> none of those showed any performance differential, either...

Sound. I have some coverage on some of the same tests. When I get to them
I'll keep an eye on the 3.4 figures. It might be due to the disk I'm
using. It's a single disk and nothing to write home about in terms of
performance. It's not exactly XFS's usual target audience.

> Indeed, running on a current 3.5-rc5 tree, my usual fsmark
> benchmarks are running at the same numbers I've been seeing since
> about 3.0 - somewhere around 18k files/s for a single thread, and
> 110-115k files/s for 8 threads.
>
> I just ran your variant, and I'm getting about 20k files/s for a
> single thread, which is about right because you're using smaller
> directories than I am (22500 files per dir vs 100k in my tests).
> I had data for an fsmark-single test running with 30M files

FWIW the 3.4 performance figures were in line with 3.0 and later kernels.

> > > How many worker threads have been spawned on these machines
> > > that are showing the regression?
> >
> > 20 or 21 generally. An example list as spotted by top looks like
>
> Pretty normal.
>
> > > What is the context switch rate on the machines when the test is running?
> .....
> > Vanilla average context switch rate 4278.53
> > Revert  average context switch rate 1095
>
> That seems about right, too.
>
> > > Can you run latencytop to see if there is excessive starvation/wait
> > > times for allocation completion?
> >
> > I'm not sure what format you are looking for.
>
> Where the context switches are coming from, and how long they are
> being stalled for.

Noted. Capturing latency_stats over time is enough to do that. It won't
give a per-process breakdown but in the majority of cases that is not a
problem. Extracting the data is a bit annoying but not impossible and
better than parsing latencytop. Ideally, latencytop would be able to log
data in some sensible format. hmmm.

> Just to get the context switch locations, you
> can use perf on the sched:sched_switch event, but that doesn't give
> you stall times.

No, but both can be captured and roughly correlated with each other given
sufficient motivation.

> Local testing tells me that about 40% of the
> switches are from xfs_alloc_vextent, 55% are from the work threads,
> and the rest are CPU idling events, which is exactly as I'd expect.
>
> > > A perf top profile comparison might be informative, too...
> >
> > I'm not sure if this is what you really wanted. I thought an oprofile or
> > perf report would have made more sense but I recorded perf top over time
> > anyway and it's at the end of the mail.
>
> perf report and oprofile give you CPU usage across the run, it's not
> instantaneous and that's where all the interesting information is.
> e.g.
> a 5% sample in a 20s profile might be 5% per second for 20s, or
> it might be 100% for 1s - that's the behaviour run profiles cannot
> give you insight into....

Fair point. I'll fix up the timestamping and keep the monitor for future
reference.

> As it is, the output you posted is nothing unusual.

Grand. I had taken a look and saw nothing particularly unusual, but I
also was not 100% sure what I should be looking for.

> > For just these XFS tests I've uploaded a tarball of the logs to
> > http://www.csn.ul.ie/~mel/postings/xfsbisect-20120703/xfsbisect-logs.tar.gz
>
> Ok, so the main thing I missed when first looking at this is that
> you are concerned about single thread regressions.

In this specific test, yes. In the original data I posted I had threaded
benchmarks but they did not show the regression. This was a rerun with
just the single threaded case. Generally I run both because I see bug
reports involving both types of test.

> Well, I can't
> reproduce your results here. Single threaded with or without the
> workqueue based allocation gives me roughly 20k +/- 0.5k files/s on
> a single disk, a 12 disk RAID0 array and a RAM disk on an 8p/4GB RAM
> machine. That's the same results I've been seeing since I wrote
> this patch almost 12 months ago....
>
> So, given that this is a metadata intensive workload, the only
> extent allocation is going to be through inode and directory block
> allocation. These paths do not consume a large amount of stack, so
> we can tell the allocator not to switch to workqueue stack for these
> allocations easily.
>
> The patch below does this. It completely removes all the allocation
> based context switches from the no-data fsmark workloads being used
> for this testing. It makes no noticeable difference to performance
> here, so I'm interested if it solves the regression you are seeing
> on your machines.

It does. nodefer-metadata is just your patch applied on top of 3.4 and is
the right-most column. It's within the noise for the reverted patches and
approximately the same performance as 3.3. If you look at the timing
stats at the bottom you'll see that the patch brings the System time way
down, so consider this a

Tested-by: Mel Gorman <mgorman@suse.de>

FS-Mark Single Threaded
                 fsmark-single      single-3.4.0       single-3.4.0        single-3.4.0       single-3.4.0
                 3.4.0-vanilla      revert-aa464191    revert-c999a223     revert-both        nodefer-metadata
Files/s  min      14176.40 ( 0.00%)  17830.60 (25.78%)   24186.70 (70.61%)   25108.00 (77.11%)  25448.40 (79.51%)
Files/s  mean     16783.35 ( 0.00%)  25029.69 (49.13%)   37513.72 (123.52%)  38169.97 (127.43%) 36393.09 (116.84%)
Files/s  stddev    1007.26 ( 0.00%)   2644.87 (162.58%)   5344.99 (430.65%)   5599.65 (455.93%)  5961.48 (491.85%)
Files/s  max      18475.40 ( 0.00%)  27966.10 (51.37%)   45564.60 (146.62%)  47918.10 (159.36%) 47146.20 (155.18%)
Overhead min     593978.00 ( 0.00%) 386173.00 (34.99%)  253812.00 (57.27%)  247396.00 (58.35%) 248906.00 (58.10%)
Overhead mean    637782.80 ( 0.00%) 429229.33 (32.70%)  322868.20 (49.38%)  287141.73 (54.98%) 284274.93 (55.43%)
Overhead stddev   72440.72 ( 0.00%) 100056.96 (-38.12%) 175001.08 (-141.58%) 102018.14 (-40.83%) 114055.47 (-57.45%)
Overhead max     855637.00 ( 0.00%) 753541.00 (11.93%)  880531.00 (-2.91%)  637932.00 (25.44%) 710720.00 (16.94%)

MMTests Statistics: duration
Sys Time Running Test (seconds)         44.06   32.25   24.19   23.99   24.38
User+Sys Time Running Test (seconds)    50.19   36.35   27.24   26.70   27.12
Total Elapsed Time (seconds)            59.21   44.76   34.95   34.14   36.11

Thanks.

--
Mel Gorman
SUSE Labs
* Re: [MMTests] IO metadata on XFS
  2012-07-02 19:35 ` Mel Gorman
@ 2012-07-03 13:04 ` Mel Gorman
  0 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-03 13:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel, xfs,
    dri-devel, Keith Packard, Eugeni Dodonov, Daniel Vetter, Chris Wilson

On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > <SNIP>
>
> It was obvious very quickly that there were two distinct regressions, so
> I ran two bisections. One led to an XFS patch and the other to an i915
> patch that enables RC6 to reduce power usage.
>
> [c999a223: xfs: introduce an allocation workqueue]
> [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
>
> gdm was running on the machine so i915 would have been in use.

Bah, more PEBKAC. gdm was *not* running on this machine. i915 is loaded
but X is not.

--
Mel Gorman
SUSE Labs
* Re: [MMTests] IO metadata on XFS
  2012-07-03 13:04 ` Mel Gorman
@ 2012-07-03 14:04 ` Daniel Vetter
  0 siblings, 0 replies; 108+ messages in thread
From: Daniel Vetter @ 2012-07-03 14:04 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Dave Chinner, linux-mm, linux-kernel,
    linux-fsdevel, xfs, dri-devel, Keith Packard, Eugeni Dodonov,
    Daniel Vetter, Chris Wilson

On Tue, Jul 03, 2012 at 02:04:14PM +0100, Mel Gorman wrote:
> On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > <SNIP>
> >
> > It was obvious very quickly that there were two distinct regressions, so
> > I ran two bisections. One led to an XFS patch and the other to an i915
> > patch that enables RC6 to reduce power usage.
> >
> > [c999a223: xfs: introduce an allocation workqueue]
> > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> >
> > gdm was running on the machine so i915 would have been in use.
>
> Bah, more PEBKAC. gdm was *not* running on this machine. i915 is loaded
> but X is not.

See my little explanation of rc6: just loading the driver will have
effects. But I'm happy to know that the issue also happens without using
it, which makes it really unlikely that it's an issue with the gpu or
i915.ko ;-)

-Daniel
--
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48
* Re: [MMTests] IO metadata on XFS 2012-07-01 23:54 ` Dave Chinner (?) @ 2012-07-02 13:30 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-02 13:30 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-mm, linux-kernel, linux-fsdevel, xfs On Mon, Jul 02, 2012 at 09:54:58AM +1000, Dave Chinner wrote: > On Fri, Jun 29, 2012 at 12:25:06PM +0100, Mel Gorman wrote: > > Configuration: global-dhp__io-metadata-xfs > > Benchmarks: dbench3, fsmark-single, fsmark-threaded > > > > Summary > > ======= > > Most of the figures look good and in general there has been consistent good > > performance from XFS. However, fsmark-single is showing a severe performance > > dip in a few cases somewhere between 3.1 and 3.4. fs-mark running a single > > thread took a particularly bad dive in 3.4 for two machines that is worth > > examining closer. > > That will be caused by the fact we changed all the metadata updates > to be logged, which means a transaction every time .dirty_inode is > called. > Ok. > This should mostly go away when XFS is converted to use .update_time > rather than .dirty_inode to only issue transactions when the VFS > updates the atime rather than every .dirty_inode call... > Sound. I'll keep an eye out for it in the future. If you want to use the same test configuration then be sure you set the partition configuration correctly. For example, these are the values I used for config-global-dhp__io-metadata configuration file. export TESTDISK_PARTITION=/dev/sda6 export TESTDISK_FILESYSTEM=xfs export TESTDISK_MKFS_PARAM="-f -d agcount=8" export TESTDISK_MOUNT_ARGS=inode64,delaylog,logbsize=262144,nobarrier > > Unfortunately it is harder to easy conclusions as the > > gains/losses are not consistent between machines which may be related to > > the available number of CPU threads. 
>
> It increases the CPU overhead (dirty_inode can be called up to 4
> times per write(2) call, IIRC), so with limited numbers of
> threads/limited CPU power it will result in lower performance. Where
> you have lots of CPU power, there will be little difference in
> performance...

Thanks for that clarification.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 108+ messages in thread
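As a rough illustration of how those TESTDISK_* values are consumed, the
harness effectively formats and mounts the scratch partition before each
test. This is a hand-rolled dry-run sketch, not the actual mmtests shell
code, and the /mnt/testdisk mount point is an example:

```shell
# Dry-run sketch (hypothetical): print the commands mmtests effectively
# runs for the TESTDISK_* settings above. Drop the 'echo' prefixes to run
# for real (requires root and a dedicated scratch partition).
TESTDISK_PARTITION=/dev/sda6
TESTDISK_FILESYSTEM=xfs
TESTDISK_MKFS_PARAM="-f -d agcount=8"
TESTDISK_MOUNT_ARGS=inode64,delaylog,logbsize=262144,nobarrier

echo mkfs.$TESTDISK_FILESYSTEM $TESTDISK_MKFS_PARAM $TESTDISK_PARTITION
echo mount -o $TESTDISK_MOUNT_ARGS $TESTDISK_PARTITION /mnt/testdisk
```

The agcount and mount arguments matter for reproducing the results:
delaylog and nobarrier in particular change the metadata throughput
characteristics being measured.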
* [MMTests] Page reclaim performance on ext3 2012-06-29 11:19 ` Mel Gorman @ 2012-07-04 15:52 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 15:52 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__pagereclaim-performance-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

largedd is showing that in 3.0 reclaim started writing pages. This has
been "fixed" by deferring the writeback to flusher threads and immediately
reclaiming. This is mostly fixed now but it would be worth refreshing the
memory as to why this happened in the first place.

Slab shrinking is another area to pay attention to. Scanning in recent
kernels is consistent with earlier kernels but there are cases where the
number of inodes being reclaimed has changed significantly. Inode stealing
in one case is way higher since 3.1.

postmark detected that there was excessive swapping for some kernels
between 3.0 and 3.3 but it's not always reproducible and depends on both
the workload and the amount of available memory. It does indicate that 3.1
and 3.2 might have been very bad kernels for interactivity performance
under memory pressure.

largedd is showing in some cases that there was excessive swap activity in
3.1 and 3.2 kernels. This has been fixed but the fixes were not
backported.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner.
The intention is not to be exhaustive but to test basic reclaim patterns
that the VM should never get wrong. Regressions may also be due to changes
in the IO scheduler or underlying filesystem. The workloads are
predominately file-based. Anonymous page reclaim stress testing is covered
by another test.

mkfs was run on system startup. No attempt was made to age it. No special
mkfs or mount options were used.
postmark
o 15000 transactions
o File sizes ranged from 3096 bytes to 5M
o 100 subdirectories
o Total footprint approximately 4*TOTAL_RAM

This workload is a single-threaded benchmark intended to measure
filesystem performance for many short-lived and small files. Its primary
weakness is that it does no application processing and so the page aging
is basically on per-file granularity.

largedd
o Target size 8*TOTAL_RAM

This downloads a large file and makes copies with dd until the target
footprint size is reached.

fsmark
o Parallel directories were used
o 1 Thread per CPU
o 30M Filesize
o 16 directories
o 256 files per directory
o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-25458/1 -D 16 -N 256 -n 532 -L 15 -S0 -s 31457280
Thread: ./fs_mark -d /tmp/fsmark-28217/1 -d /tmp/fsmark-28217/2 -d /tmp/fsmark-28217/3 -d /tmp/fsmark-28217/4 -d /tmp/fsmark-28217/5 -d /tmp/fsmark-28217/6 -d /tmp/fsmark-28217/7 -d /tmp/fsmark-28217/8 -d /tmp/fsmark-28217/9 -d /tmp/fsmark-28217/10 -d /tmp/fsmark-28217/11 -d /tmp/fsmark-28217/12 -d /tmp/fsmark-28217/13 -d /tmp/fsmark-28217/14 -d /tmp/fsmark-28217/15 -d /tmp/fsmark-28217/16 -d /tmp/fsmark-28217/17 -d /tmp/fsmark-28217/18 -d /tmp/fsmark-28217/19 -d /tmp/fsmark-28217/20 -d /tmp/fsmark-28217/21 -d /tmp/fsmark-28217/22 -d /tmp/fsmark-28217/23 -d /tmp/fsmark-28217/24 -d /tmp/fsmark-28217/25 -d /tmp/fsmark-28217/26 -d /tmp/fsmark-28217/27 -d /tmp/fsmark-28217/28 -d /tmp/fsmark-28217/29 -d /tmp/fsmark-28217/30 -d /tmp/fsmark-28217/31 -d /tmp/fsmark-28217/32 -D 16 -N 256 -n 16 -L 15 -S0 -s 31457280

micro
o Total mapping size 10*TOTAL_RAM
o NR_CPU threads
o 5 iterations

This creates one process per CPU and creates a large file-backed mapping
up to the total mapping size. Each of the threads does a streaming read of
the mapping for a number of iterations. It then restarts with a streaming
write.
Completion time is the primary factor here but be careful as a good
completion time can be due to excessive reclaim which has an adverse
effect on many other workloads.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
Status: Generally good but pay attention to largedd
===========================================================

fsmark-single
-------------
Generally this is looking good. There was a small performance dip in 3.3
that has been recovered. All the reclaim was done from kswapd in all cases
and efficiency was high.

fsmark-threaded
---------------
Generally looking good although the higher variance in 3.4 is a surprise.
It is interesting to note that in 2.6.32 and 2.6.34 direct reclaim was
used and that this is no longer the case. kswapd efficiency is high
throughout as the scanning velocity is steady. A point of interest is that
slabs were not scanned in 3.4. This could be good or bad.

postmark
--------
postmark performance took a dip in 3.2 and 3.3 and this roughly correlates
with some other IO tests. It is very interesting to note that there was
swap activity in 3.1 and 3.2 that has since been resolved but is something
that -stable users might care about. Otherwise kswapd efficiency and
scanning rates look good.

largedd
-------
Figures generally look good here. All the reclaim is done by kswapd as
expected. It is very interesting to note that in 3.0 and 3.1 reclaim
started writing out pages and that it is now being deferred to flusher
threads and then immediately reclaimed.

micro
-----
Completion times generally look good and are improving. Some direct
reclaim is happening but at a steady percentage on each release.
Efficiency is much lower than other workloads but at least it is
consistent.
Slab shrinking may need examination. It clearly shows that scanning is
taking place but fewer inodes are being reclaimed in recent kernels. This
is not necessarily bad but worth paying attention to.

One point of concern is that kswapd CPU usage is high in recent kernels
for this test. The graph is a complete mess and needs closer examination.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Ok but postmark shows high inode steals
==========================================================

fsmark-single
-------------
This is all over the place. 3.0 was bad, 3.1 was good but performance has
declined since then. The reclaim statistics look ok although slabs are
being scanned without being reclaimed which is curious. kswapd is scanning
less but that is not necessarily the cause of the problem.

fsmark-threaded
---------------
In contrast to the single-threaded case, this is looking steady although
overall kernels are slower than 2.6.32 was. Direct reclaim velocity is
slightly higher but still a small percentage of overall scanning. It's
cause for some concern but not an immediate panic as reclaim efficiency is
high.

postmark
--------
postmark performance took a serious dip in 3.2 and 3.3 and while it has
recovered a bit in 3.4, it's still far below the peak seen in 3.1. For the
most part, reclaim statistics look ok with the exception of slab
shrinking. Inode stealing is way up in 3.1 and later kernels.

largedd
-------
Recent figures look good but in 3.1 and 3.2 there was excessive swapping.
This was matched by pages reclaimed by direct reclaim in the same kernels
and this likely shot interactive performance to hell when doing large
copies on those kernels.
This roughly matches other bug reports so it's interesting but clearly the
patches did not get backported to -stable.

micro
-----
Completion times look generally good. 3.1 and 3.2 show better times but as
it is matched by excessive amounts of reclaim it is not likely to be a
good trade-off so let's not "fix" that.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext3/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status:
==========================================================

fsmark-single
-------------
There was a performance dip in 3.4 but otherwise it looks ok. Reclaim
stats look fine.

fsmark-threaded
---------------
Other than something crazy happening to variance in 3.1, figures look ok.
There is some direct reclaim going on but the efficiency is high. There
are 8 threads on this machine and only one disk so some stalling due to
direct reclaim would be expected if IO was not keeping up.

postmark
--------
postmark performance took a serious dip in 3.1 and 3.2, which is a kernel
version earlier than the same dip seen on hydra. No explanation for this
but it is matched by swap activity in the same kernels so it might be an
indication there was general swapping-related damage in the 3.0-3.3
time-frame.

largedd
-------
Recent figures look good. Again, some swap activity is visible in 3.1 and
3.2 that has since been fixed but obviously not backported.

micro
-----
Okish I suppose. Completion times have improved but are all over the place
and, as seen elsewhere, good completion times on micro can sometimes be
due to excessive reclaim and are not necessarily a good thing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 108+ messages in thread
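The largedd workload in the benchmark notes above ("downloads a large file
and makes copies with dd until the target footprint size is reached")
boils down to something like the following. This is a hand-rolled sketch,
not the actual mmtests driver; the function name and copy naming are
illustrative:

```shell
# Sketch of the largedd pattern (illustrative, not the mmtests driver):
# keep making dd copies of a source file until the total size of the
# copies reaches a target footprint (8*TOTAL_RAM in the real config).
largedd() {
	src=$1
	target=$2                     # target footprint in bytes
	size=$(stat -c %s "$src")     # size of one copy
	copied=0
	i=0
	while [ "$copied" -lt "$target" ]; do
		dd if="$src" of="$src.copy.$i" bs=1M 2>/dev/null
		copied=$((copied + size))
		i=$((i + 1))
	done
}

# e.g. largedd largefile.img $((8 * $(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)))
```

Because the footprint is a multiple of RAM, page reclaim is guaranteed to
kick in well before the copies finish, which is the point of the workload.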
* [MMTests] Page reclaim performance on ext4 2012-06-29 11:19 ` Mel Gorman @ 2012-07-04 15:53 ` Mel Gorman -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-04 15:53 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-ext4

Configuration: global-dhp__pagereclaim-performance-ext4
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

fsmark is showing a performance dip in 3.4 on a 32-bit machine that is not
matched by results elsewhere.

A number of tests show that there was swapping activity in 3.1 and 3.2
which should have been completely unnecessary. It has been fixed but the
fixes should have been backported as some were based on reports of poor
interactivity performance during IO.

One largedd test on hydra showed that a number of dirty pages are reaching
the end of the LRU. This is not visible on other tests but should be
monitored.

I added linux-ext4 to the list because there are some performance drops
that are not visible elsewhere so may be filesystem-specific.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner.
The intention is not to be exhaustive but to test basic reclaim patterns
that the VM should never get wrong. Regressions may also be due to changes
in the IO scheduler or underlying filesystem. The workloads are
predominately file-based. Anonymous page reclaim stress testing is covered
by another test.

mkfs was run on system startup. No attempt was made to age it. No special
mkfs or mount options were used.

postmark
o 15000 transactions
o File sizes ranged from 3096 bytes to 5M
o 100 subdirectories
o Total footprint approximately 4*TOTAL_RAM

This workload is a single-threaded benchmark intended to measure
filesystem performance for many short-lived and small files.
Its primary weakness is that it does no application processing and so the
page aging is basically on per-file granularity.

largedd
o Target size 8*TOTAL_RAM

This downloads a large file and makes copies with dd until the target
footprint size is reached.

fsmark
o Parallel directories were used
o 1 Thread per CPU
o 30M Filesize
o 16 directories
o 256 files per directory
o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-25458/1 -D 16 -N 256 -n 532 -L 15 -S0 -s 31457280
Thread: ./fs_mark -d /tmp/fsmark-28217/1 -d /tmp/fsmark-28217/2 -d /tmp/fsmark-28217/3 -d /tmp/fsmark-28217/4 -d /tmp/fsmark-28217/5 -d /tmp/fsmark-28217/6 -d /tmp/fsmark-28217/7 -d /tmp/fsmark-28217/8 -d /tmp/fsmark-28217/9 -d /tmp/fsmark-28217/10 -d /tmp/fsmark-28217/11 -d /tmp/fsmark-28217/12 -d /tmp/fsmark-28217/13 -d /tmp/fsmark-28217/14 -d /tmp/fsmark-28217/15 -d /tmp/fsmark-28217/16 -d /tmp/fsmark-28217/17 -d /tmp/fsmark-28217/18 -d /tmp/fsmark-28217/19 -d /tmp/fsmark-28217/20 -d /tmp/fsmark-28217/21 -d /tmp/fsmark-28217/22 -d /tmp/fsmark-28217/23 -d /tmp/fsmark-28217/24 -d /tmp/fsmark-28217/25 -d /tmp/fsmark-28217/26 -d /tmp/fsmark-28217/27 -d /tmp/fsmark-28217/28 -d /tmp/fsmark-28217/29 -d /tmp/fsmark-28217/30 -d /tmp/fsmark-28217/31 -d /tmp/fsmark-28217/32 -D 16 -N 256 -n 16 -L 15 -S0 -s 31457280

micro
o Total mapping size 10*TOTAL_RAM
o NR_CPU threads
o 5 iterations

This creates one process per CPU and creates a large file-backed mapping
up to the total mapping size. Each of the threads does a streaming read of
the mapping for a number of iterations. It then restarts with a streaming
write.

Completion time is the primary factor here but be careful as a good
completion time can be due to excessive reclaim which has an adverse
effect on many other workloads.
===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
Status: Generally good but swapping in 3.1 and 3.2
===========================================================

fsmark-single
-------------
There was a performance dip in 3.4 that is not visible for ext3 and so may
be filesystem-specific. From a reclaim perspective the figures look ok.

fsmark-threaded
---------------
This also shows a large performance dip in 3.4 and an increase in variance
but again the reclaim stats look ok.

postmark
--------
As seen on ext3 there was a performance dip for kernels 3.1 to 3.3 that
has not quite been recovered in 3.4. There was also swapping activity for
the 3.1 and 3.2 kernels which may partially explain the problem.

largedd
-------
Completion times are mixed. Again some swapping activity is visible in 3.1
and 3.2 which has been resolved but not backported.

micro
-----
Completion figures are not looking bad for 3.4.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Ok but postmark shows high inode steals
==========================================================

fsmark-single
-------------
Performance was declining until 3.1 and has been steady ever since. A
minor amount of swapping is visible in 3.1.

fsmark-threaded
---------------
Surprisingly steady performance but recent kernels show some direct
reclaim is going on. It's a tiny percentage and lower than it has been
historically but keep an eye on it.

postmark
--------
As with other tests 3.1 saw a performance dip and swapping activity.
3.2 was also swapping although performance was not affected. While the
reclaim figures currently look ok, the actual performance sucks. As the
same is not visible on ext3, this may be a filesystem problem.

largedd
-------
Completion figures look good but, as before, there was swapping in 3.1 and
3.2. What is of concern is that pages reclaimed by PageReclaim are
excessively high in 3.3 and 3.4. This implies that a large number of dirty
pages are reaching the end of the LRU and this can be a problem. Minimally
it increases kswapd CPU usage but it can also indicate a flushing problem.

micro
-----
Looks ok.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status: Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
Steady performance throughout. Tiny swap activity is visible on 3.1 and
3.2 which caused no harm but does correlate with other tests.

fsmark-threaded
---------------
Also steady with the exception of 3.2 which is bizarre from a reclaim
perspective. There was no direct reclaim scanning but a lot of inodes were
reclaimed.

postmark
--------
Same performance dip in 3.1 and 3.2, accompanied by the same swapping
problem.

largedd
-------
Completion figures generally look ok although again 3.1 is bad from a
swapping and direct reclaim perspective.

micro
-----
Looks ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 108+ messages in thread
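For reference, the postmark parameters listed in the benchmark notes
(15000 transactions, file sizes from 3096 bytes to 5M, 100 subdirectories)
map onto postmark's simple "set" command language. The following sketch
generates such a command file; the location and the omission of an initial
file count are illustrative guesses, not values taken from the mmtests
configuration:

```shell
# Hypothetical postmark command file matching the parameters above.
# postmark is driven by "set" commands read from a file or stdin; the
# location here is illustrative.
cat > postmark.cfg <<EOF
set location /mnt/testdisk
set subdirectories 100
set size 3096 5242880
set transactions 15000
run
quit
EOF
# Then run it with: postmark postmark.cfg
```

The upper size bound of 5242880 bytes is 5M expressed in bytes, matching
the "3096 bytes to 5M" range in the notes.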
* [MMTests] Page reclaim performance on ext4
From: Mel Gorman @ 2012-07-04 15:53 UTC
To: linux-mm; +Cc: linux-kernel, linux-ext4

Configuration: global-dhp__pagereclaim-performance-ext4
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

fsmark is showing a performance dip in 3.4 on a 32-bit machine that is
not matched by results elsewhere. A number of tests show that there was
swapping activity in 3.1 and 3.2 that should have been completely
unnecessary. It has since been fixed, but the fixes should have been
backported as some were based on reports of poor interactivity
performance during IO. One largedd test on hydra showed that a number of
dirty pages are reaching the end of the LRU. This is not visible in
other tests but should be monitored. I added linux-ext4 to the list
because there are some performance drops that are not visible elsewhere
and so may be filesystem-specific.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner.
The intention is not to be exhaustive but to test basic reclaim patterns
that the VM should never get wrong. Regressions may also be due to
changes in the IO scheduler or underlying filesystem. The workloads are
predominantly file-based. Anonymous page reclaim stress testing is
covered by another test.

mkfs was run on system startup and no attempt was made to age the
filesystem. No special mkfs or mount options were used.

postmark
o 15000 transactions
o File sizes ranged from 3096 bytes to 5MB
o 100 subdirectories
o Total footprint approximately 4*TOTAL_RAM

This workload is a single-threaded benchmark intended to measure
filesystem performance for many short-lived, small files.
Its primary weakness is that it does no application processing, so page
aging is basically at per-file granularity.

largedd
o Target size 8*TOTAL_RAM

This downloads a large file and makes copies of it with dd until the
target footprint size is reached.

fsmark
o Parallel directories were used
o 1 thread per CPU
o 30M file size
o 16 directories
o 256 files per directory
o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-25458/1 -D 16 -N 256 -n 532 -L 15 -S0 -s 31457280
Thread: ./fs_mark -d /tmp/fsmark-28217/1 -d /tmp/fsmark-28217/2 -d /tmp/fsmark-28217/3 -d /tmp/fsmark-28217/4 -d /tmp/fsmark-28217/5 -d /tmp/fsmark-28217/6 -d /tmp/fsmark-28217/7 -d /tmp/fsmark-28217/8 -d /tmp/fsmark-28217/9 -d /tmp/fsmark-28217/10 -d /tmp/fsmark-28217/11 -d /tmp/fsmark-28217/12 -d /tmp/fsmark-28217/13 -d /tmp/fsmark-28217/14 -d /tmp/fsmark-28217/15 -d /tmp/fsmark-28217/16 -d /tmp/fsmark-28217/17 -d /tmp/fsmark-28217/18 -d /tmp/fsmark-28217/19 -d /tmp/fsmark-28217/20 -d /tmp/fsmark-28217/21 -d /tmp/fsmark-28217/22 -d /tmp/fsmark-28217/23 -d /tmp/fsmark-28217/24 -d /tmp/fsmark-28217/25 -d /tmp/fsmark-28217/26 -d /tmp/fsmark-28217/27 -d /tmp/fsmark-28217/28 -d /tmp/fsmark-28217/29 -d /tmp/fsmark-28217/30 -d /tmp/fsmark-28217/31 -d /tmp/fsmark-28217/32 -D 16 -N 256 -n 16 -L 15 -S0 -s 31457280

micro
o Total mapping size 10*TOTAL_RAM
o NR_CPU threads
o 5 iterations

This creates one process per CPU and a large file-backed mapping up to
the total mapping size. Each thread does a streaming read of the mapping
for a number of iterations and then restarts with a streaming write.
Completion time is the primary metric here, but be careful: a good
completion time can be due to excessive reclaim, which has an adverse
effect on many other workloads.
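For illustration, the access pattern of the micro workload can be
sketched as follows. This is a minimal single-process stand-in, not the
mmtests implementation: the path, the small size (in place of
10*TOTAL_RAM) and the single process (in place of NR_CPU threads) are
assumptions.

```python
import mmap
import os

PAGE = mmap.PAGESIZE

def micro(path, size, iterations=5):
    """Map a file of 'size' bytes, do streaming read passes touching one
    byte per page in order, then streaming write passes over it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    os.ftruncate(fd, size)  # size the file backing the mapping
    with mmap.mmap(fd, size, mmap.MAP_SHARED,
                   mmap.PROT_READ | mmap.PROT_WRITE) as m:
        sink = 0
        for _ in range(iterations):          # streaming read passes
            for off in range(0, size, PAGE):
                sink += m[off]
        for i in range(iterations):          # then streaming write passes
            for off in range(0, size, PAGE):
                m[off] = i & 0xFF
    os.close(fd)
    return sink
```

With a mapping much larger than RAM, the read passes force the VM to
continually reclaim clean file pages while the write passes add dirty
pages to the mix.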
===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
Status: Generally good but swapping in 3.1 and 3.2
===========================================================

fsmark-single
-------------
There was a performance dip in 3.4 that is not visible on ext3 and so
may be filesystem-specific. From a reclaim perspective the figures look
ok.

fsmark-threaded
---------------
This also shows a large performance dip in 3.4 and an increase in
variance but, again, the reclaim stats look ok.

postmark
--------
As seen on ext3, there was a performance dip for kernels 3.1 to 3.3 that
has not quite been recovered in 3.4. There was also swapping activity
for the 3.1 and 3.2 kernels, which may partially explain the problem.

largedd
-------
Completion times are mixed. Again, some swapping activity is visible in
3.1 and 3.2, which has since been resolved but not backported.

micro
-----
Completion figures do not look bad for 3.4.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Ok but postmark shows high inode steals
==========================================================

fsmark-single
-------------
Performance was declining until 3.1 and has been steady ever since.
Minor amounts of swapping are visible in 3.1.

fsmark-threaded
---------------
Surprisingly steady performance, but recent kernels show some direct
reclaim going on. It's a tiny percentage and lower than it has been
historically, but keep an eye on it.

postmark
--------
As with other tests, 3.1 saw a performance dip and swapping activity.
3.2 was also swapping, although performance was not affected. While the
reclaim figures currently look ok, the actual performance sucks. As the
same is not visible on ext3, this may be a filesystem problem.

largedd
-------
Completion figures look good but, as before, there was swapping in 3.1
and 3.2. Of concern is that the pages reclaimed via PageReclaim are
excessively high in 3.3 and 3.4. This implies that a large number of
dirty pages are reaching the end of the LRU, which can be a problem. At
a minimum it increases kswapd CPU usage, but it can also indicate a
flushing problem.

micro
-----
Looks ok.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-ext4/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status: Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
Steady performance throughout. Tiny amounts of swap activity are visible
on 3.1 and 3.2, which caused no harm but do correlate with other tests.

fsmark-threaded
---------------
Also steady, with the exception of 3.2, which is bizarre from a reclaim
perspective: there was no direct reclaim scanning, but a lot of inodes
were reclaimed.

postmark
--------
The same performance dip in 3.1 and 3.2, accompanied by the same
swapping problem.

largedd
-------
Completion figures generally look ok, although again 3.1 is bad from a
swapping and direct reclaim perspective.

micro
-----
Looks ok.

-- 
Mel Gorman
SUSE Labs
* [MMTests] Page reclaim performance on xfs
From: Mel Gorman @ 2012-07-04 15:53 UTC
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__pagereclaim-performance-xfs
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

For the most part this is looking good. There is excessive swapping
visible on 3.1 and 3.2, as seen elsewhere. There is a concern with the
postmark figures: they show that we started entering direct reclaim in
3.1 on hydra and, while the swapping problem has been addressed, we are
still using direct reclaim and in some cases it is quite a high
percentage. This did not happen on older kernels.

Benchmark notes
===============

Each of the benchmarks triggers page reclaim in a fairly simple manner.
The intention is not to be exhaustive but to test basic reclaim patterns
that the VM should never get wrong. Regressions may also be due to
changes in the IO scheduler or underlying filesystem. The workloads are
predominantly file-based. Anonymous page reclaim stress testing is
covered by another test.

mkfs was run on system startup with parameters -f -d agcount=8. Mount
options were inode64,delaylog,logbsize=262144,nobarrier for the most
part. On kernels too old to support delaylog, it was removed. On kernels
where it was the default, it was specified and the warning ignored.

postmark
o 15000 transactions
o File sizes ranged from 3096 bytes to 5MB
o 100 subdirectories
o Total footprint approximately 4*TOTAL_RAM

This workload is a single-threaded benchmark intended to measure
filesystem performance for many short-lived, small files. Its primary
weakness is that it does no application processing, so page aging is
basically at per-file granularity.
largedd
o Target size 8*TOTAL_RAM

This downloads a large file and makes copies of it with dd until the
target footprint size is reached.

fsmark
o Parallel directories were used
o 1 thread per CPU
o 30M file size
o 16 directories
o 256 files per directory
o TOTAL_RAM_IN_BYTES/FILESIZE files per iteration
o 15 iterations

Single: ./fs_mark -d /tmp/fsmark-25458/1 -D 16 -N 256 -n 532 -L 15 -S0 -s 31457280
Thread: ./fs_mark -d /tmp/fsmark-28217/1 -d /tmp/fsmark-28217/2 -d /tmp/fsmark-28217/3 -d /tmp/fsmark-28217/4 -d /tmp/fsmark-28217/5 -d /tmp/fsmark-28217/6 -d /tmp/fsmark-28217/7 -d /tmp/fsmark-28217/8 -d /tmp/fsmark-28217/9 -d /tmp/fsmark-28217/10 -d /tmp/fsmark-28217/11 -d /tmp/fsmark-28217/12 -d /tmp/fsmark-28217/13 -d /tmp/fsmark-28217/14 -d /tmp/fsmark-28217/15 -d /tmp/fsmark-28217/16 -d /tmp/fsmark-28217/17 -d /tmp/fsmark-28217/18 -d /tmp/fsmark-28217/19 -d /tmp/fsmark-28217/20 -d /tmp/fsmark-28217/21 -d /tmp/fsmark-28217/22 -d /tmp/fsmark-28217/23 -d /tmp/fsmark-28217/24 -d /tmp/fsmark-28217/25 -d /tmp/fsmark-28217/26 -d /tmp/fsmark-28217/27 -d /tmp/fsmark-28217/28 -d /tmp/fsmark-28217/29 -d /tmp/fsmark-28217/30 -d /tmp/fsmark-28217/31 -d /tmp/fsmark-28217/32 -D 16 -N 256 -n 16 -L 15 -S0 -s 31457280

micro
o Total mapping size 10*TOTAL_RAM
o NR_CPU threads
o 5 iterations

This creates one process per CPU and a large file-backed mapping up to
the total mapping size. Each thread does a streaming read of the mapping
for a number of iterations and then restarts with a streaming write.
Completion time is the primary metric here, but be careful: a good
completion time can be due to excessive reclaim, which has an adverse
effect on many other workloads.
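As a rough illustration, the largedd copy loop can be sketched as
follows. The paths and the use of shutil.copyfile in place of dd are
assumptions for the sketch, not what mmtests itself does.

```python
import os
import shutil

def largedd(source, workdir, target_size):
    """Copy 'source' repeatedly into 'workdir' until the combined
    footprint of the copies reaches target_size bytes."""
    os.makedirs(workdir, exist_ok=True)
    size = os.path.getsize(source)
    copies = []
    copied = 0
    while copied < target_size:
        dest = os.path.join(workdir, "copy-%d" % len(copies))
        shutil.copyfile(source, dest)   # stands in for dd in the real test
        copies.append(dest)
        copied += size
    return copies
```

With a target of 8*TOTAL_RAM, the copies guarantee the page cache is
filled and reclaim must run for the workload to complete.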
===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
Status: Generally good but postmark shows direct reclaim
===========================================================

fsmark-single
-------------
Generally good, with steady performance throughout. There was direct
reclaim activity in early kernels but not any more.

fsmark-threaded
---------------
Generally good as well.

postmark
--------
This is interesting. There was a mild performance dip in 3.2 but, while
the excessive swapping is visible in 3.1 and 3.2 as seen on other tests,
it did not translate into a performance drop. What is of concern is that
direct reclaim figures are still high for recent kernels, even though
they are not swapping.

largedd
-------
Completion times have suffered a little and the usual swapping in 3.1
and 3.2 is visible, but it's tiny.

micro
-----
Looking great.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
Generally good; swapping in 3.1 and 3.2.

fsmark-threaded
---------------
Generally good; no direct reclaim on recent kernels.

postmark
--------
Looking great other than the swapping in 3.1 and 3.2, which again does
not appear to translate into a performance drop. Direct reclaim started
around kernel 3.1 and has not eased off. It's a sizable percentage.

largedd
-------
Completion times are ok; the swapping on 3.1 and 3.2 is not.

micro
-----
Ok.
==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__pagereclaim-performance-xfs/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status: Generally ok, but swapping in 3.1 and 3.2
==========================================================

fsmark-single
-------------
Generally good. No swapping is visible on 3.1 and 3.2, but this machine
also has more memory.

fsmark-threaded
---------------
Generally good, although 3.2 showed that a lot of inodes were reclaimed.
This matches a similar test on ext4, so something odd happened there.

postmark
--------
Looking great for performance, although there is some swapping in 3.1
and 3.2 and direct reclaim scanning is still high.

largedd
-------
Completion times look good.

micro
-----
Looks ok.

-- 
Mel Gorman
SUSE Labs
* [MMTests] Interactivity during IO on ext3
From: Mel Gorman @ 2012-07-05 14:56 UTC
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-interactive-performance-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

There are some terrible results in here that might explain some of the
interactivity mess if the distribution defaulted to ext3 or it was
chosen by the user for any reason. In some cases average read latency
has doubled, tripled and in one case almost quadrupled since 2.6.32.
Worse, we are not consistently good or bad. I see patterns like great
release, bad release, good release, bad again, etc.

Benchmark notes
===============

NOTE: This configuration is new and very experimental. This is my first
time looking at the results of this type of test, so flaws are
inevitable. There is ample scope for improvement, but I had to start
somewhere.

This configuration is very different in that it is trying to analyse the
impact of IO on interactive performance. Some interactivity problems are
due to an application trying to read() cache-cold data such as
configuration files or cached images. If there is a lot of IO going on,
the application may stall while this happens. This is a limited scenario
for measuring interactivity, but a common one.

These tests are fairly standard except that there is a background
application running in parallel. It begins by creating a 100M file and
using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is
complete, it tries to read 1M from the file every few seconds and
records the latency. When it reaches the end of the file, it dumps it
from cache and starts again. This latency is a *proxy* measure of
interactivity, not a true measure.
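A rough sketch of what such a background monitor looks like is below.
The 2-second interval, the zero-filled file contents and the helper
names are assumptions for illustration; the actual mmtests monitor may
differ in its details.

```python
import os
import time

FILE_SIZE = 100 * 1024 * 1024   # 100M file, as described above
READ_SIZE = 1024 * 1024         # 1M per read
INTERVAL = 2                    # seconds between reads (assumed value)

def create_cold_file(path, size):
    """Create the file and evict it from the page cache."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    os.write(fd, b"\0" * size)
    os.fsync(fd)  # dirty pages must be written back before eviction works
    os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)
    return fd

def monitor(path, file_size=FILE_SIZE, read_size=READ_SIZE,
            interval=INTERVAL, passes=1):
    """Read read_size bytes every 'interval' seconds, recording latency.
    At end-of-file, evict the file from cache and start again."""
    fd = create_cold_file(path, file_size)
    latencies = []
    offset = 0
    while passes:
        start = time.monotonic()
        os.pread(fd, read_size, offset)
        latencies.append(time.monotonic() - start)
        offset += read_size
        if offset >= file_size:
            os.posix_fadvise(fd, 0, file_size, os.POSIX_FADV_DONTNEED)
            offset = 0
            passes -= 1
        if interval:
            time.sleep(interval)
    os.close(fd)
    return latencies
```

Because every read is cache-cold, each latency sample reflects how
quickly the block layer services a small read while the foreground
workload is generating heavy IO.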
A variation would be to measure the time for small writes, either for
applications that log data or for applications like gnome-terminal that
do small writes to /tmp as part of their buffer management. The main
strength is that if we get this basic case wrong, then the complex cases
are almost certainly screwed as well.

There are two areas to pay attention to. One is completion time and how
it is affected by the small reads taking place in parallel. A
comprehensive analysis would show exactly how much the workload is
affected by a parallel read, but right now I'm just looking at wall
time. The second area is the read latencies, paying particular attention
to the average and maximum latencies. The variations are harder to draw
decent conclusions from. A sensible option would be to plot a CDF to get
a better idea of the probability of a given read latency, but for now
that's a TODO item. As it is, the graphs are barely usable and I'll be
giving that more thought.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

fsmark-single
-------------
Completion times since 3.2 have been badly affected, which coincides
with the introduction of IO-less dirty page throttling. 3.3 was
particularly bad.

2.6.32 was TERRIBLE in terms of read latencies, with both the average
and maximum latencies looking awful. The 90th percentile was close to 4
seconds and, as a result, the graphs are even more of a complete mess
than they might have been otherwise.

Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
reads were below 206ms, but in 3.2 this had grown to 273ms. The latency
of the remaining 5% increased from 481ms to 774ms. 3.4 is looking better
at least.
fsmark-threaded
---------------
With multiple writers, completion times have been affected and again 3.2
showed a big increase. Again, 2.6.32 is a complete disaster and mucks up
all the graphs. Otherwise, our average read latencies do not look too
bad. However, our worst-case latencies look pretty bad: kernel 3.2 shows
that at worst a read() can take 4.3 seconds when there are multiple
parallel writers. This must be fairly rare, as 99% of the latencies were
below 1 second, but an occasional 4-second stall in an application would
feel pretty bad. Maximum latencies have improved a bit in 3.4 but are
still around half a second higher than the 3.0 and 3.1 kernels.

postmark
--------
This is interesting in that the 3.2 kernel results show an improvement
in maximum read latencies while 3.4 is looking worse. The completion
times for postmark were very badly affected in 3.4: almost the opposite
of what the fsmark workloads showed. It's hard to draw any sensible
conclusions from this that match up with fsmark.

largedd
-------
Completion times are more or less unaffected. Maximum read latencies are
affected though. In 2.6.39 our maximum latency was 781ms; it was 13163ms
in 3.0 and 1122ms in 3.2, which might explain some of the interactivity
complaints around those kernels when a large cp was going on. Right now,
things are looking very good.

micro
-----
Completion times look ok. 2.6.32 is again hilariously bad. 3.1 also
showed very poor maximum latencies, but 3.2 and later kernels look good.
==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
==========================================================

fsmark-single
-------------
Completion times are all over the place, with a big increase in 3.2 that
has improved somewhat since but is still not as good as the 3.1 kernels.
Unlike arnold, 2.6.32 is not a complete mess here, which makes a
comparison more meaningful. Our maximum latencies have jumped around a
lot, with 3.2 being particularly bad and 3.4 not much better; 3.1 and
3.3 were both good in terms of maximum latency. Average latency is shot
to hell: in 2.6.32 it was 349ms and it's now 781ms. 3.2 was really bad,
but it's not like 3.0 or 3.1 were fantastic either.

fsmark-threaded
---------------
Completion times are more or less ok. Maximum read latency is worse,
with increases of around 500ms in worst-case latency, and even the 90th
percentile is not looking great. Average latency is completely shot.

postmark
--------
Again, impossible to draw sensible conclusions from this. The throughput
graph makes a nice sawtooth pattern suitable for poking you in the eye
until it bleeds. It's all over the place in terms of completion times.
Average latency figures are relatively ok but still regressed. Maximum
latencies have increased.

largedd
-------
Completion times are more or less steady, although 3.2 showed a large
jump in the length of time it took to copy the files: almost 10 minutes
more than 3.1 or 3.3. Maximum latencies in 3.2 were very high and the
90th percentile also looked pretty bad. 3.4 is better but still way
worse than 2.6.32. Average latency would be laughable if it were not so
tragic.

micro
-----
This was looking better until 3.4, when max latencies jumped, but by and
large this looks good.
==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
==========================================================

fsmark-single
-------------
Completion times are more or less ok. They have been worse since 3.2
than they were before it, but are still better than 3.2 itself by a big
margin. Read latencies are another story. Maximum latency has increased
a LOT, from 1.3 seconds to 3.1 seconds in kernel 3.4; kernel 3.0 had a
maximum latency of 6.5 seconds! The 90th percentile figures are not much
better, with latencies of more than 1 second being recorded all the way
back to 2.6.39. Average latencies have more than tripled, from 230ms to
812ms.

fsmark-threaded
---------------
Completion times are generally good. Read latencies are completely
screwed: kernel 3.2 had a maximum latency of 15 seconds! 3.4 has
improved, but it's still way too high. Even the 90th percentile figures
look completely crap and the average latency is, of course, bad with
such high latencies being recorded.

postmark
--------
Once again the throughput figures make a nice stab-stab shape for the
eyes. The latency figures are sufficiently crap that it depresses me to
talk about them.

largedd
-------
Completion times look decent, which does not lessen the shock of the
latency figures, which are again shocking: 6-second maximum latencies in
the 3.3 and 3.4 kernels, although 3.2 was actually quite good. Even the
90th percentile latencies have almost doubled since 3.2 and, of course,
the average latencies have almost tripled, in line with other results.

micro
-----
Completion times look decent. The read latencies are ok. The average
latency is higher because the latencies up to the 90th percentile are
higher, but the maximum latencies have improved, so overall I guess this
is a win.
-- 
Mel Gorman
SUSE Labs
* [MMTests] Interactivity during IO on ext3 @ 2012-07-05 14:56 ` Mel Gorman 0 siblings, 0 replies; 108+ messages in thread From: Mel Gorman @ 2012-07-05 14:56 UTC (permalink / raw) To: linux-mm; +Cc: linux-kernel, linux-fsdevel Configuration: global-dhp__io-interactive-performance-ext3 Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3 Benchmarks: postmark largedd fsmark-single fsmark-threaded micro Summary ======= There are some terrible results in here that might explain some of the interactivity mess if the distribution defaulted to ext3 or was was chosen by the user for any reason. In some cases average read latency has doubled, tripled and in one case almost quadrupled since 2.6.32. Worse, we are not consistently good or bad. I see patterns like great release, bad release, good release, bad again etc. Benchmark notes =============== NOTE: This configuration is new and very experimental. This is my first time looking at the results of this type of test so flaws are inevitable. There is ample scope for improvement but I had to start somewhere. This configuration is very different in that it is trying to analyse the impact of IO on interactive performance. Some interactivity problems are due to an application trying to read() cache-cold data such as configuration files or cached images. If there is a lot of IO going on, the application may stall while this happens. This is a limited scenario for measuring interactivity but a common one. These tests are fairly standard except that there is a background application running in parallel. It begins by creating a 100M file and using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is complete it will try to read 1M from the file every few seconds and record the latency. When it reaches the end of the file, it dumps it from cache and starts again. This latency is a *proxy* measure of interactivity, not a true measure. 
A variation would be to measure the time for small writes for applications
that are logging data, or applications like gnome-terminal that do small
writes to /tmp as part of their buffer management. The main strength is
that if we get this basic case wrong, then the complex cases are almost
certainly screwed as well.

There are two areas to pay attention to. One is completion time and how it
is affected by the small reads taking place in parallel. A comprehensive
analysis would show exactly how much the workload is affected by a parallel
read but right now I'm just looking at wall time.

The second area to pay attention to is the read latencies, paying
particular attention to the average latency and the max latencies. The
variations are harder to draw decent conclusions from. A sensible option
would be to plot a CDF to get a better idea of what the probability of a
given read latency is but for now that's a TODO item. As it is, the graphs
are barely usable and I'll be giving that more thought.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

fsmark-single
-------------
Completion times since 3.2 have been badly affected, which coincides with
the introduction of IO-less dirty page throttling. 3.3 was particularly
bad.

2.6.32 was TERRIBLE in terms of read latencies with the average and max
latencies looking awful. The 90th percentile was close to 4 seconds and as
a result the graphs are even more of a complete mess than they might have
been otherwise.

Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
reads were below 206ms but in 3.2 this had grown to 273ms. The latency of
the remaining 5% of results increased from 481ms to 774ms.

3.4 is looking better at least.
fsmark-threaded
---------------
With multiple writers, completion times have been affected and again 3.2
showed a big increase.

Again, 2.6.32 is a complete disaster and mucks up all the graphs.

Otherwise, our average read latencies do not look too bad. However, our
worst-case latencies look pretty bad. Kernel 3.2 is showing that at worst a
read() can take 4.3 seconds when there are multiple parallel writers. This
must be fairly rare as 99% of the latencies were below 1 second, but a 4
second stall in an application would sometimes feel pretty bad.

Maximum latencies have improved a bit in 3.4 but are still around half a
second higher than in the 3.0 and 3.1 kernels.

postmark
--------
This is interesting in that the 3.2 kernel results show an improvement in
maximum read latencies while 3.4 is looking worse. The completion times for
postmark were very badly affected in 3.4. This is almost the opposite of
what the fsmark workloads showed. It's hard to draw any sensible
conclusions from this that match up with fsmark.

largedd
-------
Completion times are more or less unaffected.

Maximum read latencies are affected though. Our maximum latency was 781ms
in 2.6.39, 13163ms in 3.0 and 1122ms in 3.2, which might explain some of
the interactivity complaints around those kernels when a large cp was going
on. Right now, things are looking very good.

micro
-----
Completion times look ok.

2.6.32 is again hilariously bad.

3.1 also showed very poor maximum latencies but 3.2 and later kernels look
good.
==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
==========================================================

fsmark-single
-------------
Completion times are all over the place, with a big increase in 3.2 that
has improved a bit since but is still not as good as the 3.1 kernels were.

Unlike arnold, 2.6.32 is not a complete mess, which makes a comparison more
meaningful. Our maximum latencies have jumped around a lot, with 3.2 being
particularly bad and 3.4 not much better. 3.1 and 3.3 were both good in
terms of maximum latency.

Average latency is shot to hell. In 2.6.32 it was 349ms and it's now 781ms.
3.2 was really bad but it's not like 3.0 or 3.1 were fantastic either.

fsmark-threaded
---------------
Completion times are more or less ok.

Maximum read latency is worse, with increases of around 500ms in the worst
latency, and even the 90th percentile is not looking great. Average latency
is completely shot.

postmark
--------
Again, it is impossible to draw sensible conclusions from this. The
throughput graph makes a nice sawtooth pattern suitable for poking you in
the eye until it bleeds. It's all over the place in terms of completion
times. Average latency figures are relatively ok but still regressed.
Maximum latencies have increased.

largedd
-------
Completion times are more or less steady, although 3.2 showed a large jump
in the length of time it took to copy the files. 3.2 took almost 10 minutes
more to copy the files than 3.1 or 3.3. Maximum latencies in 3.2 were very
high and the 90th percentile also looked pretty bad. 3.4 is better but
still way worse than 2.6.32. Average latency would be laughable if it was
not so tragic.

micro
-----
This was looking better until 3.4 when max latencies jumped, but by and
large this looks good.
==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
==========================================================

fsmark-single
-------------
Completion times are more or less ok. They've been worse since 3.2 but are
still better than 2.6.32 by a big margin.

Read latencies are another story. Maximum latency has increased a LOT, from
1.3 seconds to 3.1 seconds in kernel 3.4. Kernel 3.0 had a maximum latency
of 6.5 seconds! The 90th percentile figures are not much better, with
latencies of more than 1 second being recorded all the way back to 2.6.39.
Average latencies have more than tripled from 230ms to 812ms.

fsmark-threaded
---------------
Completion times are generally good.

Read latencies are completely screwed. Kernel 3.2 had a maximum latency of
15 seconds! 3.4 has improved but it's still way too high. Even the 90th
percentile figures look completely crap and average latency is of course
bad with such high latencies being recorded.

postmark
--------
Once again the throughput figures make a nice stab-stab shape for the eyes.
The latency figures are sufficiently crap that it depresses me to talk
about them.

largedd
-------
Completion times look decent, which does not lessen the shock of the
latency figures, which are again shocking: 6 second maximum latencies in
the 3.3 and 3.4 kernels, although 3.2 was actually quite good. Even the
90th percentile latencies have almost doubled since 3.2 and of course the
average latencies have almost tripled in line with other results.

micro
-----
Completion times look decent. The read latencies are ok. The average
latency is higher because the latencies up to the 90th percentile are
higher, but the maximum latencies have improved, so overall I guess this is
a win.
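The percentile figures quoted throughout (90th, 95th, 99th) are summaries
of the raw latency samples the monitor records. As a rough illustration of
how such summaries can be produced from raw samples (this is not mmtests'
actual reporting script, and the sample latencies are made up), a
nearest-rank percentile can be computed like this:

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value such that at
    least pct% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100)
    return ordered[max(rank, 1) - 1]

# hypothetical read latencies in milliseconds
latencies = [12, 40, 55, 90, 120, 206, 250, 481, 774, 3900]
for pct in (50, 90, 95, 99):
    print("%2dth percentile: %dms" % (pct, percentile(latencies, pct)))
```

A CDF plot, as mentioned in the benchmark notes, is just the sorted samples
plotted against their cumulative fraction, so the same sorted data serves
both purposes. Note how a single outlier (3900ms here) dominates the 95th
and 99th percentiles while leaving the median untouched, which is why the
average and maximum latencies are discussed separately above.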
-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [MMTests] Interactivity during IO on ext3
From: Jan Kara @ 2012-07-10 9:49 UTC
To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Thu 05-07-12 15:56:52, Mel Gorman wrote:
> Configuration: global-dhp__io-interactive-performance-ext3
> Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3
> Benchmarks: postmark largedd fsmark-single fsmark-threaded micro
>
> <SNIP>
>
> ===========================================================
> Machine: arnold
> Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/arnold/comparison.html
> Arch: x86
> CPUs: 1 socket, 2 threads
> Model: Pentium 4
> Disk: Single Rotary Disk
> ===========================================================
>
> fsmark-single
> -------------
> Completion times since 3.2 have been badly affected which coincides with
> the introduction of IO-less dirty page throttling. 3.3 was particularly
> bad.
>
> 2.6.32 was TERRIBLE in terms of read-latencies with the average latency
> and max latencies looking awful. The 90th percentile was close to 4
> seconds and as a result the graphs are even more of a complete mess than
> they might have been otherwise.
>
> Otherwise it's worth looking closely at 3.0 and 3.2. In 3.0, 95% of the
> reads were below 206ms but in 3.2 this had grown to 273ms. The latency
> of the other 5% results increased from 481ms to 774ms.
>
> 3.4 is looking better at least.

Yeah, 3.4 looks OK and I'd be interested in 3.5 results since I've merged
one more fix which should help the read latency. But all in all it's hard
to tackle the latency problems with ext3 - we have a journal which
synchronizes all the writes so we write to it with a high priority (we use
WRITE_SYNC when there's some contention on the journal). But that naturally
competes with reads and creates higher read latency.

> <SNIP>
>
> ==========================================================
> Machine: hydra
> Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext3/hydra/comparison.html
> Arch: x86-64
> CPUs: 1 socket, 4 threads
> Model: AMD Phenom II X4 940
> Disk: Single Rotary Disk
> ==========================================================
>
> fsmark-single
> -------------
> Completion times are all over the place with a big increase in 3.2 that
> improved a bit since but not as good as 3.1 kernels were.
>
> Unlike arnold, 2.6.32 is not a complete mess and makes a comparison more
> meaningful. Our maximum latencies have jumped around a lot with 3.2
> being particularly bad and 3.4 not being much better. 3.1 and 3.3 were
> both good in terms of maximum latency.
>
> Average latency is shot to hell. In 2.6.32 it was 349ms and it's now 781ms.
> 3.2 was really bad but it's not like 3.0 or 3.1 were fantastic either.

So I wonder what makes a difference between this machine and the previous
one. The results seem completely different. Is it the amount of memory? Is
it the difference in the disk? Or even the difference in the CPU?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [MMTests] Interactivity during IO on ext3
From: Mel Gorman @ 2012-07-10 11:30 UTC
To: Jan Kara; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Tue, Jul 10, 2012 at 11:49:40AM +0200, Jan Kara wrote:
> > <SNIP>
> >
> > 3.4 is looking better at least.
>
> Yeah, 3.4 looks OK and I'd be interested in 3.5 results since I've merged
> one more fix which should help the read latency.

When 3.5 comes out, I'll queue up the same tests. Ideally I would be
running against each rc but the machines are used for other tests as well
and these ones take too long for continual testing to be practical.

> But all in all it's hard
> to tackle the latency problems with ext3 - we have a journal which
> synchronizes all the writes so we write to it with a high priority
> (we use WRITE_SYNC when there's some contention on the journal). But that
> naturally competes with reads and creates higher read latency.
>

Thanks for the good explanation. I'll know to look out for this in
interactivity-related or IO-latency bugs.

> > <SNIP>
>
> So I wonder what makes a difference between this machine and the previous
> one. The results seem completely different. Is it the amount of memory? Is
> it the difference in the disk? Or even the difference in the CPU?
>

Two big differences are 32-bit versus 64-bit and the 32-bit machine having
4G of RAM while the 64-bit machine has 8G. On the 32-bit machine, bounce
buffering may have been an issue, but as -S0 was specified (no sync) there
would also be differences in when dirty page balancing took place.

-- 
Mel Gorman
SUSE Labs
* [MMTests] Interactivity during IO on ext4
From: Mel Gorman @ 2012-07-05 14:57 UTC
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-interactive-performance-ext4
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4
Benchmarks: postmark largedd fsmark-single fsmark-threaded micro

Summary
=======

Unlike ext3, these figures look generally good. There are a few wrinkles in
there but indications are that the interactivity jitter experienced by
users may have a filesystem-specific component. One possibility is that
there are differences in how metadata reads are sent to the IO scheduler
but I did not confirm this.

Benchmark notes
===============

NOTE: This configuration is new and very experimental. This is my first
time looking at the results of this type of test so flaws are inevitable.
There is ample scope for improvement but I had to start somewhere.

This configuration is very different in that it is trying to analyse the
impact of IO on interactive performance. Some interactivity problems are
due to an application trying to read() cache-cold data such as
configuration files or cached images. If there is a lot of IO going on,
the application may stall while this happens. This is a limited scenario
for measuring interactivity but a common one.

These tests are fairly standard except that there is a background
application running in parallel. It begins by creating a 100M file and
using fadvise(POSIX_FADV_DONTNEED) to evict it from cache. Once that is
complete it will try to read 1M from the file every few seconds and record
the latency. When it reaches the end of the file, it dumps it from cache
and starts again.

This latency is a *proxy* measure of interactivity, not a true measure.
A variation would be to measure the time for small writes, for applications that are logging data or applications like gnome-terminal that do small writes to /tmp as part of their buffer management. The main strength is that if we get this basic case wrong, then the complex cases are almost certainly screwed as well.

There are two areas to pay attention to. One is completion time and how it is affected by the small reads taking place in parallel. A comprehensive analysis would show exactly how much the workload is affected by a parallel read, but right now I'm just looking at wall time. The second area is the read latencies, paying particular attention to the average latency and the max latencies. The variations are harder to draw decent conclusions from. A sensible option would be to plot a CDF to get a better idea of the probability of a given read latency, but for now that's a TODO item. As it is, the graphs are barely usable and I'll be giving that more thought.

===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:
===========================================================

fsmark-single
-------------

Completion times are more or less ok. 3.2 showed a big improvement, which is not in line with what was experienced on ext3. As with ext3, kernel 2.6.32 was a disaster, but otherwise our maximum read latencies were looking up until 3.3, when there was a big jump that was not fixed in 3.4. By and large though, the average latencies are looking good, and while the max latency is bad, the 99th percentile looks good, implying that the worst latencies are rarely experienced.

fsmark-threaded
---------------

Completion times look generally good, with 3.1 being an exception. Latencies are also looking good.
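As an aside, the CDF mentioned in the benchmark notes is cheap to compute from the raw latency samples that are already captured. A minimal sketch, using a nearest-rank percentile; the function names are mine, not part of mmtests:

```python
def latency_cdf(latencies):
    """Return (value, cumulative_probability) pairs: P(latency <= value)."""
    xs = sorted(latencies)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def percentile(latencies, p):
    """Nearest-rank percentile of the samples, 0 < p <= 100."""
    xs = sorted(latencies)
    k = max(0, int(round(p / 100.0 * len(xs))) - 1)
    return xs[k]
```

The pairs from latency_cdf() can be fed straight to a plotter; percentile(samples, 99) gives the 99th percentile figure quoted in the reports below.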
postmark
--------

Similar story. Completion times and latencies generally look good.

largedd
-------

Completion times were higher from 2.6.39 up until 3.3, taking nearly two minutes to complete the copy in some cases. This is reflected in some of the maximum latencies in that window, but by and large the read latencies are much improved.

micro
-----

Looking good all round.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Ok
==========================================================

fsmark-single
-------------

Completion times have degraded slightly but are acceptable. All the latency figures look good, with some big improvements.

fsmark-threaded
---------------

Same story, generally looking good with big improvements.

postmark
--------

Completion times are a bit varied, but latencies look good.

largedd
-------

Completion times look good. Latency has improved since 2.6.32, but there is a big wrinkle in there. Maximum latency was 337ms in kernel 3.2, but it was 707ms in 3.3 and 990ms in 3.4. The 99th percentile figures look good, but something happened to allow bigger outliers.

micro
-----

For the most part, this looks good, but there was a big jump in the maximum latency in kernel 3.4. As with largedd, the 99th percentile did not look as bad, so it might be an outlier.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-interactive-performance-ext4/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:
==========================================================

fsmark-single
-------------

Completion times have degraded slightly but are acceptable.
All the latency figures look good, with some big improvements.

fsmark-threaded
---------------

Completion times are improved, although curiously this is not reflected in the performance figures from fsmark itself. Maximum latency figures generally look good other than a mild jump in 3.2 that has almost been recovered.

postmark
--------

Completion times have varied a lot, and 3.4 is particularly high. The latency figures in general regressed in 3.4 in comparison to 3.3, but by and large the figures look good.

largedd
-------

Completion times generally look good but were noticeably worse for a number of releases between 2.6.39 and 3.2. This same window showed much higher latency figures, with kernel 3.1 showing a maximum latency of 1.3 seconds, for example. These were mostly outliers though, as the 99th percentile generally looked ok.

micro
-----

Generally much improved.

--
Mel Gorman
SUSE Labs
* [MMTests] Scheduler
From: Mel Gorman
Date: 2012-07-23 21:12 UTC (in reply to 2012-06-29 11:19)
To: linux-mm
Cc: linux-kernel

Configuration: global-dhp__scheduler-performance
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance
Benchmarks: hackbench-pipes hackbench-sockets pipetest starve lmbench

Summary
=======

This is a mixed bag. The results on an i7 generally look great! There are some major improvements in there, and I think this may be due to scheduler developers working with the latest chips. The other machines did not fare as well. Look at pipetest on hydra for an example of a particularly bad set of results.

Benchmark notes
===============

starve (http://www.hpl.hp.com/research/linux/kernel/o1-starve.php) was chosen because, even though it was designed to isolate a bug in the O(1) scheduler, it is still interesting to monitor for performance regressions. It does not take any special parameters.

hackbench was chosen because it is a general scheduler benchmark that is sensitive to regressions in the scheduler fast path. It is difficult to draw conclusions from, as it is somewhat sensitive to the starting conditions of the machine, but trends over time may be observed. It is run in both pipe and sockets mode, and for each number of clients it is run for 30 iterations.

pipetest is a scheduler ping-pong test that measures context switch latency. It runs for 30 iterations.

lmbench is just running the lat_ctx test and is another measure of context switch latency.

===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
Status:   Context switch latency is regressing.
===========================================================

starve is looking ok except for 3.0 and 3.1, where System CPU time and elapsed time increased. This was fixed in later kernels but is worth noting for users of -stable.

lmbench showed a small regression in 3.0 where context switch latency increased, and this has not been recovered yet. 3.3.6 was particularly bad for low numbers of clients.

hackbench-pipes looks ok in comparison to 2.6.32. The "Time ratio" graph shows that kernels are below the red line, reflecting that most kernels are faster. However, it also shows that 2.6.34 was the "best" kernel and recent kernels have regressed slightly.

hackbench-sockets regressed badly after 2.6.34 until 3.3, which should be investigated. Again, this is most obvious in the Time Ratio graph.

pipetest is showing major regressions in latency since some time between 2.6.34 and 2.6.39.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   pipetest is particularly bad.
==========================================================

starve is generally ok although, again, 3.0 and 3.1 both regressed on System CPU time. This improved in kernels after that, but it is still a little worse than 2.6.32 was.

lmbench shows no regression in 3.0, unlike on arnold, but later kernels are much worse, with the latency of 3.4 generally higher than it was in 3.2.

hackbench-pipes generally looks ok.

hackbench-sockets is generally bad. 3.1 was particularly bad, and while 3.4 has improved the situation a bit, it is still worse than 2.6.32.

pipetest is showing major regressions. 3.2 regressed particularly badly.
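For readers unfamiliar with the benchmark: pipetest bounces a token between two processes over pipes so that every round trip forces context switches. The real benchmark is a small C program; the sketch below is only a rough Python illustration of the same idea, with all names and parameters mine:

```python
import os
import time

def pingpong(iterations=10000):
    """Bounce one byte between parent and child over two pipes and return
    the average round-trip time, a rough proxy for context switch cost."""
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte straight back.
        for _ in range(iterations):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)
    start = time.monotonic()
    for _ in range(iterations):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.monotonic() - start
    os.waitpid(pid, 0)
    return elapsed / iterations
```

Pinning both processes to one CPU (e.g. with taskset) makes every round trip at least two context switches, which is what makes the test so sensitive to scheduler fast-path regressions.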
==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__scheduler-performance/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:   Generally great.
==========================================================

starve is generally ok. 3.0 regressed in terms of System CPU time, but recent kernels are very good. This might reflect that a lot of people are testing with later Intel processors to the detriment of older models.

lmbench is looking superb.

hackbench-pipes looks great.

hackbench-sockets does not look as great, but it is still very good.

pipetest is generally looking good in comparison to 2.6.32. However, I am concerned that 3.4 is worse than 3.3.

--
Mel Gorman
SUSE Labs
* [MMTests] Sysbench read-only on ext3
From: Mel Gorman
Date: 2012-07-23 21:13 UTC (in reply to 2012-06-29 11:19)
To: linux-mm
Cc: linux-kernel

Configuration: global-dhp__io-sysbench-large-ro-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3
Benchmarks: sysbench

Summary
=======

Very large number of regressions.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the filesystem. No special mkfs or mount options were used.

sysbench is an OLTP-like benchmark. The test type was "complex" and read-only. The table size was 50,000,000 rows regardless of memory size, which far exceeds the memory size of any of the test machines. sysbench was chosen because it is a reasonably complex OLTP-like benchmark with straightforward prerequisites. The backing database was postgres.

===========================================================
Machine:  arnold
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/arnold/comparison.html
Arch:     x86
CPUs:     1 socket, 2 threads
Model:    Pentium 4
Disk:     Single Rotary Disk
===========================================================

sysbench
--------

Oddly, two clients is better, but 1 or 4 is worse. Swapping for kernels 3.1 and 3.2 is crazy. Direct reclaim has been occurring since 2.6.39 and has not eased off, but in the context of the overall test it is very low.

==========================================================
Machine:  hydra
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/hydra/comparison.html
Arch:     x86-64
CPUs:     1 socket, 4 threads
Model:    AMD Phenom II X4 940
Disk:     Single Rotary Disk
Status:   Ok
==========================================================

sysbench
--------

There are a lot of regressions here, mostly introduced between 2.6.39 and 3.0. In general, this is looking bad.
Swapping in kernel 3.1 was higher.

==========================================================
Machine:  sandy
Result:   http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext3/sandy/comparison.html
Arch:     x86-64
CPUs:     1 socket, 8 threads
Model:    Intel Core i7-2600
Disk:     Single Rotary Disk
Status:
==========================================================

Generally this is telling a much better story, but that could be because the much larger memory size of this machine is offsetting some other regression. Swapping occurred in 3.1 and 3.2.

--
Mel Gorman
SUSE Labs
* Re: [MMTests] Sysbench read-only on ext3
From: Mike Galbraith
Date: 2012-07-24 2:29 UTC
To: Mel Gorman
Cc: linux-mm, linux-kernel

On Mon, 2012-07-23 at 22:13 +0100, Mel Gorman wrote:
> The backing database was postgres.

FWIW, that wouldn't have been my choice. I don't know if it still does, but it used to use userland spinlocks to achieve scalability. Turning your CPUs into space heaters to combat concurrency issues makes a pretty flat graph, but probably doesn't test kernels as well as something that did not do that.

-Mike
* Re: [MMTests] Sysbench read-only on ext3
From: Mel Gorman
Date: 2012-07-24 8:19 UTC
To: Mike Galbraith
Cc: linux-mm, linux-kernel

On Tue, Jul 24, 2012 at 04:29:29AM +0200, Mike Galbraith wrote:
> On Mon, 2012-07-23 at 22:13 +0100, Mel Gorman wrote:
> > The backing database was postgres.
>
> FWIW, that wouldn't have been my choice. I don't know if it still does,
> but it used to use userland spinlocks to achieve scalability.

The tests used to support mysql, but the code bit-rotted and eventually got deleted. I'm not going to get into a mysql vs postgres discussion on which is better :O

Were you thinking of mysql or something else as an alternative? A completely different test?

> Turning your CPUs into space heaters to combat concurrency issues makes
> a pretty flat graph, but probably doesn't test kernels as well as
> something that did not do that.

I did not check the source, but even if that is true, your comment only applies to testing the scalability of locking. If someone really cares to check, the postgres version was 9.0.4. However, even if it is using user-space locking, the test is still useful for looking at IO performance, page reclaim decisions and so on.

--
Mel Gorman
SUSE Labs
* Re: [MMTests] Sysbench read-only on ext3
From: Mike Galbraith
Date: 2012-07-24 8:32 UTC
To: Mel Gorman
Cc: linux-mm, linux-kernel

On Tue, 2012-07-24 at 09:19 +0100, Mel Gorman wrote:
> The tests used to support mysql but the code bit-rotted and eventually
> got deleted. I'm not going to get into a mysql vs postgres discussion on
> which is better :O
>
> Were you thinking of mysql or something else as an alternative?
> Completely different test?

Which db is under the hood doesn't matter much, but those spinlocks got me thinking.

> I did not check the source, but even if it is true then your comments only
> applies to testing scalability of locking. If someone really cares to check,
> the postgres version was 9.0.4. However, even if they are using user-space
> locking, the test is still useful for looking at the IO performance,
> page reclaim decisions and so on.

I was thinking that while you're spinning in userspace, you're not giving the kernel decisions to make. But you're right. If they didn't have spinning locks, they'd have sleeping locks. With spinning locks they can be less smart, I suppose.

-Mike
* [MMTests] Sysbench read-only on ext4
  2012-06-29 11:19 ` Mel Gorman
@ 2012-07-23 21:14 ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:14 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__io-sysbench-large-ro-ext4
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4
Benchmarks: sysbench

Summary
=======
Looking better in places than ext3 but still of concern.

Benchmark notes
===============
mkfs was run on system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

sysbench is an OLTP-like benchmark. The test type was "complex" and
read-only. The table size was 50,000,000 rows regardless of memory size
and far exceeds the memory size of any of the test machines. sysbench
was chosen because it is a reasonably complex OLTP-like benchmark with
straightforward prerequisites. The backing database was postgres.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

sysbench
--------
Generally regressed. Swapping for kernels 3.1 and 3.2 is very high.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Ok
==========================================================

sysbench
--------
For low numbers of clients, this has generally improved. Swapping in
kernel 3.1 was high.
==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-ext4/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status:
==========================================================

Generally this is telling a much better story but this could be because
of the much larger memory size of this machine offsetting some other
regression. Swapping in 3.1 and 3.2.

-- 
Mel Gorman
SUSE Labs
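For reference, the OLTP setup described in the benchmark notes can be sketched roughly as below. This is an editor's sketch against the sysbench 0.4-era command line, not the mmtests driver itself; the thread count, run length and database connection details are placeholder assumptions, and the commands are echoed rather than executed.

```shell
# Sketch (not the mmtests driver) of a sysbench 0.4-style OLTP
# "complex" read-only run as described in the benchmark notes.
# THREADS and MAX_TIME are assumptions, not values from the post.
TABLE_SIZE=50000000          # 50,000,000 rows, larger than RAM on all machines
THREADS=${THREADS:-4}        # varied per run in the real configuration
MAX_TIME=${MAX_TIME:-600}    # hypothetical run length

SYSBENCH_OPTS="--test=oltp --oltp-test-mode=complex --oltp-read-only=on \
--oltp-table-size=${TABLE_SIZE} --db-driver=pgsql \
--num-threads=${THREADS} --max-time=${MAX_TIME} --max-requests=0"

# populate the table once, then run the read-only transaction mix
echo sysbench ${SYSBENCH_OPTS} prepare
echo sysbench ${SYSBENCH_OPTS} run
```

Because the table far exceeds RAM, the read-only mix stresses page cache and reclaim behaviour rather than just locking.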
* [MMTests] Sysbench read-only on xfs
  2012-06-29 11:19 ` Mel Gorman
@ 2012-07-23 21:15 ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:15 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__io-sysbench-large-ro-xfs
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs
Benchmarks: sysbench

Summary
=======
Looking better in places than ext3 but still of concern.

Benchmark notes
===============
mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most
part. On kernels too old to support delaylog, it was removed. On kernels
where it was the default, it was specified and the warning ignored.

sysbench is an OLTP-like benchmark. The test type was "complex" and
read-only. The table size was 50,000,000 rows regardless of memory size
and far exceeds the memory size of any of the test machines. sysbench
was chosen because it is a reasonably complex OLTP-like benchmark with
straightforward prerequisites. The backing database was postgres.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

sysbench
--------
Everything regressed. Swapping for kernels 3.1 and 3.2 was nuts.
==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
Status: Ok
==========================================================

sysbench
--------
For low numbers of clients, this has generally improved but regressed
for larger numbers of clients. Swapping in kernel 3.1 was high.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-sysbench-large-ro-xfs/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
Status:
==========================================================

Generally this is telling a much better story but this could be because
of the much larger memory size of this machine offsetting some other
regression. Swapping in 3.1 and 3.2.

-- 
Mel Gorman
SUSE Labs
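The filesystem preparation steps in the notes above amount to the following. The device and mount point are placeholders, and the commands are echoed rather than executed, so this is only a sketch of the stated configuration:

```shell
# xfs preparation as stated in the benchmark notes; /dev/sdX and
# /mnt/mmtests are placeholders, not paths from the post.
DEV=/dev/sdX
MNT=/mnt/mmtests
MKFS_OPTS="-f -d agcount=8"
MOUNT_OPTS="inode64,delaylog,logbsize=262144,nobarrier"
# On kernels too old for delaylog, drop it from MOUNT_OPTS; on kernels
# where it is the default, keep it and ignore the warning.

echo mkfs.xfs ${MKFS_OPTS} ${DEV}
echo mount -t xfs -o ${MOUNT_OPTS} ${DEV} ${MNT}
```

Running mkfs at every system startup means each kernel sees a fresh, unaged filesystem, which keeps the comparison about the kernel rather than on-disk layout history.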
* [MMTests] memcachetest and parallel IO on ext3
  2012-06-29 11:19 ` Mel Gorman
@ 2012-07-23 21:17 ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:17 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__parallelio-memcachetest-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3
Benchmarks: parallelio

Summary
=======
Indications are not very clear as different machines point to different
kernels. Very broadly speaking, swapping got worse between 2.6.39 and
3.0 and then again between 3.2 and 3.3.

Benchmark notes
===============
This is an experimental benchmark designed to measure the impact of
background IO on a target workload.

mkfs was run on system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

The target workload in this case is memcached and memcachetest. This is
a benchmark of memcached and the workload is mostly anonymous. The
benchmark was chosen as it was a random client that is considered a
valid benchmark for memcache and does not consume much memory in the
client. The server was configured to use 80% of memory.

In the background, dd is used to generate IO of varying sizes. As the
sizes increase, memory pressure may push the target workload out of
memory. The benchmark is meant to measure how much the target workload
is affected and may be used as a proxy measure for page reclaim
decisions.

Unlike other benchmarks, only the run with the worst throughput is
displayed. This benchmark varies quite a bit depending on the reference
pattern from the client. This hides the interesting result in the noise
so we only consider the worst case.
===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

parallelio-memcachetest
-----------------------
Even for small amounts of background IO the memcached process is being
pushed into swap. This is due to a regression somewhere between 2.6.34
and 2.6.39 and a much larger regression between 2.6.39 and 3.0. This is
even worse in 3.3 and 3.4. The "page reclaim immediate" figures started
increasing from 3.2, implying that a lot of dirty LRU pages are reaching
the end of the LRU lists.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------
Performance was reasonable until relatively recent kernels. The results
show that for 3.3 and later kernels, swapping started for moderate
amounts of IO (1624M) and performance dropped off sharply as a result.
As with arnold, dirty pages are reaching the end of the LRU list.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-ext3/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------
This is showing everything smells of roses and the IO is not
interfering at all.
It is possible that this is due to the amount of memory and that the IO
is being completed fast enough.

-- 
Mel Gorman
SUSE Labs
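The parallel-IO harness described in the benchmark notes has roughly the shape below. This is an editor's sketch only: the size steps and paths are illustrative placeholders, not the mmtests configuration, and the dd command is echoed rather than run.

```shell
# Rough shape of the parallel-IO driver: dd generates background writes
# of increasing size while the target workload runs in the foreground.
# MNT and the MB steps are hypothetical, not from the post.
MNT=/mnt/mmtests
COUNT=0
for MB in 512 1024 2048 4096; do
  # In the real harness dd runs in the background (dd ... &) while
  # memcachetest drives a memcached server sized to 80% of memory;
  # only the worst-throughput run is reported.
  echo "dd if=/dev/zero of=${MNT}/ddfile bs=1048576 count=${MB}"
  COUNT=$((COUNT + 1))
done
```

As the background file size grows past available memory, the kernel must decide whether to reclaim the dd pages or push the mostly-anonymous memcached working set into swap, which is exactly the reclaim decision the benchmark probes.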
* [MMTests] memcachetest and parallel IO on xfs
  2012-06-29 11:19 ` Mel Gorman
@ 2012-07-23 21:19 ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:19 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__parallelio-memcachetest-xfs
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs
Benchmarks: parallelio

Summary
=======
Indications are that there was a large regression in page reclaim
decisions between 2.6.39 and 3.0 as swapping increased a lot.

Benchmark notes
===============
This is an experimental benchmark designed to measure the impact of
background IO on a target workload.

mkfs was run on system startup.
mkfs parameters: -f -d agcount=8
mount options: inode64,delaylog,logbsize=262144,nobarrier for the most
part. On kernels too old to support delaylog, it was removed. On kernels
where it was the default, it was specified and the warning ignored.

The target workload in this case is memcached and memcachetest. This is
a benchmark of memcached and the workload is mostly anonymous. The
benchmark was chosen as it was a random client that is considered a
valid benchmark for memcache and does not consume much memory in the
client. The server was configured to use 80% of memory.

In the background, dd is used to generate IO of varying sizes. As the
sizes increase, memory pressure may push the target workload out of
memory. The benchmark is meant to measure how much the target workload
is affected and may be used as a proxy measure for page reclaim
decisions.

Unlike other benchmarks, only the run with the worst throughput is
displayed. This benchmark varies quite a bit depending on the reference
pattern from the client. This hides the interesting result in the noise
so we only consider the worst case.
===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

parallelio-memcachetest
-----------------------
Even for small amounts of background IO the memcached process is being
pushed into swap for 3.3 and 3.4 although earlier kernels fared better.
There are indications that there was a serious regression between 2.6.39
and 3.0 as throughput dropped for larger amounts of IO and swapping was
high. The "page reclaim immediate" figures started increasing from 3.2,
implying that a lot of dirty LRU pages are reaching the end of the LRU
lists.

==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
==========================================================

parallelio-memcachetest
-----------------------
Performance again dropped sharply between 2.6.39 and 3.0 with huge jumps
in the amount of swap IO. As with arnold, dirty pages are reaching the
end of the LRU list.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__parallelio-memcachetest-xfs/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
==========================================================

No results available.

-- 
Mel Gorman
SUSE Labs
* [MMTests] Stress high-order allocations on ext3
  2012-06-29 11:19 ` Mel Gorman
@ 2012-07-23 21:20 ` Mel Gorman
  -1 siblings, 0 replies; 108+ messages in thread
From: Mel Gorman @ 2012-07-23 21:20 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel

Configuration: global-dhp__stress-highalloc-performance-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3
Benchmarks: kernbench vmr-stream sysbench stress-highalloc

Summary
=======
Allocation success rates of huge pages were looking great until 3.4
when they dropped through the floor.

Benchmark notes
===============
All machines were booted with mem=4096M due to limitations of the test.

This is an old series of benchmarks that stressed anti-fragmentation
and the allocation of huge pages. It is being replaced with other series
of tests which will be more representative but it still produces some
interesting results. I tend to use these results as an early warning
system before doing a more detailed series of tests.

Only the results from the stress-highalloc benchmark are actually of
interest and the other benchmarks are just there to age the machine in
terms of fragmentation.

===========================================================
Machine: arnold
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/arnold/comparison.html
Arch: x86
CPUs: 1 socket, 2 threads
Model: Pentium 4
Disk: Single Rotary Disk
===========================================================

stress-highalloc
----------------
Generally this is going in the right direction. High-order allocations
are reasonably successful and where they drop, they have been matched
by a large reduction in the length of time it takes to complete the
test. Success rates in 3.4 did drop sharply though.
==========================================================
Machine: hydra
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
Arch: x86-64
CPUs: 1 socket, 4 threads
Model: AMD Phenom II X4 940
Disk: Single Rotary Disk
==========================================================

stress-highalloc
----------------
Until 3.4, this was looking good. Unfortunately in 3.4 there was a
massive drop in success rates. This correlates with the removal of
lumpy reclaim which compaction indirectly depended upon. This strongly
indicates that enough memory is not being reclaimed for compaction to
make forward progress or compaction is being disabled routinely due to
failed attempts at compaction. The success rates at the end of the test
when the machine is idle are still high, implying that
anti-fragmentation itself is still working as expected.

==========================================================
Machine: sandy
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/sandy/comparison.html
Arch: x86-64
CPUs: 1 socket, 8 threads
Model: Intel Core i7-2600
Disk: Single Rotary Disk
==========================================================

Same as hydra, this was looking good until 3.4 and then success rates
dropped through the floor.

-- 
Mel Gorman
SUSE Labs
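The success rates discussed in this report are for huge-page-sized allocations (order-9 on x86 with 4K pages). A quick, independent way to see whether free blocks of that order exist at all is to sum the high-order columns of /proc/buddyinfo, whose fields after "Node N, zone ZONE" are the free-block counts for orders 0 upwards. This is an editor's sketch, not part of mmtests; the sample counts below are made up.

```shell
# Sum free blocks of order >= 9 from lines in /proc/buddyinfo format
# read on stdin. Fields 5..NF are per-order free-block counts starting
# at order 0, so order o is field (o + 5).
buddy_order9() {
  awk '{ s = 0; for (o = 9; o <= NF - 5; o++) s += $(o + 5); print $4, s }'
}

# Demonstrate on a synthetic line (counts are invented for illustration);
# on a live system: buddy_order9 < /proc/buddyinfo
echo "Node 0, zone Normal 1 2 3 4 5 6 7 8 9 10 11" | buddy_order9
```

A persistently near-zero count under load, recovering when the machine goes idle, matches the pattern described above: anti-fragmentation is intact but reclaim/compaction cannot produce the blocks while the workload runs.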
* [MMTests] dbench4 async on ext3 @ 2012-07-23 21:21 ` Mel Gorman

From: Mel Gorman @ 2012-07-23 21:21 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-dbench4-async-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
Benchmarks: dbench4

Summary
=======

In general there was a massive drop in throughput after 3.0. Very
broadly speaking, it looks like the Read operation got faster but at
the cost of a big regression in the Flush operation.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

dbench 4 was used. Tests ran for 180 seconds once warmed up. A varying
number of clients was used, up to 64*NR_CPU. osync, sync-directory and
fsync were all off.

===========================================================
Machine: arnold
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/arnold/comparison.html
Arch:    x86
CPUs:    1 socket, 2 threads
Model:   Pentium 4
Disk:    Single Rotary Disk
===========================================================

dbench4
-------

Generally worse, with a big drop in throughput after 3.0 for small
numbers of clients. In some cases there is an improvement in latency
for 3.0 and later kernels, but not always.

==========================================================
Machine: hydra
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/hydra/comparison.html
Arch:    x86-64
CPUs:    1 socket, 4 threads
Model:   AMD Phenom II X4 940
Disk:    Single Rotary Disk
Status:  Ok
==========================================================

dbench4
-------

Similar to arnold: a big drop in throughput after 3.0 for small numbers
of clients. Unlike arnold, this is matched by an improvement in
latency, so it may be the case that IO is fairer even if dbench
complains about the latency. Very broadly speaking, it looks like the
read operation got a lot faster but flush got a lot slower.

==========================================================
Machine: sandy
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3/sandy/comparison.html
Arch:    x86-64
CPUs:    1 socket, 8 threads
Model:   Intel Core i7-2600
Disk:    Single Rotary Disk
Status:
==========================================================

dbench4
-------

Same story: a big drop in throughput after 3.0, with flush again
looking very expensive for 3.1 and later kernels. Latency figures are a
mixed bag.

--
Mel Gorman
SUSE Labs
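The client scaling described in the benchmark notes can be sketched as
a small loop. This is illustrative only, not the actual mmtests driver:
the dbench invocations are printed rather than executed, and the flags
shown (`-t` for runtime) are the stock dbench ones.

```shell
#!/bin/sh
# Illustrative sketch: run dbench 4 for 180 seconds per client count,
# doubling the client count from 1 up to 64*NR_CPU as the benchmark
# notes state. Invocations are echoed instead of run.
NR_CPU=$(getconf _NPROCESSORS_ONLN)
CLIENTS=1
while [ "$CLIENTS" -le $((64 * NR_CPU)) ]; do
        echo "dbench -t 180 $CLIENTS"
        CLIENTS=$((CLIENTS * 2))
done
```

The doubling series keeps the total running time manageable while still
covering both the lightly and heavily loaded cases.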
* Re: [MMTests] dbench4 async on ext3 @ 2012-08-16 14:52 ` Jan Kara

From: Jan Kara @ 2012-08-16 14:52 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Mon 23-07-12 22:21:46, Mel Gorman wrote:
> Configuration: global-dhp__io-dbench4-async-ext3
> Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
> Benchmarks: dbench4
>
> Summary
> =======
>
> In general there was a massive drop in throughput after 3.0. Very broadly
> speaking it looks like the Read operation got faster but at the cost of
> a big regression in the Flush operation.

This looks bad. Also, quickly looking into the changelogs, I don't see
any change which could cause this. I'll try to reproduce this and track
it down.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [MMTests] dbench4 async on ext3 @ 2012-08-21 22:00 ` Jan Kara

From: Jan Kara @ 2012-08-21 22:00 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Mon 23-07-12 22:21:46, Mel Gorman wrote:
> Configuration: global-dhp__io-dbench4-async-ext3
> Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext3
> Benchmarks: dbench4
>
> Summary
> =======
>
> In general there was a massive drop in throughput after 3.0. Very broadly
> speaking it looks like the Read operation got faster but at the cost of
> a big regression in the Flush operation.

Mel, I had a look into this and it is very likely only a configuration
issue. In 3.1, ext3 started to default to enabled barriers (barrier=1
in the mount options), which is a safer but slower choice. When I set
barriers explicitly, I see no performance difference for dbench4
between 3.0 and 3.1.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
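The configuration difference Jan describes comes down to one mount
option. A dry-run sketch of the two setups being compared (the device
and mountpoint names are placeholders, and the commands are printed
rather than executed):

```shell
#!/bin/sh
# Print the two ext3 configurations: barriers off (the pre-3.1 default)
# versus barriers on (the 3.1+ default). /dev/sdb1 and /mnt/test are
# placeholder names, not real test-machine paths.
DEV=/dev/sdb1
MNT=/mnt/test
for BARRIER in 0 1; do
        echo "mount -t ext3 -o barrier=$BARRIER $DEV $MNT"
        echo "dbench -t 180 -D $MNT 4"
        echo "umount $MNT"
done
```

Setting barrier= explicitly on both kernels removes the behaviour
change from the comparison, which is how the regression was isolated to
the changed default rather than to the IO path itself.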
* Re: [MMTests] dbench4 async on ext3 @ 2012-08-22 10:48 ` Mel Gorman

From: Mel Gorman @ 2012-08-22 10:48 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-mm, linux-kernel, linux-fsdevel

On Wed, Aug 22, 2012 at 12:00:38AM +0200, Jan Kara wrote:
> Mel, I had a look into this and it's actually very likely only a
> configuration issue. In 3.1 ext3 started to default to enabled barriers
> (barrier=1 in mount options) which is a safer but slower choice. When I set
> barriers explicitly, I see no performance difference for dbench4 between
> 3.0 and 3.1.
>

I've confirmed that disabling barriers fixed it, for one test machine
and one test at least. I'll reschedule the tests to run with barriers
disabled at some point in the future. Thanks for tracking it down; I
was at least two weeks away from getting the chance to even look.

--
Mel Gorman
SUSE Labs
* [MMTests] dbench4 async on ext4 @ 2012-07-23 21:23 ` Mel Gorman

From: Mel Gorman @ 2012-07-23 21:23 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-dbench4-async-ext4
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4
Benchmarks: dbench4

Summary
=======

Nothing majorly exciting, although throughput has been declining
slightly in a number of cases. However, this is not consistent between
machines, and latency has also been variable. Broadly speaking, there
is no need to take any action here.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

dbench 4 was used. Tests ran for 180 seconds once warmed up. A varying
number of clients was used, up to 64*NR_CPU. osync, sync-directory and
fsync were all off.

===========================================================
Machine: arnold
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/arnold/comparison.html
Arch:    x86
CPUs:    1 socket, 2 threads
Model:   Pentium 4
Disk:    Single Rotary Disk
===========================================================

dbench4
-------

In very vague terms, throughput has been getting worse over time, but
very gradually. Latency has also been getting worse.

==========================================================
Machine: hydra
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/hydra/comparison.html
Arch:    x86-64
CPUs:    1 socket, 4 threads
Model:   AMD Phenom II X4 940
Disk:    Single Rotary Disk
Status:  Ok
==========================================================

dbench4
-------

This is a mixed bag: there are gains and losses, and it is hard to draw
any meaningful conclusion.

==========================================================
Machine: sandy
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-dbench4-async-ext4/sandy/comparison.html
Arch:    x86-64
CPUs:    1 socket, 8 threads
Model:   Intel Core i7-2600
Disk:    Single Rotary Disk
Status:
==========================================================

dbench4
-------

For the most part, there are few changes of note. Latency has been
getting better, particularly in 3.2 and later kernels.

--
Mel Gorman
SUSE Labs
* [MMTests] Threaded IO Performance on ext3 @ 2012-07-23 21:24 ` Mel Gorman

From: Mel Gorman @ 2012-07-23 21:24 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel

Configuration: global-dhp__io-threaded-ext3
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3
Benchmarks: tiobench

Summary
=======

Some good results, but some 3.x kernels were bad and this varied
between machines. In some, 3.1 and 3.2 were particularly bad. 3.4
regressed on one machine with a large amount of memory.

Benchmark notes
===============

mkfs was run on system startup. No attempt was made to age the
filesystem. No special mkfs or mount options were used.

The size parameter for tiobench was 2*RAM. This is barely sufficient
for this particular test, where the size parameter should be several
times the size of memory. However, the running time of the benchmark is
already excessive and is not likely to be increased.

===========================================================
Machine: arnold
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/arnold/comparison.html
Arch:    x86
CPUs:    1 socket, 2 threads
Model:   Pentium 4
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

This has regressed in almost all cases, although for this machine the
main damage was between 2.6.32 and 2.6.34. 3.2.9 performed particularly
badly. It is interesting to note that the 3.1 and 3.2 kernels both
swapped, and unexpected swapping has been seen in other tests.

==========================================================
Machine: hydra
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/hydra/comparison.html
Arch:    x86-64
CPUs:    1 socket, 4 threads
Model:   AMD Phenom II X4 940
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

This is a mixed bag. For low numbers of clients, throughput on
sequential reads has improved, with the exception of 3.2.9 which was a
disaster. For larger numbers of clients, it is a mix of gains and
losses. This could be due to weakness in the methodology caused by both
a small filesize and a small number of iterations.

Random read has improved. With the exception of 3.2.9, sequential
writes have generally improved. Random write has a number of
regressions, and 3.2.9 is a disaster. Kernels 3.1 and 3.2 had
unexpected swapping.

==========================================================
Machine: sandy
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-ext3/sandy/comparison.html
Arch:    x86-64
CPUs:    1 socket, 8 threads
Model:   Intel Core i7-2600
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

Like hydra, sequential reads were generally better for low numbers of
clients. 3.4 is notable in that it regressed. Unlike hydra, where the
first bad kernel for sequential reads was 3.2, on sandy it was 3.1.
There are differences in the memory sizes, and therefore the filesizes,
which implies there is not a single cause of the regression.

Random read has improved. Sequential writes have generally improved,
although it is interesting to note that 3.1 regressed. 3.4 is better
than 2.6.32, but it has regressed in comparison to 3.3. Random write
has generally improved, but again 3.4 is worse than 3.3. Like the other
machines, 3.1 and 3.2 saw unexpected swapping.

--
Mel Gorman
SUSE Labs
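The "2*RAM" size parameter from the benchmark notes can be derived
directly from the machine. A sketch (the tiobench invocation at the end
is shown as a comment with illustrative flags, since exact option names
vary between tiobench versions):

```shell
#!/bin/sh
# Compute the tiobench working-set size as twice physical RAM, in MB,
# as the benchmark notes describe.
MEM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
SIZE_MB=$((MEM_KB / 1024 * 2))
echo "tiobench size: ${SIZE_MB}MB"
# tiobench would then be invoked with this size, for example:
# tiobench.pl --size "$SIZE_MB" --threads 8    # illustrative flags
```

A size only twice RAM means part of the working set stays cached, which
is one reason the notes warn the methodology is weak; several times RAM
would be needed to guarantee the page cache is defeated.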
* [MMTests] Threaded IO Performance on xfs @ 2012-07-23 21:25 ` Mel Gorman

From: Mel Gorman @ 2012-07-23 21:25 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, linux-fsdevel, xfs

Configuration: global-dhp__io-threaded-xfs
Result: http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs
Benchmarks: tiobench

Summary
=======

There have been many improvements in the sequential read/write case,
but 3.4 is noticeably worse than 3.3 in a number of cases.

Benchmark notes
===============

mkfs was run on system startup with the parameters -f -d agcount=8. The
mount options inode64,delaylog,logbsize=262144,nobarrier were used for
the most part. On kernels too old to support delaylog, it was removed.
On kernels where it was the default, it was specified anyway and the
warning ignored.

The size parameter for tiobench was 2*RAM. This is barely sufficient
for this particular test, where the size parameter should be several
times the size of memory. However, the running time of the benchmark is
already excessive and is not likely to be increased.

===========================================================
Machine: arnold
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/arnold/comparison.html
Arch:    x86
CPUs:    1 socket, 2 threads
Model:   Pentium 4
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

This is a mixed bag. For low numbers of clients, throughput on
sequential reads has improved. For larger numbers of clients there are
many regressions, but this is not consistent. This could be due to
weakness in the methodology caused by both a small filesize and a small
number of iterations.

Random read is generally bad. For many kernels sequential write is
good, with the notable exceptions of the 2.6.39 and 3.0 kernels. There
was unexpected swapping on the 3.1 and 3.2 kernels.

==========================================================
Machine: hydra
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/hydra/comparison.html
Arch:    x86-64
CPUs:    1 socket, 4 threads
Model:   AMD Phenom II X4 940
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

Like arnold, performance for sequential read is good for low numbers of
clients. Random read looks good. With the exception of 3.0 in general,
and of single-threaded writes for all kernels, sequential writes have
generally improved. Random write has a number of regressions. Kernels
3.1 and 3.2 had unexpected swapping.

==========================================================
Machine: sandy
Result:  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-threaded-xfs/sandy/comparison.html
Arch:    x86-64
CPUs:    1 socket, 8 threads
Model:   Intel Core i7-2600
Disk:    Single Rotary Disk
==========================================================

tiobench
--------

Like hydra, sequential reads were generally better for low numbers of
clients. 3.4 is notable in that it regressed, and 3.1 was also bad,
which is roughly similar to what was seen on ext3. There are
differences in the memory sizes, and therefore the filesizes, which
implies there is not a single cause of the regression.

Random read has generally improved, with the obvious exception of the
single-threaded case. Sequential writes have generally improved, but it
is interesting to note that 3.4 is worse than 3.3, which was also seen
on ext3. Random write is a mixed bag, but again 3.4 is worse than 3.3.
Like the other machines, 3.1 and 3.2 saw unexpected swapping.

--
Mel Gorman
SUSE Labs
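The filesystem setup from the benchmark notes, written out as commands.
This is a dry-run sketch: the device and mountpoint names are
placeholders, and the invocations are printed rather than executed.

```shell
#!/bin/sh
# Print the mkfs and mount invocations matching the stated parameters:
# mkfs.xfs -f -d agcount=8, plus the inode64/delaylog/logbsize/nobarrier
# mount options. /dev/sdb1 and /mnt/xfs are placeholder names.
DEV=/dev/sdb1
MNT=/mnt/xfs
MKFS_ARGS="-f -d agcount=8"
MOUNT_OPTS="inode64,delaylog,logbsize=262144,nobarrier"
echo "mkfs.xfs $MKFS_ARGS $DEV"
echo "mount -t xfs -o $MOUNT_OPTS $DEV $MNT"
```

Note that nobarrier here parallels the ext3 barrier issue found in the
dbench4 results: with barriers explicitly disabled, changes to a
kernel's default barrier behaviour cannot skew the comparison.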
end of thread, other threads:[~2012-08-22 10:54 UTC | newest]

Thread overview: 108+ messages
2012-06-20 11:32 MMTests 0.04 Mel Gorman
2012-06-29 11:19 ` Mel Gorman
2012-06-29 11:21 ` [MMTests] Page allocator Mel Gorman
2012-06-29 11:22 ` [MMTests] Network performance Mel Gorman
2012-06-29 11:23 ` [MMTests] IO metadata on ext3 Mel Gorman
2012-06-29 11:24 ` [MMTests] IO metadata on ext4 Mel Gorman
2012-06-29 11:25 ` [MMTests] IO metadata on XFS Mel Gorman
2012-07-01 23:54 ` Dave Chinner
2012-07-02  6:32 ` Christoph Hellwig
2012-07-02 14:32 ` Mel Gorman
2012-07-02 19:35 ` Mel Gorman
2012-07-03  0:19 ` Dave Chinner
2012-07-03 10:59 ` Mel Gorman
2012-07-03 11:44 ` Mel Gorman
2012-07-03 12:31 ` Daniel Vetter
2012-07-03 13:08 ` Mel Gorman
2012-07-03 13:28 ` Eugeni Dodonov
2012-07-04  0:47 ` Dave Chinner
2012-07-04  9:51 ` Mel Gorman
2012-07-03 13:04 ` Mel Gorman
2012-07-03 14:04 ` Daniel Vetter
2012-07-02 13:30 ` Mel Gorman
2012-07-04 15:52 ` [MMTests] Page reclaim performance on ext3 Mel Gorman
2012-07-04 15:53 ` [MMTests] Page reclaim performance on ext4 Mel Gorman
2012-07-04 15:53 ` [MMTests] Page reclaim performance on xfs Mel Gorman
2012-07-05 14:56 ` [MMTests] Interactivity during IO on ext3 Mel Gorman
2012-07-10  9:49 ` Jan Kara
2012-07-10 11:30 ` Mel Gorman
2012-07-05 14:57 ` [MMTests] Interactivity during IO on ext4 Mel Gorman
2012-07-23 21:12 ` [MMTests] Scheduler Mel Gorman
2012-07-23 21:13 ` [MMTests] Sysbench read-only on ext3 Mel Gorman
2012-07-24  2:29 ` Mike Galbraith
2012-07-24  8:19 ` Mel Gorman
2012-07-24  8:32 ` Mike Galbraith
2012-07-23 21:14 ` [MMTests] Sysbench read-only on ext4 Mel Gorman
2012-07-23 21:15 ` [MMTests] Sysbench read-only on xfs Mel Gorman
2012-07-23 21:17 ` [MMTests] memcachetest and parallel IO on ext3 Mel Gorman
2012-07-23 21:19 ` [MMTests] memcachetest and parallel IO on xfs Mel Gorman
2012-07-23 21:20 ` [MMTests] Stress high-order allocations on ext3 Mel Gorman
2012-07-23 21:21 ` [MMTests] dbench4 async on ext3 Mel Gorman
2012-08-16 14:52 ` Jan Kara
2012-08-21 22:00 ` Jan Kara
2012-08-22 10:48 ` Mel Gorman
2012-07-23 21:23 ` [MMTests] dbench4 async on ext4 Mel Gorman
2012-07-23 21:24 ` [MMTests] Threaded IO Performance on ext3 Mel Gorman
2012-07-23 21:25 ` [MMTests] Threaded IO Performance on xfs Mel Gorman