* [Lustre-devel] LustreFS performance
From: Vitaly Fertman @ 2009-03-02 17:04 UTC
To: lustre-devel

****************************************************
LustreFS benchmarking methodology.
****************************************************

This document describes a benchmarking methodology that helps to understand LustreFS performance, to reveal LustreFS bottlenecks in different configurations on different hardware, and to ensure that the next LustreFS release does not downgrade compared with a previous one. In other words:

Goal1. Understand the HEAD performance.
Goal2. Compare HEAD and b1_6 (b1_8) performance.

To achieve Goal1, the methodology suggests testing the software layers in the bottom-top direction: the underlying back-end, the target server sitting on this back-end, the network connected to this target, how the target performs through this network, and so on up to the whole cluster. Each step makes only 1 change over the previous one: either a new layer is added or 1 configuration parameter is changed (for example, another network type or another back-end). Comparing the results of each test with the previous test gives the overhead of the added layer or the performance impact of the changed parameter.

To achieve Goal2, the methodology suggests going in the reverse, top-bottom direction: test some large sub-systems first and, if a downgrade vs. a previous LustreFS version is detected, perform more detailed tests. (This is considered the primary goal of the 2.0 Performance Team.)
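The layer-by-layer comparison described for Goal1 can be sketched as a small calculation. The snippet below is only an illustration and not part of the original plan: the function name `layer_overheads` is hypothetical, the throughput figures are made-up placeholders, and the test labels are borrowed from the plan purely as names.

```python
def layer_overheads(results):
    """Return (layer, absolute_loss, relative_loss) for each added layer.

    `results` is an ordered list of (test_name, throughput) pairs, lowest
    layer first. Each entry is assumed to differ from the previous one by
    exactly one added layer or one changed parameter.
    """
    overheads = []
    for (_, prev_rate), (name, rate) in zip(results, results[1:]):
        loss = prev_rate - rate
        overheads.append((name, loss, loss / prev_rate))
    return overheads


if __name__ == "__main__":
    # Placeholder throughputs in MB/s; these are NOT measured results.
    hypothetical = [("LLT1", 400.0), ("OSTT1", 360.0), ("OSTT2", 324.0)]
    for name, loss, frac in layer_overheads(hypothetical):
        print(f"{name}: -{loss:.1f} MB/s ({frac:.1%} below the previous layer)")
```

Keeping the input ordered bottom-up mirrors the "only 1 change per step" rule: a loss can be attributed to a single layer only if each pair of neighbouring entries differs in one thing.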
The document does not cover how to fix the problems revealed; that probably requires running some special-purpose test or compiling in oprofile, and it is out of scope of the document.

Obviously, it is not possible to perform all the thousands of tests in all the configurations, running all the special-purpose tests, etc. The document tries to prepare:
1) all the essential and sufficient tests to see how the system performs in general;
2) some minimal amount of essential tests to see how the system scales in different conditions.

Therefore, the plan does not guarantee that we will not miss a bottleneck or a bug; it just tries to cover the maximum possible scenarios in the most interesting conditions/environment states.

The amount of tests described below is already about 2K, and there will definitely be more, and it will take a lot of time to perform all of them and to analyze the results. So one of the major concerns here is how to minimize the amount of tests so that we do not miss some interesting case and are still able to get all the results within a reasonable amount of time. Please keep it in mind while looking at the tests below.

**** Hardware Requirements. ****

The test plan implies that we change only 1 parameter (cpu or disk or network) on each step. Thus, the HW requirements are:
-- at least 1 node with:
   CPU: 32;
   RAM: enough to have a tmpfs for MDS;
   DISK: raid, regular.
   NET: both GiGe and IB installed.
-- besides that: 8 clients, 4 other servers.
-- the other servers include:
   DISK: raid, regular.
   NET: both GiGe and IB installed.
-- client includes:
   NET: both GiGe and IB installed.

**** Software requirements ****

1. Short term.
1.1. mdsrate to be completed to test all the operations listed in MDST3 (see below).
1.2. mdsrate-**.sh to be fixed/written to run mdsrate properly and test all the operations listed in MDST3 (see below).
1.3. fake disk
   implement FAIL flag and report 'done' without doing anything in obdfilter to get a low-latency disk.
1.4. MT.
   add more tests here and implement them.
2. Long term.
2.1. mdtstack-survey
   - an echo client-server is to be written for mds similar to ost.
   - a test script similar to obdfilter-survey.sh is to be written.

**** Different configurations ****

Configuration of Node:
RAM. Amount of RAM on nodes (?)
CPU. Count of CPUs on nodes (1..32)
DISK. Disk type (regular, raid, tmpfs, fake)
JOUR. Journal type (internal, external, ram)

Q: which raid?
A: raid5, as it seems to be the most popular.

fake: to get a low-latency disk, it is preferable to report 'done' without doing anything in obdfilter once some FAIL flag is set. It is useful for OST testing because, first of all, it does not have the CPU overhead of the memcpy that tmpfs incurs, and it allows testing large amounts of data, in contrast to tmpfs. As a drawback, it skips the localfs code paths.

Configuration of Cluster:
CL. Amount of clients (1,2,4,8)
OSS. Amount of OSS nodes (1,2,4)
NET. Network type (GiGe, IB)
OSTN. Amount of OSTs per node (1,2,4)

Configuration of test:
TH. Amount of threads per client (1,2,4,8)
VER. Lustre version (b1_6, HEAD; later b1_8).
FEAT. Lustre features to turn off (COS, SA, RA, debug messages)
TEST. Specific test parameters.

**** Testing ****

Low Layers Testing (LLT).
LLT1. Raw disk (lustre-iokit: sgpdd-survey)
LLT2. Local filesystem (lustre-iokit: ior-survey, is fs mounted synchronously?)

Network Testing (NETT).
NETT1. lnet: lnet selftest.
NETT2. OBD: lustre-iokit: (obdfilter-survey, echo_client-osc-..-net-..-ost-echo_server)
NETT3. MD: (not ready)

OST Testing (OSTT).
OSTT1. Isolated OST (lustre-iokit: obdfilter-survey, echo_client-obdfilter-..-disk)
OSTT2. Remote OST (lustre-iokit: obdfilter-survey, echo_client-osc-..-ost-obdfilter-..-disk)
OSTT3. Client-OST IO (lustre-iokit: ost-survey, client-ost-disk).

MDS Testing (MDST).
MDST1. Isolated MDS test (not ready)
MDST2. Remote MDS test (not ready)
MDST3.
Simple Client-MDS operation test

Mixed testing (MT) (not ready)

**** Statistics ****

During all the tests the following is supposed to be running on all the servers:
1) vmstat
2) iostat, if there is some disk activity.
smth else?

*** Goal1. Understand the HEAD performance. ***

Goal1 describes the testing methodology in the bottom-top direction, from the lower layers (disk) to the complete Lustre cluster.

LLT1. Raw disk (lustre-iokit: sgpdd-survey)

RAM: fixed
CPU: 1
DISK: regular, raid, tmpfs (default=raid)
JOUR: -
CL: 1
OSS: 1
NET: -
OSTN: -
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size is specified as rszlo/rszhi=[1,4,64,1024K]
*) TH is specified as thrlo/thrhi=[1,2,4,8]
*) the amount of objects to work on in parallel: crglo=crghi=[1;TH]
   i.e. test only the cases when all the threads work on the same file and when each of them works on a separate file.
[4 bulks; separate or common dir]=8 tests;

Test matrix (TESTxTHxDISK):
Run TESTs with different amounts of threads for each DISK.
TESTxTHxDISK=(8x4 - 1)x3=93 tests. "-1" because TH=1 is already covered.
Total: 93 tests.

*** NETT1. lnet selftest. ***

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK: -
JOUR: -
CL: 1,2,4,8 (default=1)
OSS: 1
NET: GiGe, IB (default=IB)
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) test type: PING, READ, WRITE tests
*) bulk size for READ/WRITE: 1k,4k,64k,1M
[1 ping + 4 reads + 4 writes] = 9 tests

Test matrix (TESTxCLxTHxNETxCPU):
1. Multi-thread test.
   Run TESTs on CL=1 with different amounts of threads.
   TESTxTH=[1+4+4]x4=36 tests.
2. Multi-client test
   2.1. Let's check how clients scale vs. threads per client (TH=1).
   2.2. Let's check how the system scales with many clients and threads (TH=8).
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   [CL>1;TH=1,8]. TESTxCLxTH=9x3x2=54 tests.
3. Network test
   As the nature of IB is different from GiGe, we need to repeat all the tests from (1,2) here.
   36+54=90 tests.
4. CPU test
   Note: lnet fixes from Liang to be applied here.
   Run TESTs on different amounts of CPU. It is mostly interesting to look at a large amount of threads, as we are going to benefit from handling them in parallel. At the same time, if some HW (network) limit is reached, the result will not be very demonstrative, so test with 1 small & 1 large bulk size only: [1k; 1024K]:
   [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=5x1x4x(4-1)=60.
Total: 240 tests.

*** NETT2. OBD performance ***

lustre-iokit: obdfilter-survey, case=network.
The results of these tests are to be compared with the lnet results to get the osc+ost+ptlrpc overhead.

RAM: fixed
CPU: 1
DISK: -
JOUR: -
CL: 1,2,4,8 (default=1)
OSS: 1
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only the cases when all the threads work on the same file and when each of them works on a separate file.
[4 bulks; common or separate dir]=8 tests

Test matrix (TESTxTHxCLxNET):
1. Multi-thread test.
   Run TESTs on CL=1 with different amounts of threads.
   TESTxTH=8x4=32 tests
2. Multi-client test
   2.1. Let's check how clients scale vs. threads per client (TH=1).
   2.2. Let's check how the system scales with many clients and threads (TH=8).
   Note: to be more demonstrative, the maximum amount of threads should be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   [CL>1;TH=1,8]. TESTxCLxTH=8x3x2=48 tests.
3. Network test.
   Having IB results in hand after (1,2) and the results from NETT1, we already see how osc+ost+ptlrpc changes the behavior. There is no reason to repeat them for GiGe, it seems.
4. CPU test
   Note: lnet fixes from Liang to be applied here.
   Run TESTs on different amounts of CPU. It is mostly interesting to look at a large amount of threads, as we are going to benefit from handling them in parallel.
   At the same time, if some HW (network) limit is reached, the result will not be very demonstrative, so test with 1 small & 1 large bulk size only: [1k; 1024K]:
   [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(4-1)=48.
Total: 128 tests.

*** OSTT1. Isolated OST ***

lustre-iokit: obdfilter-survey, case=disk
The results of these tests are to be compared with the LLT results to get the OST stack overhead.

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK: regular, raid, fake (default=fake)
JOUR: int, ext, ram (default=int)
CL: 1
OSS: 1
NET: -
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024K)
*) TH is specified through: thrlo=1, thrhi=8 (1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only the cases when all the threads work on the same file and when each of them works on a separate file.
[4 bulks; common or separate dir]=8 tests

Test matrix (TESTxTHxOSTNxDISKxCPU):
1. Multi-thread test.
   Run TESTs on OSTN=1 with different amounts of threads.
   TESTxTH=8x4=32 tests
2. Multi-OST test
   2.1. Let's check how OSTs vs. threads per OST scale (TH=OSTN).
   2.2. Let's check how the system scales with many OSTs and threads (TH=8*OSTN).
   [OSTN>1;TH=OSTN,8*OSTN]. TESTxOSTNxTH=8x2x2=32 tests.
3. DISK test
   As the other disks are completely different, let's repeat most of (1,2) for the 2 others:
   [TH=OSTN;8*OSTN]: TESTxOSTNxTHxDISK=8x3x2x2=96
4. JOURNAL test.
   Limit the tests to the raid disk only. Limit the tests to only 1 large and 1 small bulk: [1,1024K].
   TESTxOSTNxTHxJOUR: 4x3x2x2=48
5. CPU test
   Note: lnet fixes from Liang to be applied here.
   Run TESTs on different amounts of CPU. It is better to perform it on a fast backend (DISK=fake) to see how CPU really matters. It is mostly interesting to look at a large amount of threads, as we are going to benefit from handling them in parallel.
   Also, run with a small & a large bulk only: [1,1024K]
   [OSTN=4,TH=1,2,4,8]: TESTxOSTNxTHxCPU=4x1x4x3=48
Total: 256 tests.

*** OSTT2. Real OST test ***

lustre-iokit: obdfilter-survey, case=netdisk
This test is a composition of the OBD performance and Isolated OST tests, so its results are to be compared with the NETT2 & OSTT1 results.

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK: fake
JOUR: int
CL: 1,2,4,8 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only the cases when all the threads work on the same file and when each of them works on a separate file.
[4 bulks; common or separate dir]=8 tests

Test matrix (TESTxTHxCLxCPUxNETxOSSxOSTN):
1. Multi-thread test.
   Run TESTs on CL=1 with different amounts of threads.
   TESTxTH=8x4=32 tests
2. Multi-client test
   2.1. Let's check how clients scale vs. threads per client (TH=1).
   2.2. Let's check how the system scales with many clients and threads (TH=8).
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   [CL>1;TH=1,8]. TESTxCLxTH=8x3x2=48 tests.
3. Network test
   Having IB results in hand after (1,2) and the results from NETT2, we already see how osc+ost+ptlrpc+obdfilter changes the behavior. Thus, there is no reason to repeat them for GiGe, it seems.
4. CPU test
   Note: lnet fixes from Liang to be applied here.
   Run TESTs on different amounts of CPU. It is mostly interesting to look at a large amount of threads, as we are going to benefit from handling them in parallel. At the same time, if some HW (network) limit is reached, the result will not be very demonstrative, so test with 1 small & 1 large bulk size only: [1k; 1024K]:
   [CL=8,TH=1,2,4,8].
   TESTxCLxTHxCPU=4x1x4x(4-1)=48.
5. OSTN test.
   The same OSC, network, CPU, disk; just check how scalable the OST stack (see tests 1,2) is.
   5.1. Let's check how N threads per 1 OST vs. 1 thread per N OSTs scales (CL=OSTN).
   5.2. Let's check how the system scales with many clients and threads (CL=8)
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   It seems enough to look at 1 small & 1 large bulk only: [1,1024K]
   [CL=OSTN,8;TH=1,8]. TESTxCLxTHxOSTN=4x2x2x2=32 tests
6. OSS test.
   6.1. Let's check how 1 thread per N OSTs vs. 1 thread per N OSSes scales (CL=OSS).
   6.2. Let's check how the system scales with many clients and threads (CL=8)
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   It seems enough to look at 1 small & 1 large bulk only: [1,1024K]
   [CL=OSS,8;TH=1,8]. TESTxCLxTHxOSS=4x2x2x2=32 tests
Total: 192 tests

*** OSTT3. Client-OST test ***

lustre-iokit: ior-survey.
The test results are to be compared with the OSTT2 results to get the overhead of the Lustre client: client stack, distributed locking, etc.

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK: fake
JOUR: int
CL: 1,2,4,8 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) CL is specified through $clients_hi
*) TH is specified through $tasks_per_client_hi
*) bulk is specified through rsize_lo/hi (1,4,64,1024K)
*) file_per_task=[0;1]
   i.e. test only the cases when all the threads work on the same file and when each of them works on a separate file.
[4 bulks; common or separate dir]=8 tests

Test matrix (TESTxTHxCLxCPU): absolutely the same as for OSTT2.

NETT3. MD: (not ready)
MDST1. Isolated MDS (not ready)
MDST2. Remote MDS (not ready)

This set of tests needs to be implemented in a utility similar to obdfilter-survey but for MDS testing.

MDST3.
Simple Client-MDS operation tests

1. create, mknod, mkdir (symlink, link??)

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK(MDS): tmpfs, raid, regular (default=tmpfs)
DISK(OST): tmpfs
JOUR: int,ext,ram (default=int)
CL: 1,2,4,8 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be mdsrate/mdsrate-create-small.sh, but it needs to be fixed to support all of these operations, not only create. If so:
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8]
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
[common or separate dir]=2 tests;
Note: we should probably limit the amount of files in 1 directory to 2M, otherwise the performance will definitely downgrade.

Test matrix (TESTxTHxCLxCPUxNETxOSS):
1. Multi-thread test.
   Run TESTs on CL=1 with different amounts of threads.
   TESTxTH=2x4-1=7 tests (not 8, because with TH=1 DIRNUM=1, and this is already covered).
2. Multi-client test
   2.1. Let's check how clients scale vs. threads per client (TH=1).
   2.2. Let's check how the system scales with many clients and threads (TH=8).
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   [CL>1;TH=1,8]. TESTxCLxTH=2x3x2=12 tests.
3. Striped test.
   3.1. Let's check how a multi-client system scales (TH=1).
   3.2. Let's check how a heavily loaded system scales (TH=8)
   Test (only) create with different striping.
   TESTxCLxTHxOSS=[2x4x2-1]x2=30
4. Network test
   Having IB results in hand after (1,2,3) and the results from NETT1, we already see how mdc+mdt-stack+ptlrpc changes the behavior. There is no reason to repeat them for GiGe, it seems.
5. DISK test.
   Unlike the OST testing, we do not have an echo-md client (MDST1), thus we have not checked how different disks impact the performance, so we need to check it here.
   Limit this test to only a couple of operations: create, mknod.
   As different disks are of completely different nature, we need to repeat most of (1,2) here
   [TH=1,8]: TESTxCLxTHxDISK=(2x4x2-1)x2=30
6. JOURNAL test.
   Repeat (5) for different journals, but limit the test to the raid disk only.
   TESTxCLxTHxDISKxJOUR=(2x4x2-1)x1x2=30
7. CPU test
   Note: lnet fixes from Liang to be applied here.
   Run TESTs on different amounts of CPU. Limit this test to only a couple of operations: create, mknod. It is mostly interesting to look at a large amount of threads, as we are going to benefit from handling them in parallel, so run it for CL=max only:
   [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=2x1x4x(4-1)=24
Total: 19 tests for mkdir, 103 for mknod, 133 for create.

2. lookup (mdsrate-lookup-1dir.sh => mdsrate-lookup.sh)

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK: tmpfs
JOUR: int
CL: 1,2,4,8 (default=1)
OSS: 1
NET: GiGe,IB (default=IB)
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be better to rework mdsrate-lookup-1dir.sh so that it can work in several directories in parallel.
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8] (to be added into the script)
*) CL is an amount of nodes specified in CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add READDIR_ORDER to test both random and readdir order lookups.
[common or separate dir; readdir, random order]=4 tests.

Q: it seems this test does md_getattr_name() instead of lookup, thus no lock enqueue is involved.
A: what about replacing it with access(2)??

Test matrix (TESTxTHxCLxCPUxNET): the same as for (1, mknod), but 4 tests instead of 2: 19x2=38 tests.

3.
stat

RAM: fixed
CPU: 1,2,8,32 (default=1)
DISK(MDS): tmpfs, raid, regular (default=tmpfs)
DISK(OST): tmpfs
JOUR: int,ext,ram (default=int)
CL: 1,2,4 (default=1)
OSS: 1,2,4 (default=1)
NET: GiGe,IB (default=IB)
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: mdsrate/mdsrate-stat-small.sh
*) add THREADS_PER_CLIENT to the script to specify TH
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add READDIR_ORDER to test both random and readdir order lookups.
[common or separate dir; readdir, random order]=4 tests.

Q: do we want to test stat(2) with other than tmpfs disk on OST? what journal should it have if so?

Test matrix (TESTxTHxCLxCPUxNETxDISKxJOUR): the same as for (1, create), but 4 tests instead of 2: 133x2=266 tests.

4. unlink (mdsrate-create-small.sh, run twice??)
   It should be run (and it is run in mdsrate-create-small.sh) for all the operations in (1), i.e. create, mkdir, mknod. The test matrix is the same and the total: 19 tests for mkdir, 103 for mknod, 133 for create.

5. chmod (mdsrate-chmod.sh, new one, fix mdsrate)
   The same as (1, mkdir) and the total: 19 tests.

6. utime (mdsrate-utime.sh, new one, fix mdsrate)
   The same as (1, mkdir) and the total: 19 tests.

7. chown (mdsrate-chown.sh, new one, fix mdsrate)
   The same as (1, create), but skip different DISKs & JOURNALs: 19 + 30 + 24 = 73 tests.

8. rename (mdsrate-rename.sh, new one, fix mdsrate)
   The same as (1, mkdir) and the total: 19 tests.

9. find
   Q: despite the fact we currently have a large downgrade with "find -f type", do we want to have this test in the general test set?

**** MT. Mixed testing. ****

MT1. Create-write test.
RAM: fixed
CPU: 32
DISK(MDS): tmpfs, raid (default=tmpfs)
DISK(OST): raid
JOUR: int
CL: 1,2,4,8 (default=1)
OSS: 1 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop, writes 1 bulk to each and closes it.
*) it is enough to test with a small bulk only: [1k]
*) [common or separate dir]=2 tests;

Test matrix (TESTxTHxCLxCPUxNETxOSS):
1. Multi-thread test.
   Run TESTs on CL=1 with different amounts of threads.
   TESTxTH=2x4-1=7 tests (not 8, because with TH=1 it is always in 1 dir, and this is already covered).
2. Multi-client test
   2.1. Let's check how clients scale vs. threads per client (TH=1).
   2.2. Let's check how the system scales with many clients and threads (TH=8).
   Note: to be more demonstrative, the maximum amount of threads could be taken <8, if TH=8 reaches the maximum network throughput with a small amount of clients.
   [CL>1;TH=1,8]. TESTxCLxTH=2x3x2=12 tests.
3. DISK test.
   Check how different disks impact the performance. As different disks are of completely different nature, we need to repeat most of (1,2) here
   [TH=1,8]: TESTxCLxTHxDISK=(2x4x2-1)x1=15
Total: 34 tests.

MT2. Create-Readdir test.

RAM: fixed
CPU: 32
DISK(MDS): tmpfs, raid (default=tmpfs)
DISK(OST): raid
JOUR: int
CL: 1,2,4,8 (default=1) (1 extra client does "ls -U")
OSS: 1 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop and immediately closes them. 1 thread on another client does "ls -U". It is done in 1 directory.

The test matrix is exactly the same as for MT1.
Total: 34 tests.

MT3. ??? Some more tests ????

**** Goal2. Compare HEAD and b1_6 (b1_8) performance. ****

This paragraph describes the testing methodology in the reverse order of testing, i.e. in the top-bottom direction, making sure the new LustreFS (HEAD) version does not downgrade compared with the previous ones (b1_6/b1_8). Therefore, the first testing cycle includes: MT, MDST3, OSTT3, NETT1
from the above tests. If a downgrade is detected, lower-layer tests are to be run until the downgrade disappears.

--
Vitaly
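The per-step counts quoted in the plan can be cross-checked mechanically. The sketch below transcribes the formulas from the message into Python and sums them per test group. The layout and the names `GROUPS`, `totals` and `grand_total` are illustrative choices, not anything from lustre-iokit or mdsrate, and where the plan's wording and arithmetic differ (for example lookup, which refers to mknod but computes 19x2), the stated arithmetic is followed.

```python
# Per-step test counts, transcribed from the formulas in the plan.
GROUPS = {
    "LLT1": [(8 * 4 - 1) * 3],
    "NETT1": [9 * 4, 9 * 3 * 2, 36 + 54, 5 * 1 * 4 * (4 - 1)],
    "NETT2": [8 * 4, 8 * 3 * 2, 4 * 1 * 4 * (4 - 1)],
    "OSTT1": [8 * 4, 8 * 2 * 2, 8 * 3 * 2 * 2, 4 * 3 * 2 * 2, 4 * 1 * 4 * 3],
    "OSTT2": [8 * 4, 8 * 3 * 2, 4 * 1 * 4 * (4 - 1), 4 * 2 * 2 * 2, 4 * 2 * 2 * 2],
}

# MDST3 building blocks for operation 1 (create, mknod, mkdir).
THREADS = 2 * 4 - 1                 # multi-thread test
CLIENTS = 2 * 3 * 2                 # multi-client test
STRIPED = (2 * 4 * 2 - 1) * 2       # striped test, create only
DISK = (2 * 4 * 2 - 1) * 2          # disk test, create and mknod
JOURNAL = (2 * 4 * 2 - 1) * 1 * 2   # journal test, raid disk only
CPU = 2 * 1 * 4 * (4 - 1)           # CPU test, CL=8 only

MKDIR = THREADS + CLIENTS
MKNOD = MKDIR + DISK + JOURNAL + CPU
CREATE = MKNOD + STRIPED


def totals():
    """Return a dict mapping each test group to its total test count."""
    out = {name: sum(steps) for name, steps in GROUPS.items()}
    out["OSTT3"] = out["OSTT2"]            # "absolutely the same as for OSTT2"
    out["MDST3.1 create/mknod/mkdir"] = CREATE + MKNOD + MKDIR
    out["MDST3.2 lookup"] = MKDIR * 2      # the plan states 19x2
    out["MDST3.3 stat"] = CREATE * 2       # the plan states 133x2
    out["MDST3.4 unlink"] = CREATE + MKNOD + MKDIR
    out["MDST3.5 chmod"] = MKDIR
    out["MDST3.6 utime"] = MKDIR
    out["MDST3.7 chown"] = MKDIR + STRIPED + CPU
    out["MDST3.8 rename"] = MKDIR
    mixed = THREADS + CLIENTS + (2 * 4 * 2 - 1) * 1
    out["MT1"] = mixed
    out["MT2"] = mixed                     # same matrix as MT1
    return out


def grand_total():
    """Sum of all group totals."""
    return sum(totals().values())


if __name__ == "__main__":
    for name, count in totals().items():
        print(f"{name:30s} {count:5d}")
    print(f"{'all groups':30s} {grand_total():5d}")
```

Under this reading the per-group sums agree with the totals stated in the plan, and the overall sum lands a little above 2000, in line with the "about 2K" estimate. Groups that are marked "not ready" contribute nothing here.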
* [Lustre-devel] LustreFS performance
From: Andreas Dilger @ 2009-03-02 20:45 UTC
To: lustre-devel

On Mar 02, 2009 20:04 +0300, Vitaly Fertman wrote:
> RAM: enough to have a tmpfs for MDS;

Note that strictly speaking we need to use ldiskfs on a ramdisk, not tmpfs, because we don't have an fsfilt_tmpfs.

> Q: which raid?
> A: raid5, as it seems to be the most popular.

I would propose:
- for MDT it needs to be RAID-1+0, because of small, random IO sizes
- for OST it needs to be RAID-6, because of double-failure risk
(see Lustre Manual "RAID" section for discussion)

> **** Statistics ****
>
> During all the tests the following is supposed to be running on all
> the servers:
> 1) vmstat
> 2) iostat, if there is some disk activity.
> smth else?

I would propose either LLNL's LMT or HP's collectl, which both also collect Lustre stats. Those both provide more information than the above, and having the IO/CPU load correlated to Lustre RPC counts is very useful.

> MDST3
>
> Q: do we want to test stat(2) with other then tmpfs disk on OST?
> what journal should it have if so?

I would be quite interested in the performance numbers from just the ramdisk MDT+OST, to see what the upper limit of the protocol and network are.

> 9. find
> Q: despite the fact we currently have a large downgrade with
> "find -f type", do we want to have this test in the general test set?

Some of that performance loss should have been fixed recently. We should continue to test it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] LustreFS performance
From: Oleg Drokin @ 2009-03-04 17:19 UTC
To: lustre-devel

Hello!

On Mar 2, 2009, at 3:45 PM, Andreas Dilger wrote:
> On Mar 02, 2009 20:04 +0300, Vitaly Fertman wrote:
>> RAM: enough to have a tmpfs for MDS;
> Note that strictly speaking we need to use ldiskfs on a ramdisk, not
> tmpfs, because we don't have an fsfilt_tmpfs.

The idea was loop device on tmpfs, I think.

Bye,
    Oleg
* [Lustre-devel] LustreFS performance
From: Jeff Darcy @ 2009-03-04 17:28 UTC
To: lustre-devel

Oleg Drokin wrote:
> On Mar 2, 2009, at 3:45 PM, Andreas Dilger wrote:
>> On Mar 02, 2009 20:04 +0300, Vitaly Fertman wrote:
>>> RAM: enough to have a tmpfs for MDS;
>> Note that strictly speaking we need to use ldiskfs on a ramdisk, not
>> tmpfs, because we don't have an fsfilt_tmpfs.
>
> The idea was loop device on tmpfs, I think.

FYI, this is exactly what we do with our FabriCache feature - i.e. both MDT and OSTs are actually loopback files on tmpfs. Modulo a few issues with preallocated write space eating all storage leaving none for actual data, it works rather well producing high performance numbers and giving LNDs a good workout. BTW, the loopback driver does copies and is disturbingly single-threaded, which can create a bottleneck. This can be worked around with multiple instances per node, though.
* [Lustre-devel] LustreFS performance
From: Andreas Dilger @ 2009-03-05 21:27 UTC
To: lustre-devel

On Mar 04, 2009 12:28 -0500, Jeff Darcy wrote:
> Oleg Drokin wrote:
>> On Mar 2, 2009, at 3:45 PM, Andreas Dilger wrote:
>>> Note that strictly speaking we need to use ldiskfs on a ramdisk, not
>>> tmpfs, because we don't have an fsfilt_tmpfs.
>>
>> The idea was loop device on tmpfs, I think.
>
> FYI, this is exactly what we do with our FabriCache feature - i.e. both
> MDT and OSTs are actually loopback files on tmpfs.

The problem with using a loop device instead of a ramdisk is that you now have 2 layers of indirection - MDS->ldiskfs->loop->tmpfs->RAM instead of MDS->ldiskfs->RAM.

The drawback (or possibly benefit) is that ramdisks consume a fixed amount of RAM and are not "sparse" (AFAIK, that may have changed since I last looked into this). That said, once a block is written to by mke2fs or by ldiskfs in the loop->tmpfs case it will also never be freed again, so you only get some marginal benefit.

> Modulo a few issues
> with preallocated write space eating all storage leaving none for actual
> data, it works rather well producing high performance numbers and giving
> LNDs a good workout. BTW, the loopback driver does copies and is
> disturbingly single-threaded, which can create a bottleneck. This can
> be worked around with multiple instances per node, though.

Even better, if you have some development skills, would be to implement (or possibly resurrect) an fsfilt-tmpfs layer. Since tmpfs isn't going to be recoverable anyways (I assume you just reformat from scratch when there is a crash), then you can make all of the transaction handling as no-ops, and just implement the minimal interfaces needed to work.
That would allow unlinked files to release space from tmpfs, and also avoid the fixed allocation overhead and journaling of ldiskfs, probably saving you 5% of RAM (more on the MDS) and a LOT of memcpy() overhead.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] LustreFS performance
From: Oleg Drokin @ 2009-03-09 2:50 UTC
To: lustre-devel

Hello!

On Mar 5, 2009, at 4:27 PM, Andreas Dilger wrote:
> Even better, if you have some development skills, would be to
> implement (or possibly resurrect) an fsfilt-tmpfs layer. Since tmpfs
> isn't going to be recoverable anyways (I assume you just reformat
> from scratch when there is a crash), then you can make all of the
> transaction handling as no-ops, and just implement the minimal
> interfaces needed to work.
> That would allow unlinked files to release space from tmpfs, and also
> avoid the fixed allocation overhead and journaling of ldiskfs, probably
> saving you 5% of RAM (more on the MDS) and a LOT of memcpy() overhead.

This is exactly what I was trying to avoid. I tried to measure things as if I had an infinitely fast disk only, and I still needed all the journal/blockdevice and other such things to take the CPU they would normally take. After all we cannot expect people to actually run real MDSes on tmpfs unless they have some means to replicate that MDS somewhere else.

Bye,
    Oleg
* [Lustre-devel] LustreFS performance
From: Andreas Dilger @ 2009-03-09 8:29 UTC
To: lustre-devel

On Mar 08, 2009 22:50 -0400, Oleg Drokin wrote:
> On Mar 5, 2009, at 4:27 PM, Andreas Dilger wrote:
>> Even better, if you have some development skills, would be to
>> implement (or possibly resurrect) an fsfilt-tmpfs layer. Since
>> tmpfs isn't going to be recoverable anyways (I assume you just
>> reformat from scratch when there is a crash), then you can make
>> all of the transaction handling as no-ops, and just implement
>> the minimal interfaces needed to work.
>>
>> That would allow unlinked files to release space from tmpfs, and also
>> avoid the fixed allocation overhead and journaling of ldiskfs, probably
>> saving you 5% of RAM (more on the MDS) and a LOT of memcpy() overhead.
>
> This is exactly what I was trying to avoid.
> I tried to measure things as if I had an infinitely fast disk only, and I
> still needed all the journal/blockdevice and other such things to take
> the CPU they would normally take.
> After all we cannot expect people to actually run real MDSes on tmpfs
> unless they have some means to replicate that MDS somewhere else.

In fact, it wasn't my proposal for the metadata performance testing itself, but rather SiCortex are apparently running with RAM-backed filesystems for some kind of fast cache filesystem (e.g. distributed shared memory or similar). I was only proposing their implementation could be more efficient.

I could imagine flash-cache type applications storing checkpoints for a short time in a RAM-backed OST pool before migrating them to persistent storage.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] LustreFS performance 2009-03-02 20:45 ` Andreas Dilger 2009-03-04 17:19 ` Oleg Drokin @ 2009-03-10 14:39 ` Nicholas Henke 2009-03-10 15:07 ` Mark Seger 1 sibling, 1 reply; 16+ messages in thread From: Nicholas Henke @ 2009-03-10 14:39 UTC To: lustre-devel Andreas Dilger wrote: > On Mar 02, 2009 20:04 +0300, Vitaly Fertman wrote: >> RAM: enough to have a tmpfs for MDS; >> **** Statistics **** >> >> During all the tests the following is supposed to be running on all >> the servers: >> 1) vmstat >> 2) iostat, if there is some disk activity. >> smth else? > > I would propose either LLNL's LMT or HP's collectl, which both also > collect Lustre stats. Those both provide more information than the > above, and having the IO/CPU load correlated to Lustre RPC counts is > very useful. It would be great if we could standardize on a set of tools for performance issues. I've got to think a set of tools like this would make it easier for customers & partners to gather the correct data the first time. Cray has been using lstats, a package of scripts we got from Sun a while back. We've added things like AT timeout and sar per-cpu usage to it (see bug 18574 att 22140 for the complete set of scripts). I'm all for using collectl, but I think the requirements and setup for LMT make it a tough sell. Does Sun have a set of customizations for collectl or does the standard collectl collect enough information? Nic
* [Lustre-devel] LustreFS performance 2009-03-10 14:39 ` Nicholas Henke @ 2009-03-10 15:07 ` Mark Seger 0 siblings, 0 replies; 16+ messages in thread From: Mark Seger @ 2009-03-10 15:07 UTC To: lustre-devel >>> **** Statistics **** >>> >>> During all the tests the following is supposed to be running on all >>> the servers: >>> 1) vmstat >>> 2) iostat, if there is some disk activity. >>> smth else? >>> >> I would propose either LLNL's LMT or HP's collectl, which both also >> collect Lustre stats. Those both provide more information than the >> above, and having the IO/CPU load correlated to Lustre RPC counts is >> very useful. >> > > It would be great if we could standardize on a set of tools for performance > issues. I've got to think a set of tools like this would make it easier for > customers & partners to gather the correct data the first time. > > Cray has been using lstats, a package of scripts we got from Sun a while back. > We've added things like AT timeout and sar per-cpu usage to it (see bug 18574 > att 22140 for the complete set of scripts). > > I'm all for using collectl, but I think the requirements and setup for LMT make > it a tough sell. Does Sun have a set of customizations for collectl or does > the standard collectl collect enough information? > My goal when I wrote collectl was to provide one-stop shopping for as much system performance data as seemed relevant and view lustre as only one of many data sources. To that end, if you do a merge of all the data collected by the *stat utilities, sar, perfquery (for IB), many of the lustre stats (but not all) and maybe a few others you'll get closer to understanding what collectl can collect. On the output side you can pick and choose what to display - when used interactively only those data elements are collected but when run as a daemon you can collect them all and replay the data as often as you like looking at different slices.
As for LMT I haven't played with it as my interests are in dealing with all data. However, as an exercise left to the reader, there are a number of switches for changing collectl's display as well as --home which moves the cursor to the terminal's home position before displaying the output, giving a display similar to the feel of top. If you want to display what's happening to lustre and your disks, cpu, etc all at the same time on a refreshing display, --top is definitely the way to go. And finally, if you want something totally different and are feeling creative, just write your own print routines in perl and tell collectl to use them with the --export switch. -mark
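As a baseline for the vmstat/iostat collection quoted above, here is a minimal Python sketch that keeps both samplers running for the duration of one test and stops them afterwards. The 5-second interval, the log file names and the helper names (`collector_commands`, `collect_stats`) are assumptions made for illustration, not part of the plan; collectl or LMT would simply replace the two command lines.

```python
import contextlib
import subprocess
from pathlib import Path


def collector_commands(interval_s=5, disk_activity=True):
    """Build the sampler command lines: vmstat always, iostat only with disk I/O."""
    cmds = {"vmstat": ["vmstat", str(interval_s)]}
    if disk_activity:
        # -x asks iostat (sysstat) for extended per-device statistics.
        cmds["iostat"] = ["iostat", "-x", str(interval_s)]
    return cmds


@contextlib.contextmanager
def collect_stats(log_dir, interval_s=5, disk_activity=True):
    """Run the samplers in the background while the body of the `with` executes."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    procs, logs = [], []
    try:
        for name, cmd in collector_commands(interval_s, disk_activity).items():
            log = open(log_dir / f"{name}.log", "w")
            logs.append(log)
            procs.append(
                subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            )
        yield log_dir
    finally:
        # Stop every sampler that was started, then close its log file.
        for proc in procs:
            proc.terminate()
            proc.wait()
        for log in logs:
            log.close()
```

On each server one would wrap a single benchmark invocation in `with collect_stats("some/log/dir"):` so that every test gets its own pair of logs to correlate with the Lustre results.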
* [Lustre-devel] LustreFS performance 2009-03-02 17:04 ` [Lustre-devel] LustreFS performance Vitaly Fertman 2009-03-02 20:45 ` Andreas Dilger @ 2009-03-10 11:55 ` Mallik Ragampudi 2009-03-10 16:40 ` Vitaly Fertman 1 sibling, 1 reply; 16+ messages in thread From: Mallik Ragampudi @ 2009-03-10 11:55 UTC To: lustre-devel Vitaly, This is very comprehensive. A few comments: 1) I think it would be good to start with LLT2 (lustre-iokit: ior-survey) and get an out-of-the-box performance picture/comparison (1.6 and 2.0) from the whole cluster before testing the individual layers. 2) Can this plan be extended to include CMD performance testing as well? I would expect that most of your test cases apply to CMD as well? 3) I assume the "CPU" parameter in your methodology refers to # of CPUs in MDS, right? 4) I am not sure if we can test 32 cores without hitting other bottlenecks beyond 8 cores on the servers. This will cut down some combinations in the matrix. Thanks, Mallik Vitaly Fertman wrote: > **************************************************** > LustreFS benchmarking methodology. > **************************************************** > > The document aims to describe the benchmarking methodology which helps > to understand the LustreFS performance and reveal LustreFS bottlenecks in > different configurations on different hardware, to ensure the next > LustreFS > release does not downgrade comparing with a previous one. In other words: > Goal1. Understand the HEAD performance. > Goal2. Compare HEAD and b1_6 (b1_8) performance. > > To achieve the Goal1, the methodology suggests to test different > layers of > software in the bottom-top direction, i.e. the underlying back-end, > the target > server sitting on this back-end, the network connected to this target > and how > the target performs through this network, etc up to the whole cluster.
> Each next step has only 1 change over the previous one, it is either a > new layer > added or 1 parameter in the configuration is changed (probably another > network > type or another back-end). Comparing the results of each test with the > previous > test, we get the overhead of the added layer or the performance impact of > changing this parameter. > > To achieve the Goal2, the methodology suggests to go in the reverse > top-bottom > direction, i.e. to test some large sub-systems first and, if a > downgrade vs. a previous > LustreFS version is detected, to perform more detailed tests. (This is > considered as > the primary goal of the 2.0 Performance Team). > > The document does not cover the way of fixing revealed problems, > probably some > special purpose test needs to be run or oprofile needs to be compiled > in -- it is our > of scope of the document. > > Obviously, it is not possible to perform all the thousands of tests in > all the configurations, > running all the special purpose tests, etc, the document tries to > prepare: > 1) all the essential and sufficient tests to see how the system > performs in general; > 2) some minimal amount of essential tests to see how the system scales > in different > conditions. > Therefore, the plan does not guarantee we will not miss a bottleneck > or a bug, it just > tries to cover maximum possible scenarios in most interesting > conditions/environment > states. > > The amount of tests described below is already about 2K, and there > will be definitely more, > and it will take a lot of time to perform all of them and to analyze > the results. So one of > the major concerns here is how to minimize the amount of test so that > we would not miss > some interesting case and would be able to get all the results within > a reasonable amount of > time. Please keep it in mind while looking at the tests below. > > **** Hardware Requirements. 
**** > > The test plan implies that we change only 1 parameter (cpu or disk or > network) > on each step. Thus, the HW requirements are: > > -- at least 1 node with: > CPU:32; > RAM: enough to have a tmpfs for MDS; > DISK: raid, regular. > NET: both GiGe and IB installed. > -- besides that: 8 clients, 4 other servers. > -- the other servers include: > DISK: raid, regular. > NET: both GiGe and IB installed. > -- client includes: > NET: both GiGe and IB installed. > > **** Software requirements **** > > 1. Short term. > 1.1 mdsrate > to be completed to test all the operations listed in MDST3 (see below). > 1.2 mdsrate-**.sh > to be fixed/written to run mdsrate properly and test all the > operations listed in > MDST3 (see below). > 1.3. fake disk > implement FAIL flag and report 'done' without doing anything in > obdfilter to get > a low-latency disk. > 1.4. MT. > add more tests here and implement them. > > 2. Long term. > 2.1. mdtstack-survey > - an echo client-server is to be written for mds similar to ost. > - a test script similar to obdfilter-survey.sh is to be written. > > **** Different configurations **** > > Configuration of Node: > RAM. Amount of RAM on nodes (?) > CPU. Count of CPUs on nodes (1..32) > DISK. Disk type (regular, raid, tmpfs, fake) > JOUR. Journal type (internal, external, ram) > > Q: which raid? > A: raid5, as it seems to be the most popular. > > fake: to get a low-latency disk, it is preferable to report 'done' > without doing anything > in obdfilter once some FAIL flag is set. It is useful for OST testing, > because first of all, > it does not have a CPU overhead of memcpy of using tmpfs and it lets > to test large > amount of data in contrast to tmpfs. As a drawback, it skips the > localfs code paths. > > Configuration of Cluster: > CL. Amount of clients (1,2,4,8) > OSS. Amount of OSS nodes (1,2,4) > NET. Network type (GiGe, IB) > OSTN. Amount of OST per nodes (1,2,4) > > Configuration of test. > TH. 
Amount of threads per client (1,2,4,8) > VER. Lustre version (b1_6, HEAD. later b1_8). > FEAT. Lustre features to turn off (COS, SA, RA, debug messages) > TEST. Specific test parameters. > > **** Testing **** > Low Layers Testing (LLT) > LLT1. Raw disk (lustre-iokit:sgpdd-survey) > LLT2. Local filesystem (lustre-iokit: ior-survey, is fs mounted > synchronously?) > > Network Testing (NETT). > NETT1. lnet: lnetself test. > NETT2. OBD: lustre-iokit: (obdfilter-survey, > echo_client-osc-..-net-..-ost-echo_server) > NETT3. MD: (not ready) > > OST Testing (OSTT). > OSTT1. Isolated OST (lustre-iokit: obdfilter-survey, > echo_client-obdfilter-..-disk) > OSTT2. Remote OST (lustre-iokit: obdfilter-survey, > echo_client-osc-..-ost-obdfilter-..-disk) > OSTT3. Client-OST IO (lustre-iokit: ost-survey, client-ost-disk). > > MDS Testing (MDST). > MDST1. Isolated MDS test (not ready) > MDST2. Remote MDS test (not ready) > MDST3. Simple Client-MDS operation test > > Mixed testing (MT) (not ready) > > **** Statistics **** > > During all the tests the following is supposed to be running on all > the servers: > 1) vmstat > 2) iostat, if there is some disk activity. > smth else? > > *** Goal1. Understand the HEAD performance. *** > > The Goal1 describes the testing methodology in the bottom-top direction, > from the lower layers (disk) to the complete Lustre cluster. > > LLT1. Raw disk (lustre-iokit:sgpdd-survey) > RAM: fixed > CPU: 1 > DISK: regular,raid,tmpfs (default=raid) > JOUR:- > CL: 1 > OSS:1 > NET: - > OSTN:- > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *)bulk size is specified as rszlo/rszhi=[1,4,64,1024K] > *)TH is specified as thrlo/thrhi=[1,2,4,8] > *)the amount of objects to work on in parallel: crglo=crghi=[1;TH] > i.e. test only cases when all the threads work on the same file and > when all of them work on a separate file. > [bulk;separate or commin dir]=8 tests; > > Test matrix(TESTxTHxDISK): > Run TESTs with different amount of threads for each DISK. 
> TESTxTHxDISK=(8x4 - 1)x3=93 tests. > "-1" because TH=1 is already covered. > > Total:93 tests. > > *** NETT1. lnetself test.*** > > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK: - > JOUR:- > CL: 1,2,4,8 (default=1) > OSS:1 > NET: GiGe, IB (default=IB) > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *) test type: PING,READ,WRITE tests > *) bulk size for READ/WRITE: 1k,4k,64k,1M > [1 ping + 4 reads + 4 writes] = 9 tests > > Test matrix (TESTxCLxTHxNETxCPU): > 1. Multi-thread test. > Run TESTs on CL=1 with different amount of threads. > TESTxTH=[1+4+4]x4=36 tests. > 2. Multi-client test > 2.1. Let's check how clients scale vs. threads per client (TH=1). > 2.2. Let's check how the system scale with many clients and threads > (TH=8). > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > [CL>1;TH=1,8]. TESTxCLxTH=9x3x2=54 tests. > 3. Network test > As the nature of IB is different from GiGe, we need to repeat all the > tests > from (1,2) here. 36+54=90 tests. > 4. CPU test > Note: lnet fixes from Liang to be applied here. > Run TESTs on different amount of CPU. > It is mostly interesting to look at large amount of threads, as we are > going to benefit from handling them in parallel. > At the same time, if some HW (network) limit is reached, the result > will not be > very demonstrative, so test with 1 small & 1 large bulk size > only:[1k;1024K]: > [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=5x1x4x(4-1)=60. > > Total: 240 tests. > > *** NETT2. OBD performance *** > lustre-iokit: obdfilter-survey, case=network. > > The results of this tests are to be compared with lnet results to get > the osc+ost+ptlrpc overhead. 
> > RAM: fixed > CPU: 1 > DISK: - > JOUR:- > CL: 1,2,4,8 (default=1) > OSS:1 > NET: IB > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *) bulk size: rszlo=rszhi=N (1,4,64,1024) > *) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8) > *) the amount of objects is: nobjlo=nobjhi=[1;TH] > i.e. test only cases when all the threads work on the same file and > when all of them work on a separate file. > [4 bulks; common or separate dir]=8 tests > > Test matrix(TESTxTHxCLxNET): > 1. Multi-thread test. > Run TESTs on CL=1 with different amount of threads. TESTxTH=8x4=32 tests > 2. Multi-client test > 2.1. Let's check how clients scale vs. threads per client (TH=1). > 2.2. Let's check how the system scale with many clients and threads > (TH=8). > Note: to be more demonstrative, the maximum amount of threads should > be taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > [CL>1;TH=1,8]. TESTxCLxTH=8x3x2=48 tests. > 3. Network test. > Having IB results in hand after (1,2) and these results from NETT1, we > already see how > osc+ost+ptlrpc changes the behavior. There is no reason to repeat them > for GiGe, it seems. > 4.CPU test > Note: lnet fixes from Liang to be applied here. > Run TESTs on different amount of CPU. > It is mostly interesting to look at large amount of threads, as we are > going to benefit from handling them in parallel. > At the same time, if some HW (network) limit is reached, the result > will not be > very demonstrative, so test with 1 small & 1 large bulk size > only:[1k;1024K]: > [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(4-1)=48. > > Total: 128 tests. > > *** OSTT1. Isolated OST *** > lustre-iokit: obdfilter-survey, case=disk > > The results of this tests are to be compared with LLT results to get > the OST > stack overhead. 
> > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK: regular, raid, fake (default=fake) > JOUR: int, ext, ram, (default=int) > CL: 1 > OSS:1 > NET: - > OSTN:1,2,4 (default=1) > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *) bulk size: rszlo=rszhi=N (1,4,64,1024K) > *) TH is specified through: thrlo=1, thrhi=8 (1,2,4,8) > *) each OST is supposed to be configured on a separate disk. > *) the amount of objects is: nobjlo=nobjhi=[1;TH] > i.e. test only cases when all the threads work on the same file and > when all of them work on a separate file. > [4 bulks; common of separate dir]=8 tests > > Test matrix(TESTxTHxOSTNxDISKxCPU): > 1. Multi-thread test. > Run TESTs on OSTN=1 with different amount of threads. TESTxTH=8x4=32 > tests > 2. Multi-OST test > 2.1. Let's check how OSTs vs. threads per OST scale (TH=OSTN). > 2.2. Let's check how the system scale with many OSTs and threads > (TH=8*OSTN). > [OSTN>1;TH=OSTN,8*OSTN]. TESTxOSTNxTH=8x2x2=32 tests. > 3. DISK test > As other disks are completely different, so lets repeat most of the > (1,2) for 2 others: > [TH=OSTN;8*OSTN]: TESTxOSTNxTHxDISK=8x3x2x2=96 > 4. JOURNAL test. > Limit the tests with only raid-disk. > Limit the test with only 1 large and 1 small bulk:[1,1024K]. > TESTxOSTNxTHxJOUR: 4x3x2x2=48 > 5. CPU test > Note: lnet fixes from Liang to be applied here. > Run TESTs on different amount of CPU. It is better to perform it on a > fast > backend (DISK=fake) to see how CPU really matters. > It is mostly interesting to look at large amount of threads, as we are > going > to benefit from handling them in parallel. > Also, run with a small & a large bulk only:[1,1024K] > [OSTN=4,TH=1,2,4,8]: TESTxOSTNxTHxCPU=4x1x4x3=48 > > Total: 256 tests. > > *** OSTT2. Real OST test *** > lustre-iokit: obdfilter-survey, case=netdisk > > This test is a composition of OBD performance and Isolated OST tests, > so its results are to be compared with NETT2 & OSTT1 results. 
> > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK: fake > JOUR:int > CL: 1,2,4,8 (default=1) > OSS:1,2,4 (default=1) > NET: IB > OSTN:1,2,4 (default=1) > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *) bulk size: rszlo=rszhi=N (1,4,64,1024) > *) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8) > *) each OST is supposed to be configured on a separate disk. > *) the amount of objects is: nobjlo=nobjhi=[1;TH] > i.e. test only cases when all the threads work on the same file and > when all of them work on a separate file. > [4 bulks; common of separate dir]=8 tests > > Test matrix(TESTxTHxCLxCPUxNETxOSSxOSTN): > 1. Multi-thread test. > Run TESTs on CL=1 with different amount of threads. TESTxTH=8x4=32 tests > 2. Multi-client test > 2.1. Let's check how clients scale vs. threads per client (TH=1). > 2.2. Let's check how the system scale with many clients and threads > (TH=8). > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > [CL>1;TH=1,8]. TESTxCLxTH=8x3x2=48 tests. > 3. Network test > Having IB results in hand after (1,2) and these results from NETT2, we > already see how > osc+ost+ptlrpc+obdfilter changes the behavior. Thus, there is no > reason to repeat them > for GiGe, it seems. > 4.CPU test > Note: lnet fixes from Liang to be applied here. > Run TESTs on different amount of CPU. > It is mostly interesting to look at large amount of threads, as we are > going to benefit from handling them in parallel. > At the same time, if some HW (network) limit is reached, the result > will not be > very demonstrative, so test with 1 small & 1 large bulk size > only:[1k;1024K]: > [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(4-1)=48. > 5. OSTN test. > The same OSC, network, CPU, disk, just check how OST stack (see 1,2 > tests) is scalable. > 5.1. Let's check how N threads per 1 OST vs. 1 thread per N OST scales > (CL=OSTN). > 5.2. 
Let's check how the system scale with many clients and threads > (CL=8) > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > It seems enough to look at 1 small & 1 large bulk only: [1,1024K] > [CL=OSTN,8;TH=1,8]. TESTxCLxTHxOSTN=4x2x2x2=32 tests > 6. OSS test. > 6.1. Let's check how 1 thread per N OST vs. 1 thread per N OSS scales > (CL=OSS). > 6.2. Let's check how the system scale with many clients and threads > (CL=8) > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > It seems enough to look at 1 small & 1 large bulk only: [1,1024K] > [CL=OSS,8;TH=1,8]. TESTxCLxTHxOSTN=4x2x2x2=32 tests > > Total:192 tests > > *** OSTT3. Client-OST test *** > lustre-iokit: ior-survey. > > The test results are to be compared with OSTT2 results to get the > overhead > for Lustre Client: client stack, distributed locking, etc. > > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK: fake > JOUR:int > CL: 1,2,4,8 (default=1) > OSS:1,2,4 (default=1) > NET: IB > OSTN:1,2,4 (default=1) > TH: 1,2,4,8 (default=1) > F: debug > TEST: > *) CL is specified through $clients_hi > *) TH is specified through $tasks_per_client_hi > *) bulk is specified through rsize_lo/hi (1,4,64,1028K) > *) file_per_task=[0;1] > i.e. test only cases when all the threads work on the same file and > when all of them work on a separate file. > [4 bulks; common of separate dir]=8 tests > > Test matrix(TESTxTHxCLxCPU): absolutely the same as for OSTT2. > > NETT3. MD: (not ready) > MDST1. Isolated MDS (not ready) > MDST2. Remote MDS (not ready) > This set of tests need to be implemented in a utility similar to > obdfilter-survey > but for MDS testing. > > MDST3. Simple Client-MDS operation tests > > 1. create,mknod,mkdir (symlink, link??) 
> RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK(MDS): tmpfs, raid, regular (default=tmpfs) > DISK(OST): tmpfs > JOUR: int,ext,ram (default=int) > CL: 1,2,4,8 (default=1) > OSS:1,2,4 (default=1) > NET: IB > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: it will be probably mdsrate/mdsrate-create-small.sh, but it > needs to be > fixed to support all of these operations, not only create. If so: > *) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8] > *) CL is specified through CLIENTS or NODES_TO_USE. > *) NOSINGLE should be provided > *) add --dirnum option to COMMAND > *) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the > same dir and when each works in a separate one. > *) nfiles is files-per-dir * DIRNUM > [common or separate dir]=2tests; > > Note: we should probably limit the amount of files in 1 directory with > 2M, > otherwise the performance will definitely downgrade. > > Test matrix(TESTxTHxCLxCPUxNETxOSS): > 1. Multi-thread test. > Run TESTs on CL=1 with different amount of threads. TESTxTH=2x4-1=7 tests > (not 8 as if TH=1, DIRNUM=1, and this is already covered). > 2. Multi-client test > 2.1. Let's check how clients scale vs. threads per client (TH=1). > 2.2. Let's check how the system scale with many clients and threads > (TH=8). > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > [CL>1;TH=1,8]. TESTxCLxTH=2x3x2=12 tests. > 3. Striped test. > 2.1. Let's check how multi-client system scales (TH=1). > 2.2. Let's check how large load system scales (TH=8) > Test (only) create with different stripeness. > TESTxCLxTHxOSS=[2x4x2-1]x2=30 > 4. Network test > Having IB results in hand after (1,2,3) and these results from NETT1, > we already see > how mdc+mdt-stack+ptlrpc changes the behavior. There is no reason to > repeat them > for GiGe, it seems. > 5. DISK test. 
> Unlink the OST testing, we do not have echo-md client (MDTT1), thus we > have not checked > how different disks impact the performance, so we need to check it here. > Limit this test with only couple of operations: create, mknod. > As different disks are of completely different nature we need to > repeat most of (1,2) here > [TH=1,8]: TESTxCLxTHxDISK=(2x4x2-1)x2=30 > 6. JOURNAL test. > Repeat (5) for different journals, but limit the test with raid-disk > only. TESTxCLxTHxDISKxJOUR=(2x4x2-1)x1x2=30 > 7.CPU test > Note: lnet fixes from Liang to be applied here. > Run TESTs on different amount of CPU. > Limit this test with only couple of operations: create, mknod. > It is mostly interesting to look at large amount of threads, as we are > going to benefit from handling them in parallel, so run it for CL=max > only: [CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=2x1x4x(4-1)=24 > > Total: 19 tests for mkdir, 103 for mknod, 133 for create. > > 2. lookup (mdsrate-lookup-1dir.sh => mdsrate-lookup.sh) > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK: tmpfs > JOUR: int > CL: 1,2,4,8 (default=1) > OSS:1 > NET: GiGe,IB (default=IB) > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: it will be probably better to work out mdsrate-lookup-1dir.sh, > which > could work in several directories in parallel. > *) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8] > (to be added into the script) > *) CL is an amount of nodes specified in CLIENTS or NODES_TO_USE. > *) NOSINGLE should be provided > *) add --dirnum option to COMMAND > *) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the > same dir and when each works in a separate one. > *) nfiles is files-per-dir * DIRNUM > *) add READDIR_ORDER to test both random and readdir order lookups. > [common or separate di; readdir,random order]=4 tests. > > Q: it seems this test does md_getattr_name(), instead of lookup, thus > no lock > enqueue is involved. > A: what about to replace it with access(2)?? 
> > Test matrix(TESTxTHxCLxCPUxNET): the same as for (1, mknod), but 4 tests > instead of 2: 19x2=38 tests. > > 3. stat > > RAM: fixed > CPU: 1,2,8,32 (default=1) > DISK(MDS): tmpfs, raid, regular (default=tmpfs) > DISK(OST): tmpfs > JOUR: int,ext,ram (default=int) > CL: 1,2,4 (default=1) > OSS:1,2,4 (default=1) > NET: GiGe,IB (default=IB) > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: mdsrate/mdsrate-stat-small.sh > *) add THREADS_PER_CLIENT to the script to specify TH > *) CL is specified through CLIENTS or NODES_TO_USE. > *) NOSINGLE should be provided > *) add --dirnum option to COMMAND > *) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the > same dir and when each works in a separate one. > *) nfiles is files-per-dir * DIRNUM > *) add READDIR_ORDER to test both random and readdir order lookups. > [common or separate dir; readdir,random order]=4 tests. > > Q: do we want to test stat(2) with other then tmpfs disk on OST? > what journal should it have if so? > > Test matrix(TESTxTHxCLxCPUxNETxDISKxJOUR): the same as for (1, create), > but 4 tests instead of 2: 133x2=266 tests. > > 4. unlink (mdsrate-create-small.sh, run twice??) > it should be run (and it is run in mdsrate-create-small.sh) for all > the operations in (1), i.e. create, mkdir, mknod. > The test matrix is the same and the total: > 19 tests for mkdir, 103 for mknod, 133 for create. > > 5. chmod (mdsrate-chmod.sh, new one, fix mdsrate) > The same as (1, mkdir) and the total: 19 tests. > > 6. utime (mdsrate-utime.sh, new one, fix mdsrate) > The same as (1, mkdir) and the total: 19 tests. > > 7. chown (mdsrate-chown.sh, new one, fix mdsrate) > The same as (1, create), but skip different DISKs&JOURNALs: > 19 + 30 + 24=73 tests. > > 8. rename (mdsrate-rename.sh, new one, fix mdsrate) > The same as (1, mkdir) and the total: 19 tests. > > 9. find > Q: despite the fact we currently have a large downgrade with > "find -f type", do we want to have this test in the general test set? 
> > **** MT. Mixed testing. **** > > MT1. Create-write test. > RAM: fixed > CPU: 32 > DISK(MDS): tmpfs, raid (default=tmpfs) > DISK(OST): raid > JOUR: int > CL: 1,2,4,8 (default=1) > OSS:1 (default=1) > NET: IB > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: must be a new one. Each thread creates files in a loop, writes 1 > bulk to each and closes it. > *) it is enough to test with a small bulk only: [1k] > *) [common or separate dir]=2tests; > > Test matrix(TESTxTHxCLxCPUxNETxOSS): > 1. Multi-thread test. > Run TESTs on CL=1 with different amount of threads. TESTxTH=2x4-1=7 tests > (not 8 as if TH=1, it is always in 1 dir, and this is already covered). > 2. Multi-client test > 2.1. Let's check how clients scale vs. threads per client (TH=1). > 2.2. Let's check how the system scale with many clients and threads > (TH=8). > Note: to be more demonstrative, the maximum amount of threads could be > taken > <8, if TH=8 reaches the maximum network throughput with small amount > of clients. > [CL>1;TH=1,8]. TESTxCLxTH=2x3x2=12 tests. > 3. DISK test. > Check how different disks impact on the performance. > As different disks are of completely different nature we need to > repeat most of (1,2) here > [TH=1,8]: TESTxCLxTHxDISK=(2x4x2-1)x1=15 > > Total: 34 tests. > > MT2. Create-Readdir test. > RAM: fixed > CPU: 32 > DISK(MDS): tmpfs, raid (default=tmpfs) > DISK(OST): raid > JOUR: int > CL: 1,2,4,8 (default=1) (1 extra client does "ls -U") > OSS:1 (default=1) > NET: IB > OSTN:1 > TH: 1,2,4,8 (default=1) > F: debug > TEST: must be a new one. Each thread creates files in a loop and > immediately closes them. > 1 thread on another client does "ls -U". It is done in 1 directory. > > The test matrix is exactly the same as for MT1. Total: 34 tests. > > MT3. ??? Some more tests ???? > > **** Goal2. Compare HEAD and b1_6 (b1_8) performance. **** > > This paragraph describes the testing methodology in the reverse order > of testing, > i.e. 
in the top-bottom direction, making sure new LustreFS (HEAD) > version does > not downgrade comparing with the previous ones (b1_6/b1_8). > > Therefore, the first testing cycle includes: > MT, MDST3, OSTT3, NETT1. > from the above tests. In the case a downgrade is detected, lower layer > tests are > to be run until the downgrade disappears. > > -- > Vitaly -- Mallik Ragampudi (877)860-5044 Lustre Engineering x52907 Sun Microsystems Mallikarjunarao.Ragampudi at sun.com
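The per-step test counts in the quoted plan can be re-added mechanically. The following Python sketch copies the step expressions from the plan as written and compares each sum with the total the plan states; the `PLAN` table and the `check` helper are illustrative names only, and only sections whose step breakdown is spelled out in the plan are included.

```python
# Each entry: (per-step counts as written in the plan, total stated in the plan).
PLAN = {
    # (8 tests x 4 TH - 1) x 3 DISK
    "LLT1": ([(8 * 4 - 1) * 3], 93),
    # multi-thread, multi-client, repeat of both for GiGe, CPU
    "NETT1": ([9 * 4, 9 * 3 * 2, 36 + 54, 5 * 1 * 4 * (4 - 1)], 240),
    # multi-thread, multi-client, CPU (no GiGe repeat)
    "NETT2": ([8 * 4, 8 * 3 * 2, 4 * 1 * 4 * (4 - 1)], 128),
    # multi-thread, multi-OST, disk, journal, CPU
    "OSTT1": ([8 * 4, 8 * 2 * 2, 8 * 3 * 2 * 2, 4 * 3 * 2 * 2, 4 * 1 * 4 * 3], 256),
    # multi-thread, multi-client, CPU, OSTN, OSS
    "OSTT2": ([8 * 4, 8 * 3 * 2, 4 * 1 * 4 * (4 - 1), 4 * 2 * 2 * 2, 4 * 2 * 2 * 2], 192),
    # multi-thread, multi-client, striped, disk, journal, CPU
    "MDST3 create": (
        [2 * 4 - 1, 2 * 3 * 2, (2 * 4 * 2 - 1) * 2, (2 * 4 * 2 - 1) * 2,
         (2 * 4 * 2 - 1) * 1 * 2, 2 * 1 * 4 * (4 - 1)],
        133,
    ),
    # multi-thread, multi-client, disk
    "MT1": ([2 * 4 - 1, 2 * 3 * 2, (2 * 4 * 2 - 1) * 1], 34),
}


def check(plan):
    """Return {section: (sum of the step counts, total stated in the plan)}."""
    return {name: (sum(steps), stated) for name, (steps, stated) in plan.items()}


for name, (computed, stated) in check(PLAN).items():
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{name}: computed {computed}, stated {stated} ({status})")
```

Keeping the counts in a table like this makes it easy to see how much each added parameter costs when the matrix is trimmed or extended.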
* [Lustre-devel] LustreFS performance 2009-03-10 11:55 ` Mallik Ragampudi @ 2009-03-10 16:40 ` Vitaly Fertman 0 siblings, 0 replies; 16+ messages in thread From: Vitaly Fertman @ 2009-03-10 16:40 UTC To: lustre-devel On Mar 10, 2009, at 2:55 PM, Mallik Ragampudi wrote: > Vitaly, > > This is very comprehensive. Few comments: > > 1) I think it would be good to start with LLT2 (lustre-iokit: ior-survey) and get an out-of-the-box performance > picture/comparison (1.6 and 2.0) from the whole cluster before > testing the individual layers. LLT2 is about the underlying local fs on the servers, not Lustre. The overall ior-survey is OSTT3 (it was wrongly mentioned as ost-survey in the document header, sorry, it must be ior-survey as written in the detailed section). > 2) Can this plan be extended to include CMD performance testing as > well ? I would expect that most > of your test cases apply for the CMD as well ? sure, I will extend it. > 3) I assume the "CPU" parameter in your methodology refers to # of > CPUs in MDS, right ? to the server which is being tested at the moment, i.e. OST for OST* tests, MDS for MDS* tests. In the mixed section it refers to MDS indeed. > 4) I am not sure if we can test 32 cores without hitting other > bottlenecks beyond 8 cores on the servers. if we hit some bottleneck, we will get a problem to solve, which is good ;) however, to see the bottleneck from these results, it is better to run MT tests with different amount of CPUs (the same as for all the CPU tests, 1,2,8,32). > This will cut down some combinations in the matrix. > > Thanks, > Mallik > -- Vitaly
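Vitaly's suggestion to run the MT tests at each CPU count can be enumerated mechanically. This is a minimal Python sketch that assumes the sweep has the same shape as the plan's other CPU tests (CL=8, TH=1,2,4,8, common or separate directory). The thread does not fix this matrix, so the parameter choice and the field names are assumptions for illustration.

```python
from itertools import product

CPU_COUNTS = (1, 2, 8, 32)            # CPU values named in the reply
THREADS = (1, 2, 4, 8)                # TH values used throughout the plan
DIR_LAYOUTS = ("common", "separate")  # all threads in one dir, or one dir each


def mt_cpu_sweep(clients=8):
    """List one configuration per (CPU, TH, directory layout) combination."""
    return [
        {"cpu": cpu, "clients": clients, "threads": th, "dirs": layout}
        for cpu, th, layout in product(CPU_COUNTS, THREADS, DIR_LAYOUTS)
    ]


for cfg in mt_cpu_sweep():
    print(cfg)
```

Under these assumptions the sweep yields 4 x 4 x 2 configurations per MT test, of which the CPU=32 rows overlap with what MT1 and MT2 already run.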
* [Lustre-devel] LustreFS performance (update) [not found] ` <02FEAA2B-8D98-4C2D-9CE8-FF6E1EB135A2@sun.com> 2009-03-02 17:04 ` [Lustre-devel] LustreFS performance Vitaly Fertman @ 2009-03-19 19:34 ` Vitaly Fertman 2009-03-19 20:16 ` Andrew C. Uselton 2009-03-20 5:47 ` parinay kondekar 1 sibling, 2 replies; 16+ messages in thread From: Vitaly Fertman @ 2009-03-19 19:34 UTC (permalink / raw) To: lustre-devel **************************************************** LustreFS benchmarking methodology. **************************************************** The document aims to describe the benchmarking methodology which helps to understand the LustreFS performance and reveal LustreFS bottlenecks in different configurations on different hardware, to ensure the next LustreFS release does not downgrade comparing with a previous one. In other words: Goal1. Understand the HEAD performance. Goal2. Compare HEAD and b1_6 (b1_8) performance. To achieve the Goal1, the methodology suggests to test different layers of software in the bottom-top direction, i.e. the underlying back-end, the target server sitting on this back-end, the network connected to this target and how the target performs through this network, etc up to the whole cluster. Each next step has only 1 change over the previous one, it is either a new layer added or 1 parameter in the configuration is changed (probably another network type or another back-end). Comparing the results of each test with the previous test, we get the overhead of the added layer or the performance impact of changing this parameter. To achieve the Goal2, the methodology suggests to go in the reverse top-bottom direction, i.e. to test some large sub-systems first and, if a downgrade vs. a previous LustreFS version is detected, to perform more detailed tests. (This is considered as the primary goal of the 2.0 Performance Team). 
The document does not cover the way of fixing revealed problems;
probably some special purpose test needs to be run or oprofile needs to
be compiled in -- it is out of scope of the document.

Obviously, it is not possible to perform all the thousands of tests in
all the configurations, running all the special purpose tests, etc. The
document tries to prepare:
1) all the essential and sufficient tests to see how the system performs
   in general;
2) some minimal amount of essential tests to see how the system scales
   in different conditions.

Therefore, the plan does not guarantee we will not miss a bottleneck or
a bug; it just tries to cover the maximum possible scenarios in the most
interesting conditions/environment states. The amount of tests described
below is already about 2K, and there will definitely be more, and it
will take a lot of time to perform all of them and to analyze the
results. So one of the major concerns here is how to minimize the amount
of tests so that we would not miss some interesting case and would be
able to get all the results within a reasonable amount of time. Please
keep it in mind while looking at the tests below.

**** Hardware Requirements. ****

The test plan implies that we change only 1 parameter (cpu or disk or
network) on each step. The HW requirements are:

-- at least 1 node with:
   CPU: 32;
   RAM: enough to have a ramdisk for MDS;
   DISK: enough disks for raid6 or raid1+0 (as this node could be mds
         or ost); an extra disk for external journal;
   NET: both GiGe and IB installed.
-- at least 1 other node with:
   DISK: enough disks for raid6 or raid1+0 (as this node could be mds
         or ost); an extra disk for external journal;
-- besides that: 8 clients, 3 other servers.
-- the other servers include:
   DISK: raid6
   NET: IB installed.
-- client includes:
   NET: both GiGe and IB installed.

**** Software requirements ****

1. Short term.
1.1 mdsrate
    to be completed to test all the operations listed in MDST3 (see below).
1.2 mdsrate-**.sh
    to be fixed/written to run mdsrate properly and test all the
    operations listed in MDST3 (see below).
1.3. fake disk
    implement FAIL flag and report 'done' without doing anything in
    obdfilter to get a low-latency disk.
1.4. MT.
    add more tests here and implement them.

2. Long term.
2.1. mdtstack-survey
    - an echo client-server is to be written for mds similar to ost.
    - a test script similar to obdfilter-survey.sh is to be written.

**** Different configurations ****

Configuration of Node:
RAM. Amount of RAM on nodes (?)
CPU. Count of CPUs on nodes (1..32)
DISK. Disk type (raid, ramdisk, fake)
JOUR. Journal type (internal, external, ram)

Q: which raid?
A: use RAID 1+0 for MDS; RAID6 for OST.

fake: to get a low-latency disk, it is preferable to report 'done'
without doing anything in obdfilter once some FAIL flag is set. It is
useful for OST testing because, first of all, it does not have the CPU
overhead of the memcpy that a ramdisk incurs, and it allows testing a
large amount of data, in contrast to a ramdisk. As a drawback, it skips
the localfs code paths.

Note: OSS back-end has write-through cache; MDS back-end has write-back
cache.

Configuration of Cluster:
CL. Amount of clients (1,2,4,8)
MDS. Amount of MDS nodes (1,2,4).
OSS. Amount of OSS nodes (1,2,4)
NET. Network type (GiGe, IB)
OSTN. Amount of OSTs per node (1,2,4)

Configuration of test.
TH. Amount of threads per client (1,2,4,8)
VER. Lustre version (b1_6, HEAD. later b1_8).
FEAT. Lustre features to turn off (COS, SA, RA, debug messages)
TEST. Specific test parameters.

**** Testing ****

Low Layers Testing (LLT)
LLT1. Raw disk (lustre-iokit:sgpdd-survey)
LLT2. Local filesystem (lustre-iokit: ior-survey, is fs mounted
      synchronously?)

Network Testing (NETT).
NETT1. lnet: lnet selftest.
NETT2. OBD: lustre-iokit: (obdfilter-survey,
       echo_client-osc-..-net-..-ost-echo_server)
NETT3. MD: (not ready)

OST Testing (OSTT).
OSTT1. Isolated OST (lustre-iokit: obdfilter-survey,
       echo_client-obdfilter-..-disk)
OSTT2.
Remote OST (lustre-iokit: obdfilter-survey,
       echo_client-osc-..-ost-obdfilter-..-disk)
OSTT3. Client-OST IO (lustre-iokit: ior-survey, client-ost-disk).

MDS Testing (MDST).
MDST1. Isolated MDS test (not ready)
MDST2. Remote MDS test (not ready)
MDST3. Simple Client-MDS operation test

Mixed testing (MT) (not ready)

**** Statistics ****

During all the tests the following is supposed to be running on all the
servers:
1) HP collectl or LLNL's LMT;
2) something else?

*** Goal1. Understand the HEAD performance. ***

Goal1 describes the testing methodology in the bottom-top direction,
from the lower layers (disk) to the complete Lustre cluster.

LLT1. Raw disk (lustre-iokit:sgpdd-survey)

RAM: fixed
CPU: 1
DISK: raid,ramdisk,fake (default=raid)
JOUR: -
CL: 1
OSS: 1
NET: -
OSTN: -
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size is specified as rszlo/rszhi=[1,4,64,1024K]
*) TH is specified as thrlo/thrhi=[1,2,4,8]
*) the amount of objects to work on in parallel: crglo=crghi=[1;TH]
   i.e. test only cases when all the threads work on the same file and
   when all of them work on a separate file.
TEST=[bulk; separate or common file]=8 tests;

Test matrix(TESTxTHxDISK):
Run TESTs with different amount of threads for each DISK.
TESTxTHxDISK=(8x4 - 1)x3=93 tests. "-1" because TH=1 is already covered.

Total: 93 tests.

*** NETT1. lnet selftest. ***

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: -
JOUR: -
CL: 1,8 (default=1)
OSS: 1
NET: GiGe, IB (default=IB)
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) test type: PING,READ,WRITE tests
*) bulk size for READ/WRITE: 1k,4k,64k,1M
[1 ping + 4 reads + 4 writes] = 9 tests

Test matrix (TESTxCLxTHxNETxCPU):
1. Multi-thread test.
Run TESTs on CL=1 with different amount of threads.
TESTxTH=[1+4+4]x4=36 tests.

2. Multi-client test
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=9x1x4=36 tests.

3.
Network test
As the nature of IB is different from GiGe, we need to repeat all the
tests from (1,2) here.
36+36=72 tests.

4. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel. At the same time, if some HW (network) limit is
reached, the result will not be very demonstrative, so test with 1 small
& 1 large bulk size only: [1k; 1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=5x1x4x(3-1)=40.

Total: 184 tests.

*** NETT2. OBD performance ***

lustre-iokit: obdfilter-survey, case=network.
The results of these tests are to be compared with lnet results to get
the osc+ost+ptlrpc overhead.

RAM: fixed
CPU: 1
DISK: -
JOUR: -
CL: 1,8 (default=1)
OSS: 1
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only cases when all the threads work on the same file and
   when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix(TESTxTHxCL):
1. Multi-thread test.
Run TESTs on CL=1 with different amount of threads.
TESTxTH=8x4=32 tests

2. Multi-client test
Note: to be more demonstrative, the maximum amount of threads should be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=8x1x4=32 tests.

3. Network test.
Having IB results in hand after (1,2) and these results from NETT1, we
already see how osc+ost+ptlrpc changes the behavior. There is no reason
to repeat them for GiGe, it seems.

4. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel.
At the same time, if some HW (network) limit is reached, the result will
not be very demonstrative, so test with 1 small & 1 large bulk size
only: [1k; 1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(4-1)=48.

Total: 112 tests.

*** OSTT1. Isolated OST ***

lustre-iokit: obdfilter-survey, case=disk
The results of these tests are to be compared with LLT results to get
the OST stack overhead.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: raid, fake (default=fake)
JOUR: int, ext, ram (default=ext)
CL: 1
OSS: 1
NET: -
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024K)
*) TH is specified through: thrlo=1, thrhi=8 (1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only cases when all the threads work on the same file and
   when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix(TESTxTHxOSTNxDISKxJOURxCPU):
1. Multi-thread test.
Run TESTs on OSTN=1 with different amount of threads.
TESTxTH=8x4=32 tests

2. Multi-OST test
2.1. Let's check how OSTs vs. threads per OST scale (TH=OSTN).
2.2. Let's check how the system scales with many OSTs and threads
     (TH=8*OSTN).
[OSTN>1;TH=OSTN,8*OSTN]. TESTxOSTNxTH=8x2x2=32 tests.

3. DISK test
As other disks are completely different, let's repeat most of (1,2) for
2 others:
[TH=OSTN;8*OSTN]: TESTxOSTNxTHxDISK=8x3x2x1=48

4. JOURNAL test.
Limit the tests to the raid disk only.
Limit the tests to only 1 large and 1 small bulk: [1,1024K].
TESTxOSTNxTHxJOUR: 4x3x2x2=48

5. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU.
It is better to perform it on a fast backend (DISK=fake) to see how CPU
really matters. It is mostly interesting to look at a large amount of
threads, as we are going to benefit from handling them in parallel.
Also, run with a small & a large bulk only: [1,1024K]
[OSTN=4,TH=1,2,4,8]: TESTxOSTNxTHxCPU=4x1x4x2=32

Total: 192 tests.

*** OSTT2. Real OST test ***

lustre-iokit: obdfilter-survey, case=netdisk
This test is a composition of the OBD performance and Isolated OST
tests, so its results are to be compared with NETT2 & OSTT1 results.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: fake
JOUR: ext
CL: 1,8 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) bulk size: rszlo=rszhi=N (1,4,64,1024)
*) TH is specified through: thrlo=1, thrhi=8 (thread count, 1,2,4,8)
*) each OST is supposed to be configured on a separate disk.
*) the amount of objects is: nobjlo=nobjhi=[1;TH]
   i.e. test only cases when all the threads work on the same file and
   when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix(TESTxTHxCLxCPUxOSSxOSTN):
1. Multi-thread test.
Run TESTs on CL=1 with different amount of threads.
TESTxTH=8x4=32 tests

2. Multi-client test
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=8x1x4=32 tests.

3. Network test
Having IB results in hand after (1,2) and these results from NETT2, we
already see how osc+ost+ptlrpc+obdfilter changes the behavior. Thus,
there is no reason to repeat them for GiGe, it seems.

4. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel. At the same time, if some HW (network) limit is
reached, the result will not be very demonstrative, so test with 1 small
& 1 large bulk size only: [1k; 1024K]:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=4x1x4x(3-1)=32.

5. OSTN test.
The same OSC, network, CPU, disk; just check how scalable the OST stack
(see 1,2 tests) is.
5.1. Let's check how N threads per 1 OST vs.
1 thread per N OST scales (CL=OSTN).
5.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
As the difference from (1,2) is on the OSS part only, it is enough to
test in separate directories only. It seems enough to look at 1 small &
1 large bulk only: [1,1024K]
[CL=OSTN,8;TH=1,8]. TEST=2. TESTxCLxTHxOSTN=2x2x2x2=16 tests

6. OSS test.
6.1. Let's check how 1 thread per N OST vs. 1 thread per N OSS scales
     (CL=OSS).
6.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
As the difference from (1,2) is on the OSS part only, it is enough to
test in separate directories only. It seems enough to look at 1 small &
1 large bulk only: [1,1024K]
[CL=OSS,8;TH=1,8]. TEST=2. TESTxCLxTHxOSTN=2x2x2x2=16 tests

Total: 128 tests

*** OSTT3. Client-OST test ***

lustre-iokit: ior-survey.
The test results are to be compared with OSTT2 results to get the
overhead of the Lustre Client: client stack, distributed locking, etc.

RAM: fixed
CPU: 1,8,32 (default=1)
DISK: fake
JOUR: ext
CL: 1,8 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1,2,4 (default=1)
TH: 1,2,4,8 (default=1)
F: debug
TEST:
*) CL is specified through $clients_hi
*) TH is specified through $tasks_per_client_hi
*) bulk is specified through rsize_lo/hi (1,4,64,1024K)
*) file_per_task=[0;1]
   i.e. test only cases when all the threads work on the same file and
   when all of them work on a separate file.
TEST=[4 bulks; common or separate file]=8 tests

Test matrix(TESTxTHxCLxCPUxOSSxOSTN): absolutely the same as for OSTT2.

NETT3. MD: (not ready)
MDST1. Isolated MDS (not ready)
MDST2.
Remote MDS (not ready)

This set of tests needs to be implemented in a utility similar to
obdfilter-survey but for MDS testing.

MDST3. Simple Client-MDS operation tests

1. create,mknod,mkdir (symlink, link??)

RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR(MDS): int,ext,ram (default=ext)
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be mdsrate/mdsrate-create-small.sh, but it needs
to be fixed to support all of these operations, not only create. If so:
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8]
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the
   same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
[common or separate dir]=2 tests;
Note: we should probably limit the amount of files in 1 directory to 2M,
otherwise the performance will definitely downgrade.

Test matrix(TESTxTHxCLxCPUxMDSxOSSxDISKxJOUR):
1. Multi-thread test. (mknod)
Run TESTs on CL=1 with different amount of threads.
TESTxTH=2x4-1=7 tests (not 8, because if TH=1 then DIRNUM=1, and this is
already covered).

2. Multi-client test (mknod)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=2x1x4=8 tests.

3. OSS (create)
3.1. Let's check how multi-client system scales (TH=OSS).
3.2. Let's check how large load system scales (TH=8)
As the difference from (1,2) is on the OSS part only, it is enough to
test in separate directories only. Stripeness is [1, -1]. TEST=2.
TESTxCLxTHxOSS=[2x2x2]x2 + [2x2x2]x1(1OSS case)=24

4. Network test
Having IB results in hand after (1,2,3) and these results from NETT1, we
already see how mdc+mdt-stack+ptlrpc changes the behavior.
There is no reason to repeat them for GiGe, it seems.

5. DISK test. (mknod)
Unlike the OST testing, we do not have an echo-md client (MDTT1), thus
we have not checked how different disks impact the performance, so we
need to check it here. As different disks are of a completely different
nature, we need to repeat most of (1,2) here
[TH=1,8]: TESTxCLxTHxDISK=(2x2x2-1)x1=7

6. JOURNAL test. (mknod)
Repeat (5) for different journals, but limit the test to the raid disk
only.
TESTxCLxTHxDISKxJOUR=(2x2x2-1)x1x2=15

7. CPU test (mknod)
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel, so run it for CL=8 only
TESTxCLxTHxCPU=2x1x4x(3-1)=16

8. CMD test. (mkdir)
8.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDS scales
     (CL=MDS).
8.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
These tests happen in a separate directory (for each thread) only; it is
enough to test with the nid creation policy only. TEST=1.
TESTxCLxTHxMDS=1x2x4x2=16 tests.

Total: 16 tests for mkdir, 53 for mknod, 24 for create.

2. stat

RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: mdsrate/mdsrate-stat-small.sh
*) add THREADS_PER_CLIENT to the script to specify TH
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the
   same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add READDIR_ORDER to test readdir access order (random order is not
   very interesting for stat).
[common or separate dir; readdir order]=2 tests.

Test matrix(TESTxTHxCLxCPUxMDSxOSS):
1. Multi-thread test.
Run TESTs on CL=1 with different amount of threads.
TESTxTH=2x4-1=7 tests (not 16, because if TH=1 then DIRNUM=1, and this
is already covered).

2. Multi-client test
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=2x1x4=8 tests.

3. OSS.
3.1. Let's check how multi-client system scales (TH=OSS).
3.2. Let's check how large load system scales (TH=8)
As the difference from (1,2) is on the OSS part only, it is enough to
test in separate directories only. Test must be done for create with
different stripeness: [1, -1]. TEST=2.
TESTxCLxTHxOSS=[2x2x2]x2=16

4. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel, so run it for CL=8 only.
TESTxCLxTHxCPU=2x1x4x(3-1)=16

5. CMD test. (mkdir)
5.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDS scales
     (CL=MDS).
5.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
1 creation policy (nid) is enough: TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 79 tests.

3. unlink (mdsrate-create-small.sh)

RAM: fixed
CPU(MDS): 1,8,32 (default=1)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR(MDS): int,ext,ram (default=ext)
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: it will probably be mdsrate/mdsrate-create-small.sh, but it needs
to be fixed to support all of these operations, not only create.
If so:
*) TH could be specified through THREADS_PER_CLIENT=[1,2,4,8]
*) CL is specified through CLIENTS or NODES_TO_USE.
*) NOSINGLE should be provided
*) add --dirnum option to COMMAND
*) DIRNUM=[1,TH*CL], so we test a case when all the threads work in the
   same dir and when each works in a separate one.
*) nfiles is files-per-dir * DIRNUM
*) add an ability to remove in readdir order to mdsrate test and its
   script.
[readdir or _create_ order; common or separate dir]=3 (skip
readdir/common dir).
Note: we should probably limit the amount of files in 1 directory to 2M,
otherwise the performance will definitely downgrade.

Test matrix(TESTxTHxCLxCPUxMDSxOSSxDISKxJOUR):
1. Multi-thread test. (mknod)
Run TESTs on CL=1 with different amount of threads.
TESTxTH=3x4-2=10 tests

2. Multi-client test (mknod)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=3x1x4=12 tests.

3. OSS (create)
3.1. Let's check how multi-client system scales (TH=OSS).
3.2. Let's check how large load system scales (TH=8)
As the difference from (1,2) is on the OSS part only, it is enough to
test in separate directories only. Stripeness is [1, -1]. TEST=4.
TESTxCLxTHxOSS=[4x2x2]x2 + [4x2x2]x1(1OSS case)=48

4. Network test
Having IB results in hand after (1,2,3) and these results from NETT1, we
already see how mdc+mdt-stack+ptlrpc changes the behavior. There is no
reason to repeat them for GiGe, it seems.

5. DISK test. (mknod)
Unlike the OST testing, we do not have an echo-md client (MDTT1), thus
we have not checked how different disks impact the performance, so we
need to check it here. As different disks are of a completely different
nature, we need to repeat most of (1,2) here
[TH=1,8]: TESTxCLxTHxDISK=(3x2x2-2)x1=10

6. JOURNAL test. (mknod)
Repeat (5) for different journals, but limit the test to the raid disk
only.
TESTxCLxTHxDISKxJOUR=(3x2x2-2)x2=20

7. CPU test (mknod)
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel, so run it for CL=8 only
TESTxCLxTHxCPU=3x1x4x(3-1)=24

8. CMD test. (mkdir)
8.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDS scales
     (CL=MDS).
8.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
These tests happen in a separate directory (for each thread) only; it is
enough to test with the nid creation policy only. TEST=2.
TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 32 tests for mkdir, 76 for mknod, 48 for create.

4. find (not ready)

**** MT. Mixed testing. ****

MT1. Create-write test.

RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1)
MDS: 1,2,4 (default=1)
OSS: 1
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop, writes 1
bulk to each and closes it.
*) it is enough to test with a small bulk only: [1k]
*) [common or separate dir]=2 tests;

Test matrix(TESTxTHxCLxCPUxMDSxDISK):
1. Multi-thread test.
Run TESTs on CL=1 with different amount of threads.
TESTxTH=2x4-1=7 tests (not 8, because if TH=1 it is always in 1 dir, and
this is already covered).

2. Multi-client test
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
TESTxCLxTH=2x1x4=8 tests.

3. DISK test.
Check how different disks impact the performance. As different disks are
of a completely different nature, we need to repeat most of (1,2) here
[TH=1,8]: TESTxCLxTHxDISK=(2x2x2-1)x1=7

4. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel, so run it for CL=max only:
[CL=8,TH=1,2,4,8]. TESTxCLxTHxCPU=2x1x4x(4-1)=24

5. CMD test.
5.1. Let's check how N threads per 1 MDS vs. 1 thread per N MDS scales
     (CL=MDS).
5.2. Let's check how the system scales with many clients and threads
     (CL=8)
Note: to be more demonstrative, the maximum amount of threads could be
taken <8, if TH=8 reaches the maximum network throughput with a small
amount of clients.
These tests happen in a separate directory (for each thread) only,
creation policy=[nid,name]. TEST=2.
TESTxCLxTHxMDS=2x2x4x2=32 tests.

Total: 78 tests.

MT2. Create-Readdir test.

RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1,8 (default=1) (1 extra client does "ls -U")
MDS: 1,2,4 (default=1)
OSS: 1
NET: IB
OSTN: 1
TH: 1,2,4,8 (default=1)
F: debug
TEST: must be a new one. Each thread creates files in a loop and
immediately closes them. 1 thread on another client does "ls -U". It is
done in 1 directory.

The test matrix is exactly the same as for MT1.

Total: 78 tests.

MT3. untar a kernel.
MT4. pmake (compile a kernel).

RAM: fixed
CPU(MDS): 1,8,32 (default=32)
DISK(MDS): ramdisk, raid (default=ramdisk)
DISK(OST): raid
JOUR: ext
CL: 1
MDS: 1,2,4 (default=1)
OSS: 1,2,4 (default=1)
NET: IB
OSTN: 1
TH: 1
F: debug
TEST: a new one.

Test matrix(TESTxCPUxMDSxOSSxDISK):
1. DISK test.
Check how different disks impact the performance.
TESTxDISK=1

2. CPU test
Note: lnet fixes from Liang to be applied here.
Run TESTs on different amount of CPU. It is mostly interesting to look
at a large amount of threads, as we are going to benefit from handling
them in parallel, so run it for CL=max only:
TESTxCPU=1x(3-1)=2

3. CMD test.
Creation policy=name.
TESTxMDS=1x2=2 tests.

4. OSS
As most of the files are small, stripeness does not play any role (=1)
TESTxOSS=1x(3-1)=2.
Total: 7 tests.

MT5. ??? Some more tests ????

**** Goal2. Compare HEAD and b1_6 (b1_8) performance. ****

This paragraph describes the testing methodology in the reverse order of
testing, i.e. in the top-bottom direction, making sure the new LustreFS
(HEAD) version does not downgrade compared with the previous ones
(b1_6/b1_8).

Therefore, the first testing cycle includes:
1) MT, MDST3, OSTT3, NETT1 from the above tests.
2) no CMD tests

In case a downgrade is detected, lower layer tests are to be run until
the downgrade disappears.

**** Goal3. CMD testing. ****

MT, MDST3 tests, their CMD sections.

**** Goal4. Quick weekly MD performance test. ****

1) It covers tests described in MT,MDST sections.
2) MDST: No CPU,OSS,OSTN tests
3) MT: no MT1,MT2 tests
4) Only 1 node configuration:
   MDS on RAID1+0 with write-back cache
   OSS on RAID6 with write-through cache
   JOUR: external for both servers;
5) Only 1 network: IB;
6) Minimal amount of cluster configurations:
   MDS=1; OST=1; [CL,TH]=[1,1],[1,8],[8,8];

MDST1.1: perform only create (not mkdir,mknod) for [common or separate
dir]=2.
1. Multi-thread test. TESTxTH=2x2-1=3 tests
2. Multi-client test. TESTxCLxTH=2x1x1=2 tests.
Total: 5 tests.

MDST1.2. stat for [common or separate dir; readdir order]=2 tests.
1. Multi-thread test. TESTxTH=2x2-1=3 tests
2. Multi-client test. TESTxCLxTH=2x1x1=2 tests.
Total: 5 tests.

MDST1.3 unlink for [readdir or _create_ order; common or separate dir]=3
(skip readdir/common dir). All tests are done against create (not
mkdir,mknod).
1. Multi-thread test. TESTxTH=3x2-2=4 tests
2. Multi-client test (mknod) TESTxCLxTH=3x1x1=3 tests.
Total: 7 tests.

MT3. untar a kernel.
MT4. pmake (compile a kernel).
Total: 1 tests.

Total: 19 tests.

--
Vitaly

^ permalink raw reply [flat|nested] 16+ messages in thread
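The "2x4-1=7" style counts in the MDST3 matrices follow from the DIRNUM=[1,TH*CL] rule: when TH*CL=1 the common-dir and separate-dir cases coincide. A minimal sketch of that enumeration, for the create case and the reduced Goal4 set, is given below. The helper name `md_cases` is hypothetical and is not part of mdsrate or its scripts; it only reproduces the counting.

```python
from itertools import product


def md_cases(clients, threads):
    """Enumerate (CL, TH, DIRNUM) cases with DIRNUM in {1, TH*CL}.

    Using a set collapses the two directory modes into one case when
    TH*CL == 1, which is the "-1" in the plan's formulas.
    """
    cases = []
    for cl, th in product(clients, threads):
        for dirnum in sorted({1, cl * th}):
            cases.append((cl, th, dirnum))
    return cases


# MDST3 create: multi-thread (CL=1) and multi-client (CL=8) steps.
multi_thread = md_cases([1], [1, 2, 4, 8])
multi_client = md_cases([8], [1, 2, 4, 8])

# Goal4 weekly run: [CL,TH]=[1,1],[1,8],[8,8].
weekly = [case
          for cl, th in [(1, 1), (1, 8), (8, 8)]
          for case in md_cases([cl], [th])]

print(len(multi_thread), len(multi_client), len(weekly))  # 7 8 5
```

Each tuple could then be turned into one mdsrate invocation once the --dirnum option mentioned in the plan exists.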
* [Lustre-devel] LustreFS performance (update) 2009-03-19 19:34 ` [Lustre-devel] LustreFS performance (update) Vitaly Fertman @ 2009-03-19 20:16 ` Andrew C. Uselton 2009-03-20 13:15 ` Vitaly Fertman 2009-03-20 5:47 ` parinay kondekar 1 sibling, 1 reply; 16+ messages in thread From: Andrew C. Uselton @ 2009-03-19 20:16 UTC (permalink / raw) To: lustre-devel Howdy Vitaly, I like this. It is quite comprehensive and detailed. I'd like to offer a few constructive criticisms in hope that you will better achieve your goals. Mostly I'll stick them in-line where they seem relevant, but I'll start with: 1) Your write up is quite dense and terse. I could follow the overall structure, but found it pretty tough going to understand any specific detail. It really helps to work with someone who will write up the same information, but in a form with whole sentences and a minimum of acronyms or special symbols. Define the acronyms you do use in a clear way in one place that I can refer back to. Vitaly Fertman wrote: > **************************************************** > LustreFS benchmarking methodology. > **************************************************** > > The document aims to describe the benchmarking methodology which helps > to understand the LustreFS performance and reveal LustreFS bottlenecks > in > different configurations on different hardware, to ensure the next > LustreFS > release does not downgrade comparing with a previous one. In other > words: > Goal1. Understand the HEAD performance. > Goal2. Compare HEAD and b1_6 (b1_8) performance. > > To achieve the Goal1, the methodology suggests to test different > layers of > software in the bottom-top direction, i.e. the underlying back-end, > the target > server sitting on this back-end, the network connected to this target > and how > the target performs through this network, etc up to the whole cluster. I like this approach. 
My own efforts tend to be at-scale testing at the whole-cluster end of the range, often in the presence of other cluster activity. It is good to have the details of the underlying components documented. ... > Obviously, it is not possible to perform all the thousands of tests in > all the configurations, > running all the special purpose tests, etc, the document tries to > prepare: > 1) all the essential and sufficient tests to see how the system > performs in general; > 2) some minimal amount of essential tests to see how the system scales > in different > conditions. In some cases it's obvious, but in many it is not clear what exactly you mean to be testing. It is a good extension to your methodology to state clearly not only the mechanics of the test itself, but what you think you are testing with the given experiment. Spend a little time and describe what the system is under examination, how it responds or should respond to the proposed test, and what tunables and parameters you think might be relevant. For instance, if the test is supposed to saturate the target server, then how much I/O do you expect will be required and why? What timeout or other tunable may determine the observed saturation point. Your goal should be to have, not only a test, but a real expectation about its results even before you run the test. Once you have that expectation then you can evaluate the results. The bottom up approach helps with this, since you can use the performance of the individual pieces to help establish your expectation about the larger assemblies. ... > **** Hardware Requirements. **** > > The test plan implies that we change only 1 parameter (cpu or disk or > network) > on each step. The HW requirements are: > > -- at least 1 node with: > CPU:32; > RAM: enough to have a ramdisk for MDS; > DISK: enough disks for raid6 or raid1+0 (as this node could be mds > or ost); > an extra disk for external journal; > NET: both GiGe and IB installed. 
> -- at least 1 another node includes: > DISK: enough disks for raid6 or raid1+0 (as this node could be mds > or ost); > an extra disk for external journal; > -- besides that: 8 clients, 3 other servers. > -- the other servers include: > DISK: raid6 > NET: IB installed. > -- client includes: > NET: both GiGe and IB installed. > > **** Software requirements **** > You might provide links to these tests for those not familiar with them. > 1. Short term. > 1.1 mdsrate > to be completed to test all the operations listed in MDST3 (see below). > 1.2 mdsrate-**.sh > to be fixed/written to run mdsrate properly and test all the > operations listed in > MDST3 (see below). > 1.3. fake disk > implement FAIL flag and report 'done' without doing anything in > obdfilter to get > a low-latency disk. > 1.4. MT. > add more tests here and implement them. > > 2. Long term. > 2.1. mdtstack-survey > - an echo client-server is to be written for mds similar to ost. > - a test script similar to obdfilter-survey.sh is to be written. > > **** Different configurations **** > ... I'll cut it short here, but in general, I think you might be surprised that if you organize this document so that anyone else could come along behind you and perform all the same tests in the same way, you might get a lot of others doing these experiments along side you. That would make your job a lot easier and increase the likelihood that bugs and regressions would be caught quickly. > -- > Vitaly > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel Cheers, Andrew ^ permalink raw reply [flat|nested] 16+ messages in thread
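Andrew's point about fixing an expectation before running a test, derived from the measured pieces, could look something like the sketch below. It assumes the simplest possible model (an assembly cannot be faster than its slowest component) and a hypothetical 20% tolerance; the names `expected_ceiling` and `evaluate` and all numbers are placeholders for illustration, not measurements or Lustre tooling.

```python
def expected_ceiling(component_rates):
    """Upper bound for an assembly: the rate of its slowest component."""
    return min(component_rates.values())


def evaluate(measured, component_rates, tolerance=0.2):
    """Compare a measured rate with the expectation set beforehand.

    Returns the expected ceiling, the fractional shortfall below it, and
    whether the shortfall exceeds the (assumed) tolerance.
    """
    ceiling = expected_ceiling(component_rates)
    shortfall = (ceiling - measured) / ceiling
    return {
        "expected": ceiling,
        "shortfall": shortfall,
        "investigate": shortfall > tolerance,
    }


# Placeholder rates in MB/s for the lower-layer results of one OST path.
components = {"raw disk": 400.0, "network": 900.0}
print(evaluate(300.0, components))
```

A real expectation would also account for the tunables Andrew mentions (timeouts, thread counts and so on); this only shows where the lower-layer results enter.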
* [Lustre-devel] LustreFS performance (update)
  2009-03-19 20:16 ` Andrew C. Uselton
@ 2009-03-20 13:15 ` Vitaly Fertman
  0 siblings, 0 replies; 16+ messages in thread
From: Vitaly Fertman @ 2009-03-20 13:15 UTC
To: lustre-devel

Hi Andrew,

thanks for your feedback. Indeed, this still looks more like a raw test
list than a document ready for publishing, but this is continuous work
and I am still working on it, so I will try to address your
suggestions.

On Mar 19, 2009, at 11:16 PM, Andrew C. Uselton wrote:

> Howdy Vitaly,
> I like this. It is quite comprehensive and detailed. I'd like to
> offer a few constructive criticisms in hope that you will better
> achieve your goals. Mostly I'll stick them in-line where they seem
> relevant, but I'll start with:
> 1) Your write up is quite dense and terse. I could follow the
> overall structure, but found it pretty tough going to understand any
> specific detail. It really helps to work with someone who will
> write up the same information, but in a form with whole sentences
> and a minimum of acronyms or special symbols. Define the acronyms
> you do use in a clear way in one place that I can refer back to.
>
> Vitaly Fertman wrote:
>> ****************************************************
>> LustreFS benchmarking methodology.
>> ****************************************************
>> The document aims to describe the benchmarking methodology which
>> helps to understand the LustreFS performance and reveal LustreFS
>> bottlenecks in different configurations on different hardware, to
>> ensure the next LustreFS release does not downgrade comparing with
>> a previous one. In other words:
>> Goal1. Understand the HEAD performance.
>> Goal2. Compare HEAD and b1_6 (b1_8) performance.
>> To achieve the Goal1, the methodology suggests to test different
>> layers of software in the bottom-top direction, i.e. the underlying
>> back-end, the target server sitting on this back-end, the network
>> connected to this target and how the target performs through this
>> network, etc up to the whole cluster.
>
> I like this approach. My own efforts tend to be at-scale testing at
> the whole-cluster end of the range, often in the presence of other
> cluster activity. It is good to have the details of the underlying
> components documented.
>
> ...
>> Obviously, it is not possible to perform all the thousands of tests
>> in all the configurations, running all the special purpose tests,
>> etc, the document tries to prepare:
>> 1) all the essential and sufficient tests to see how the system
>>    performs in general;
>> 2) some minimal amount of essential tests to see how the system
>>    scales in different conditions.
>
> In some cases it's obvious, but in many it is not clear what exactly
> you mean to be testing. It is a good extension to your methodology
> to state clearly not only the mechanics of the test itself, but what
> you think you are testing with the given experiment. Spend a little
> time and describe what system is under examination, how it
> responds or should respond to the proposed test, and what tunables
> and parameters you think might be relevant. For instance, if the
> test is supposed to saturate the target server, then how much I/O do
> you expect will be required and why? What timeout or other tunable
> may determine the observed saturation point? Your goal should be to
> have not only a test, but a real expectation about its results even
> before you run the test. Once you have that expectation then you
> can evaluate the results. The bottom-up approach helps with this,
> since you can use the performance of the individual pieces to help
> establish your expectation about the larger assemblies.
>
> ...
>> **** Hardware Requirements. ****
>> The test plan implies that we change only 1 parameter (cpu or disk
>> or network) on each step. The HW requirements are:
>> -- at least 1 node with:
>>    CPU: 32;
>>    RAM: enough to have a ramdisk for MDS;
>>    DISK: enough disks for raid6 or raid1+0 (as this node could be
>>          mds or ost); an extra disk for external journal;
>>    NET: both GiGe and IB installed.
>> -- at least 1 another node includes:
>>    DISK: enough disks for raid6 or raid1+0 (as this node could be
>>          mds or ost); an extra disk for external journal;
>> -- besides that: 8 clients, 3 other servers.
>> -- the other servers include:
>>    DISK: raid6
>>    NET: IB installed.
>> -- client includes:
>>    NET: both GiGe and IB installed.
>> **** Software requirements ****
>
> You might provide links to these tests for those not familiar with
> them.
>
>> 1. Short term.
>> 1.1 mdsrate
>>     to be completed to test all the operations listed in MDST3 (see
>>     below).
>> 1.2 mdsrate-**.sh
>>     to be fixed/written to run mdsrate properly and test all the
>>     operations listed in MDST3 (see below).
>> 1.3. fake disk
>>     implement FAIL flag and report 'done' without doing anything in
>>     obdfilter to get a low-latency disk.
>> 1.4. MT.
>>     add more tests here and implement them.
>> 2. Long term.
>> 2.1. mdtstack-survey
>>     - an echo client-server is to be written for mds similar to ost.
>>     - a test script similar to obdfilter-survey.sh is to be written.
>> **** Different configurations ****
>
> ...
>
> I'll cut it short here, but in general, I think you might be
> surprised that if you organize this document so that anyone else
> could come along behind you and perform all the same tests in the
> same way, you might get a lot of others doing these experiments
> alongside you. That would make your job a lot easier and increase
> the likelihood that bugs and regressions would be caught quickly.
>
>> --
>> Vitaly
>
> Cheers,
> Andrew

--
Vitaly
* [Lustre-devel] LustreFS performance (update)
  2009-03-19 19:34 ` [Lustre-devel] LustreFS performance (update) Vitaly Fertman
  2009-03-19 20:16 ` Andrew C. Uselton
@ 2009-03-20  5:47 ` parinay kondekar
  2009-03-24  0:34 ` Eric Barton
  1 sibling, 1 reply; 16+ messages in thread
From: parinay kondekar @ 2009-03-20 5:47 UTC
To: lustre-devel

The wiki :: https://wikis.clusterfs.com/intra/index.php/LustreFS_performance

~p

Vitaly Fertman wrote:
> ****************************************************
> LustreFS benchmarking methodology.
> ****************************************************
* [Lustre-devel] LustreFS performance (update)
  2009-03-20  5:47 ` parinay kondekar
@ 2009-03-24  0:34 ` Eric Barton
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Barton @ 2009-03-24 0:34 UTC
To: lustre-devel

Vitaly,

I've been following this thread with great interest and I'd like to
chat with you about this and also the MDS performance regression tests.
Unfortunately, I'm unlikely to be able to do that this week and it will
probably have to wait until I'm back in the UK next week.

In the meantime...

1. Have you got a rough idea how much work it would be to write the
   software that could exercise the MDD directly? I'd just like to know
   if we're talking days or weeks or months - we need to know that
   before we decide whether to do it.

2. I think Andrew Uselton's comments are helpful. We cannot afford
   routinely to sample the whole performance space - there are just too
   many dimensions. So we need to develop a performance model that
   allows us to restrict the number of measurements we need to be
   confident that there are no surprises "in between" the points we
   have sampled.

   That means we have to start running tests as soon as possible over
   as wide a parameter range as possible, with as much hardware as
   possible. Then we'll start to get a feel for how much variability
   there is all over the space and where the "edges" and asymptotes
   are.

3. It's worthwhile taking time to analyse and present results with
   care. I've attached a spreadsheet that compares ping performance of
   a single 8-core server with varying numbers of clients and client
   threads, measured using different LNET locking schemes - hp (HEAD
   ping), 2lp (HEAD modified to split the LNET global lock into 2) and
   3lp (same, but splitting the LNET global lock into 3).

   The lower row of graphs shows ping throughput versus number of
   client nodes, with different numbers of threads per node in each
   series. The upper row of graphs shows the same ping throughput, but
   plotted against client threads totalled over all nodes, with
   different numbers of nodes in each series.

   Please note....

   a) Set axis scaling correctly so that visual comparison is accurate.

   b) The upper row of graphs shows that it's the total number of
      threads exercising the server that's most important - and that
      how those threads are distributed over client nodes seems to
      matter most when there are 8 of them. That's absolutely _not_
      obvious from looking at the lower row of graphs.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org
> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of parinay kondekar
> Sent: 19 March 2009 10:47 PM
> To: Vitaly Fertman
> Cc: lustre-2.0-performance at sun.com; minh diep; Lustre Development Mailing List
> Subject: Re: [Lustre-devel] LustreFS performance (update)
>
> The wiki :: https://wikis.clusterfs.com/intra/index.php/LustreFS_performance
>
> ~p
>
> Vitaly Fertman wrote:
> > ****************************************************
> > LustreFS benchmarking methodology.
> > ****************************************************

-------------- next part --------------
A non-text attachment was scrubbed...
Name: example graphs.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 51672 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090323/d4bac167/attachment.ods>
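Eric's two rows of graphs are two views of the same measurements, keyed
differently. A rough sketch of that regrouping is given here, assuming
the results are stored per (client nodes, threads per node) pair. The
throughput numbers are entirely made up, since the spreadsheet data is
not reproduced in the thread, and the function names are hypothetical.

```python
from collections import defaultdict

# (client_nodes, threads_per_node) -> pings per second.
# Made-up numbers for illustration only.
samples = {
    (1, 1): 10_000, (1, 2): 19_000, (1, 4): 35_000,
    (2, 1): 19_500, (2, 2): 36_000, (2, 4): 60_000,
    (4, 1): 35_500, (4, 2): 61_000, (4, 4): 80_000,
}


def by_client_nodes(samples):
    """Lower-row view: x = client nodes, one series per threads-per-node."""
    series = defaultdict(list)
    for (nodes, threads), rate in sorted(samples.items()):
        series[threads].append((nodes, rate))
    return dict(series)


def by_total_threads(samples):
    """Upper-row view: x = nodes * threads, one series per node count."""
    series = defaultdict(list)
    for (nodes, threads), rate in sorted(samples.items()):
        series[nodes].append((nodes * threads, rate))
    return dict(series)


def shared_axis_max(samples):
    """A single y-axis maximum to apply to every panel being compared."""
    return max(samples.values())


for nodes, points in by_total_threads(samples).items():
    print(f"{nodes} node(s): {points}")
print("common y-axis max:", shared_axis_max(samples))
```

Plotting every panel against the same `shared_axis_max` value is one
simple way to follow note (a); the second view makes it easy to see
whether series for different node counts overlap at equal total thread
counts, which is the comparison note (b) relies on.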
end of thread, other threads:[~2009-03-24 0:34 UTC]

Thread overview: 16+ messages
     [not found] <3376C558-E29A-4BB5-8C4C-3E8F4537A195@sun.com>
     [not found] ` <02FEAA2B-8D98-4C2D-9CE8-FF6E1EB135A2@sun.com>
2009-03-02 17:04 ` [Lustre-devel] LustreFS performance Vitaly Fertman
2009-03-02 20:45 ` Andreas Dilger
2009-03-04 17:19 ` Oleg Drokin
2009-03-04 17:28 ` Jeff Darcy
2009-03-05 21:27 ` Andreas Dilger
2009-03-09  2:50 ` Oleg Drokin
2009-03-09  8:29 ` Andreas Dilger
2009-03-10 14:39 ` Nicholas Henke
2009-03-10 15:07 ` Mark Seger
2009-03-10 11:55 ` Mallik Ragampudi
2009-03-10 16:40 ` Vitaly Fertman
2009-03-19 19:34 ` [Lustre-devel] LustreFS performance (update) Vitaly Fertman
2009-03-19 20:16 ` Andrew C. Uselton
2009-03-20 13:15 ` Vitaly Fertman
2009-03-20  5:47 ` parinay kondekar
2009-03-24  0:34 ` Eric Barton