* performance of raid5 on fast devices
@ 2017-01-17 2:35 Jake Yao
2017-01-17 3:10 ` Stan Hoeppner
2017-01-17 5:10 ` Roman Mamedov
0 siblings, 2 replies; 11+ messages in thread
From: Jake Yao @ 2017-01-17 2:35 UTC (permalink / raw)
To: linux-raid
I have a raid5 array on 4 NVMe drives, and the performance of the
array is only marginally better than that of a single drive. By
contrast, a similar raid5 array on 4 SAS SSDs or HDDs performs 3x
better than a single drive, which is expected.
It looks like the array performance hits its peak when the single
kernel thread associated with the raid device is running at 100%.
This can happen easily with fast devices like NVMe.
This can be reproduced by creating a raid5 array from 4 ramdisks as
well, and comparing the performance of the array with that of one
ramdisk. Sometimes the array performs worse than a single ramdisk.
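A minimal sketch of the ramdisk reproduction, for illustration only. The function name, the /dev/md0 array name, the 1 GiB ramdisk size and the 32K chunk are assumptions, not details from this message. The function only prints the commands so they can be reviewed before being run as root; it assumes the brd ramdisk module with its rd_nr and rd_size (KiB) parameters.

```shell
#!/bin/sh
# Print (do not run) the commands for a raid5 test array built on ramdisks.
# Assumed for illustration: /dev/md0 as the array name and a 32K chunk.
ramdisk_raid5_cmds() {
    n=${1:-4}                # number of ramdisks / raid members
    size_kib=${2:-1048576}   # size of each ramdisk in KiB (1 GiB by default)
    devs=""
    i=0
    while [ "$i" -lt "$n" ]; do
        devs="$devs /dev/ram$i"
        i=$((i + 1))
    done
    # brd creates /dev/ram0 .. /dev/ram(n-1)
    echo "modprobe brd rd_nr=$n rd_size=$size_kib"
    echo "mdadm --create /dev/md0 --level=5 --raid-devices=$n --chunk=32$devs"
}

ramdisk_raid5_cmds 4 1048576
```

The same fio job can then be pointed at /dev/md0 and at a single /dev/ramN to compare the two.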
The kernel version is 4.9.0-rc3 and mdadm is release 3.4; no write
journal is configured.
Is this a known issue?
Please cc me on replies, as I am not subscribed to the mailing list.
Thanks!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: performance of raid5 on fast devices
2017-01-17 2:35 performance of raid5 on fast devices Jake Yao
@ 2017-01-17 3:10 ` Stan Hoeppner
2017-01-17 5:04 ` Coly Li
2017-01-17 5:10 ` Roman Mamedov
1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2017-01-17 3:10 UTC (permalink / raw)
To: Jake Yao, linux-raid
On 01/16/2017 08:35 PM, Jake Yao wrote:
> I have a raid5 array on 4 NVMe drives, and the performance on the
> array is only marginally better than a single drive. Unlike a similar
> raid5 array on 4 SAS SSD or HDD, the performance on array is 3x
> better than a single drive, which is expected.
>
> It looks like when the single kernel thread associated with the raid
> device running at 100%, the array performance hit its peak. This can
> happen easily for fast devices like NVMe.
The md raid personalities are limited to a single kernel write thread.
Work is in progress to alleviate this bottleneck by using multiple write
threads. When it will hit mainline I don't know.
> This can reproduced by creating a raid5 with 4 ramdisks as well, and
> comparing performance on the array and one ramdisk. Sometimes the
> performance on the array is worse than a single ramdisk.
>
> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
> journal is configured.
>
> Is this a known issue?
>
> Please cc me on the email as I am not on the mail list.
>
> Thanks!
* Re: performance of raid5 on fast devices
2017-01-17 3:10 ` Stan Hoeppner
@ 2017-01-17 5:04 ` Coly Li
2017-01-17 15:22 ` Jake Yao
0 siblings, 1 reply; 11+ messages in thread
From: Coly Li @ 2017-01-17 5:04 UTC (permalink / raw)
To: Jake Yao; +Cc: Stan Hoeppner, linux-raid
On 2017/1/17 11:10 AM, Stan Hoeppner wrote:
> On 01/16/2017 08:35 PM, Jake Yao wrote:
>> I have a raid5 array on 4 NVMe drives, and the performance on the
>> array is only marginally better than a single drive. Unlike a similar
>> raid5 array on 4 SAS SSD or HDD, the performance on array is 3x
>> better than a single drive, which is expected.
>>
>> It looks like when the single kernel thread associated with the raid
>> device running at 100%, the array performance hit its peak. This can
>> happen easily for fast devices like NVMe.
> The md raid personalities are limited to a single kernel write thread.
> Work is in progress to alleviate this bottleneck by using multiple write
> threads. When it will hit mainline I don't know.
If you want 8 write threads, and your md raid5 device is /dev/md0, you
can try:
echo 8 > /sys/block/md0/md/group_thread_cnt
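As a sketch of the same setting with a check around it: the helper below reads the current value, writes the new one, and reports both. The function name and the optional third argument (an alternate sysfs root, useful for a dry run against a scratch directory) are illustrative assumptions; only the group_thread_cnt attribute path comes from the command above.

```shell
#!/bin/sh
# set_group_threads MD COUNT [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys/block; the override exists only for dry runs.
set_group_threads() {
    md=$1
    count=$2
    root=${3:-/sys/block}
    attr="$root/$md/md/group_thread_cnt"
    if [ ! -w "$attr" ]; then
        # Attribute missing or not writable (wrong device, or not root).
        echo "cannot write $attr" >&2
        return 1
    fi
    old=$(cat "$attr")
    echo "$count" > "$attr"
    echo "$md: group_thread_cnt $old -> $(cat "$attr")"
}
```

Run as root, `set_group_threads md0 8` would apply the value suggested above.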
>
>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>> comparing performance on the array and one ramdisk. Sometimes the
>> performance on the array is worse than a single ramdisk.
>>
>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>> journal is configured.
>>
>> Is this a known issue?
It was, but you are on a 4.9 kernel, so group_thread_cnt should work for you.
Coly
* Re: performance of raid5 on fast devices
2017-01-17 2:35 performance of raid5 on fast devices Jake Yao
2017-01-17 3:10 ` Stan Hoeppner
@ 2017-01-17 5:10 ` Roman Mamedov
2017-01-17 15:28 ` Jake Yao
1 sibling, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2017-01-17 5:10 UTC (permalink / raw)
To: Jake Yao; +Cc: linux-raid
On Mon, 16 Jan 2017 21:35:21 -0500
Jake Yao <jgyao1@gmail.com> wrote:
> I have a raid5 array on 4 NVMe drives, and the performance on the
> array is only marginally better than a single drive. Unlike a similar
> raid5 array on 4 SAS SSD or HDD, the performance on array is 3x
> better than a single drive, which is expected.
>
> It looks like when the single kernel thread associated with the raid
> device running at 100%, the array performance hit its peak. This can
> happen easily for fast devices like NVMe.
>
> This can reproduced by creating a raid5 with 4 ramdisks as well, and
> comparing performance on the array and one ramdisk. Sometimes the
> performance on the array is worse than a single ramdisk.
>
> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
> journal is configured.
>
> Is this a known issue?
How do you measure the performance?
Sure it may be CPU-bound in the end, but also why not try the usual
optimization tricks, such as:
* increase your stripe_cache_size, it's not uncommon that this can speed up
linear writes by as much as several times;
* if you meant reads, you could look into read-ahead settings for the array;
* and in both cases, try experimenting with different stripe sizes (if you
were using 512K, try with 64K stripes).
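When raising stripe_cache_size, it helps to know what it costs. The kernel md documentation gives the memory consumed as page size * number of disks * stripe_cache_size. A small sketch of that arithmetic follows; the function name and the 4096-byte default page size are assumptions for illustration.

```shell
#!/bin/sh
# stripe_cache_bytes ENTRIES NR_DISKS [PAGE_SIZE]
# memory = page_size * nr_disks * stripe_cache_size
stripe_cache_bytes() {
    entries=$1
    disks=$2
    page=${3:-4096}   # assumed page size; check with `getconf PAGESIZE`
    echo $((entries * disks * page))
}

# Cost of a few stripe_cache_size values on a 4-disk array, in MiB.
for s in 256 4096 32768; do
    echo "stripe_cache_size=$s -> $(( $(stripe_cache_bytes "$s" 4) / 1048576 )) MiB"
done
```

For reads, the array's read-ahead can be inspected with `blockdev --getra /dev/md0` and changed with `blockdev --setra`, where /dev/md0 stands in for the actual array.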
--
With respect,
Roman
* Re: performance of raid5 on fast devices
2017-01-17 5:04 ` Coly Li
@ 2017-01-17 15:22 ` Jake Yao
0 siblings, 0 replies; 11+ messages in thread
From: Jake Yao @ 2017-01-17 15:22 UTC (permalink / raw)
To: Coly Li; +Cc: Stan Hoeppner, linux-raid
Thanks for the response.
Increasing group_thread_cnt helps a little, but not to the extent of
the expected 3x. It looks like the single kernel thread is still the
bottleneck.
* Re: performance of raid5 on fast devices
2017-01-17 5:10 ` Roman Mamedov
@ 2017-01-17 15:28 ` Jake Yao
2017-01-17 21:04 ` Heinz Mauelshagen
0 siblings, 1 reply; 11+ messages in thread
From: Jake Yao @ 2017-01-17 15:28 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-raid
Thanks for the response.
I am using fio for performance measurement.
The chunk size of the raid5 array is 32K, and the block size in fio is
set to 96K (3x the chunk size), which is also the optimal_io_size. The
ioengine is set to libaio with direct IO.
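The 96K figure is the full data stripe: with raid5, one chunk per stripe holds parity, so the data width is the chunk size times (members - 1). A short sketch of that arithmetic, with illustrative function names that are not part of any tool:

```shell
#!/bin/sh
# full_stripe_kib CHUNK_KIB NR_DISKS
# raid5 data width per stripe: one chunk of each stripe is parity.
full_stripe_kib() {
    chunk=$1
    disks=$2
    echo $((chunk * (disks - 1)))
}

# is_stripe_aligned BS_KIB CHUNK_KIB NR_DISKS
# Succeeds when the I/O size is a whole multiple of the full data stripe.
is_stripe_aligned() {
    width=$(full_stripe_kib "$2" "$3")
    [ $(($1 % width)) -eq 0 ]
}

echo "full stripe: $(full_stripe_kib 32 4)K"
if is_stripe_aligned 96 32 4; then
    echo "bs=96K is a whole number of stripes"
fi
```

The kernel reports the same value in bytes in the array's queue/optimal_io_size sysfs attribute.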
Increasing stripe_cache_size does not help much, and it looks like
writes are limited by the single kernel thread, as mentioned earlier.
* Re: performance of raid5 on fast devices
2017-01-17 15:28 ` Jake Yao
@ 2017-01-17 21:04 ` Heinz Mauelshagen
2017-01-18 19:25 ` Jake Yao
0 siblings, 1 reply; 11+ messages in thread
From: Heinz Mauelshagen @ 2017-01-17 21:04 UTC (permalink / raw)
To: Jake Yao, Roman Mamedov; +Cc: linux-raid
Jake et al,
I took the opportunity to measure raid5 on 4x NVMe here, varying
group_thread_cnt={0..10} and the minimum
stripe_cache_size={256,512,1024,2048,4096,8192,16384,32768}.
This is on an X-99 with an Intel E5-2640 and kernel 4.9.3-200.fc25.x86_64.
The highest active stripe count logged was below 17K.
fio job/sections used:
----------------------------
[r-md0]
ioengine=libaio
iodepth=40
rw=read
bs=4096K
direct=1
size=4G
numjobs=8
filename=/dev/md0
[w-md0]
ioengine=libaio
iodepth=40
rw=write
bs=4096K
direct=1
size=4G
numjobs=8
filename=/dev/md0
Baseline performance seen with raid0:
---------------------------------------------------
md0 : active raid0 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
33521664 blocks super 1.2 32k chunks
READ: io=32768MB, aggrb=8202.3MB/s, minb=1025.3MB/s, maxb=1217.7MB/s,
mint=3364msec, maxt=3995msec
WRITE: io=32768MB, aggrb=5746.8MB/s, minb=735584KB/s, maxb=836685KB/s,
mint=5013msec, maxt=5702msec
Performance with raid5:
--------------------------------
md0 : active raid5 dm-350[3] dm-349[2] dm-348[1] dm-347[0]
25141248 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4]
[UUUU]
READ: io=32768MB, aggrb=7375.3MB/s, minb=944025KB/s, maxb=1001.1MB/s,
mint=4088msec, maxt=4443msec
Write results for group_thread_cnt/stripe_cache_size variations:
------------------------------------------------------------------------------------
0/256 -> WRITE: io=32768MB, aggrb=1296.4MB/s, minb=165927KB/s,
maxb=167644KB/s, mint=25019msec, maxt=25278msec
1/256 -> WRITE: io=32768MB, aggrb=2152.6MB/s, minb=275524KB/s,
maxb=278654KB/s, mint=15052msec, maxt=15223msec
2/256 -> WRITE: io=32768MB, aggrb=3177.4MB/s, minb=406700KB/s,
maxb=415854KB/s, mint=10086msec, maxt=10313msec
3/256 -> WRITE: io=32768MB, aggrb=4026.6MB/s, minb=515397KB/s,
maxb=524222KB/s, mint=8001msec, maxt=8138msec
4/256 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
maxb=552609KB/s, mint=7590msec, maxt=7854msec *
5/256 -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s,
maxb=547845KB/s, mint=7656msec, maxt=7864msec
6/256 -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s,
maxb=556126KB/s, mint=7542msec, maxt=7822msec
7/256 -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s,
maxb=560810KB/s, mint=7479msec, maxt=7816msec
8/256 -> WRITE: io=32768MB, aggrb=4185.2MB/s, minb=535807KB/s,
maxb=562389KB/s, mint=7458msec, maxt=7828msec
9/256 -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s,
maxb=577966KB/s, mint=7257msec, maxt=7815msec
10/256 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s,
maxb=568256KB/s, mint=7381msec, maxt=7835msec
0/512 -> WRITE: io=32768MB, aggrb=1297.8MB/s, minb=166025KB/s,
maxb=167664KB/s, mint=25016msec, maxt=25263msec
1/512 -> WRITE: io=32768MB, aggrb=2148.5MB/s, minb=275000KB/s,
maxb=278044KB/s, mint=15085msec, maxt=15252msec
2/512 -> WRITE: io=32768MB, aggrb=3158.4MB/s, minb=404270KB/s,
maxb=411407KB/s, mint=10195msec, maxt=10375msec
3/512 -> WRITE: io=32768MB, aggrb=4102.7MB/s, minb=525141KB/s,
maxb=539738KB/s, mint=7771msec, maxt=7987msec
4/512 -> WRITE: io=32768MB, aggrb=4162.8MB/s, minb=532745KB/s,
maxb=541759KB/s, mint=7742msec, maxt=7873msec *
5/512 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s,
maxb=549856KB/s, mint=7628msec, maxt=7842msec
6/512 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s,
maxb=562314KB/s, mint=7459msec, maxt=7863msec
7/512 -> WRITE: io=32768MB, aggrb=4192.1MB/s, minb=536699KB/s,
maxb=566338KB/s, mint=7406msec, maxt=7815msec
8/512 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s,
maxb=558644KB/s, mint=7508msec, maxt=7821msec
9/512 -> WRITE: io=32768MB, aggrb=4165.8MB/s, minb=533219KB/s,
maxb=559837KB/s, mint=7492msec, maxt=7866msec
10/512 -> WRITE: io=32768MB, aggrb=4177.2MB/s, minb=534783KB/s,
maxb=570188KB/s, mint=7356msec, maxt=7843msec
0/1024 -> WRITE: io=32768MB, aggrb=1288.6MB/s, minb=164935KB/s,
maxb=166877KB/s, mint=25134msec, maxt=25430msec
1/1024 -> WRITE: io=32768MB, aggrb=2218.5MB/s, minb=283955KB/s,
maxb=289842KB/s, mint=14471msec, maxt=14771msec
2/1024 -> WRITE: io=32768MB, aggrb=3186.1MB/s, minb=407926KB/s,
maxb=420903KB/s, mint=9965msec, maxt=10282msec
3/1024 -> WRITE: io=32768MB, aggrb=4107.4MB/s, minb=525733KB/s,
maxb=538836KB/s, mint=7784msec, maxt=7978msec
4/1024 -> WRITE: io=32768MB, aggrb=4146.9MB/s, minb=530790KB/s,
maxb=550505KB/s, mint=7619msec, maxt=7902msec
5/1024 -> WRITE: io=32768MB, aggrb=4160.5MB/s, minb=532542KB/s,
maxb=550795KB/s, mint=7615msec, maxt=7876msec *
6/1024 -> WRITE: io=32768MB, aggrb=4174.3MB/s, minb=534306KB/s,
maxb=558942KB/s, mint=7504msec, maxt=7850msec
7/1024 -> WRITE: io=32768MB, aggrb=4189.8MB/s, minb=536287KB/s,
maxb=556864KB/s, mint=7532msec, maxt=7821msec
8/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
maxb=561035KB/s, mint=7476msec, maxt=7824msec
9/1024 -> WRITE: io=32768MB, aggrb=4167.4MB/s, minb=533422KB/s,
maxb=567872KB/s, mint=7386msec, maxt=7863msec
10/1024 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
maxb=569878KB/s, mint=7360msec, maxt=7824msec
0/2048 -> WRITE: io=32768MB, aggrb=1265.7MB/s, minb=162004KB/s,
maxb=166111KB/s, mint=25250msec, maxt=25890msec
1/2048 -> WRITE: io=32768MB, aggrb=2239.5MB/s, minb=286652KB/s,
maxb=290846KB/s, mint=14421msec, maxt=14632msec
2/2048 -> WRITE: io=32768MB, aggrb=3184.5MB/s, minb=407609KB/s,
maxb=413150KB/s, mint=10152msec, maxt=10290msec
3/2048 -> WRITE: io=32768MB, aggrb=4213.5MB/s, minb=539321KB/s,
maxb=557901KB/s, mint=7518msec, maxt=7777msec *
4/2048 -> WRITE: io=32768MB, aggrb=4168.5MB/s, minb=533558KB/s,
maxb=543162KB/s, mint=7722msec, maxt=7861msec
5/2048 -> WRITE: io=32768MB, aggrb=4185.5MB/s, minb=535739KB/s,
maxb=549352KB/s, mint=7635msec, maxt=7829msec
6/2048 -> WRITE: io=32768MB, aggrb=4181.8MB/s, minb=535260KB/s,
maxb=553338KB/s, mint=7580msec, maxt=7836msec
7/2048 -> WRITE: io=32768MB, aggrb=4215.7MB/s, minb=539599KB/s,
maxb=566109KB/s, mint=7409msec, maxt=7773msec
8/2048 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s,
maxb=568102KB/s, mint=7383msec, maxt=7801msec
9/2048 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s,
maxb=574483KB/s, mint=7301msec, maxt=7830msec
10/2048 -> WRITE: io=32768MB, aggrb=4172.7MB/s, minb=534102KB/s,
maxb=567641KB/s, mint=7389msec, maxt=7853msec
0/4096 -> WRITE: io=32768MB, aggrb=1264.8MB/s, minb=161879KB/s,
maxb=168588KB/s, mint=24879msec, maxt=25910msec
1/4096 -> WRITE: io=32768MB, aggrb=2349.4MB/s, minb=300710KB/s,
maxb=312541KB/s, mint=13420msec, maxt=13948msec
2/4096 -> WRITE: io=32768MB, aggrb=3387.6MB/s, minb=433609KB/s,
maxb=441877KB/s, mint=9492msec, maxt=9673msec
3/4096 -> WRITE: io=32768MB, aggrb=4182.3MB/s, minb=535329KB/s,
maxb=552390KB/s, mint=7593msec, maxt=7835msec *
4/4096 -> WRITE: io=32768MB, aggrb=4170.2MB/s, minb=533762KB/s,
maxb=560061KB/s, mint=7489msec, maxt=7858msec
5/4096 -> WRITE: io=32768MB, aggrb=4179.6MB/s, minb=534919KB/s,
maxb=548490KB/s, mint=7647msec, maxt=7841msec
6/4096 -> WRITE: io=32768MB, aggrb=4183.4MB/s, minb=535465KB/s,
maxb=549208KB/s, mint=7637msec, maxt=7833msec
7/4096 -> WRITE: io=32768MB, aggrb=4174.9MB/s, minb=534374KB/s,
maxb=557530KB/s, mint=7523msec, maxt=7849msec
8/4096 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s,
maxb=570188KB/s, mint=7356msec, maxt=7842msec
9/4096 -> WRITE: io=32768MB, aggrb=4180.2MB/s, minb=535056KB/s,
maxb=570110KB/s, mint=7357msec, maxt=7839msec
10/4096 -> WRITE: io=32768MB, aggrb=4183.9MB/s, minb=535534KB/s,
maxb=574640KB/s, mint=7299msec, maxt=7832msec
0/8192 -> WRITE: io=32768MB, aggrb=1260.9MB/s, minb=161381KB/s,
maxb=171511KB/s, mint=24455msec, maxt=25990msec
1/8192 -> WRITE: io=32768MB, aggrb=2368.5MB/s, minb=303166KB/s,
maxb=320444KB/s, mint=13089msec, maxt=13835msec
2/8192 -> WRITE: io=32768MB, aggrb=3408.8MB/s, minb=436225KB/s,
maxb=458544KB/s, mint=9147msec, maxt=9615msec
3/8192 -> WRITE: io=32768MB, aggrb=4219.5MB/s, minb=540085KB/s,
maxb=564585KB/s, mint=7429msec, maxt=7766msec *
4/8192 -> WRITE: io=32768MB, aggrb=4208.6MB/s, minb=538698KB/s,
maxb=570653KB/s, mint=7350msec, maxt=7786msec
5/8192 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s,
maxb=562013KB/s, mint=7463msec, maxt=7801msec
6/8192 -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s,
maxb=585387KB/s, mint=7165msec, maxt=7822msec
7/8192 -> WRITE: io=32768MB, aggrb=4184.5MB/s, minb=535602KB/s,
maxb=579323KB/s, mint=7240msec, maxt=7831msec
8/8192 -> WRITE: io=32768MB, aggrb=4186.6MB/s, minb=535876KB/s,
maxb=572132KB/s, mint=7331msec, maxt=7827msec
9/8192 -> WRITE: io=32768MB, aggrb=4176.5MB/s, minb=534578KB/s,
maxb=598246KB/s, mint=7011msec, maxt=7846msec
10/8192 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s,
maxb=580285KB/s, mint=7228msec, maxt=7830msec
0/16384 -> WRITE: io=32768MB, aggrb=1281.0MB/s, minb=163968KB/s,
maxb=183542KB/s, mint=22852msec, maxt=25580msec
1/16384 -> WRITE: io=32768MB, aggrb=2451.8MB/s, minb=313827KB/s,
maxb=337787KB/s, mint=12417msec, maxt=13365msec
2/16384 -> WRITE: io=32768MB, aggrb=3409.5MB/s, minb=436406KB/s,
maxb=468532KB/s, mint=8952msec, maxt=9611msec
3/16384 -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s,
maxb=566721KB/s, mint=7401msec, maxt=7816msec *
4/16384 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
maxb=581089KB/s, mint=7218msec, maxt=7854msec
5/16384 -> WRITE: io=32768MB, aggrb=4175.4MB/s, minb=534442KB/s,
maxb=587108KB/s, mint=7144msec, maxt=7848msec
6/16384 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
maxb=585224KB/s, mint=7167msec, maxt=7824msec
7/16384 -> WRITE: io=32768MB, aggrb=4173.8MB/s, minb=534238KB/s,
maxb=591330KB/s, mint=7093msec, maxt=7851msec
8/16384 -> WRITE: io=32768MB, aggrb=4163.2MB/s, minb=532880KB/s,
maxb=590165KB/s, mint=7107msec, maxt=7871msec
9/16384 -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s,
maxb=608664KB/s, mint=6891msec, maxt=7864msec
10/16384 -> WRITE: io=32768MB, aggrb=4157.9MB/s, minb=532204KB/s,
maxb=594768KB/s, mint=7052msec, maxt=7881msec
0/32768 -> WRITE: io=32768MB, aggrb=1288.1MB/s, minb=164980KB/s,
maxb=189026KB/s, mint=22189msec, maxt=25423msec
1/32768 -> WRITE: io=32768MB, aggrb=2443.6MB/s, minb=312774KB/s,
maxb=348624KB/s, mint=12031msec, maxt=13410msec
2/32768 -> WRITE: io=32768MB, aggrb=3467.1MB/s, minb=443888KB/s,
maxb=484722KB/s, mint=8653msec, maxt=9449msec
3/32768 -> WRITE: io=32768MB, aggrb=4131.2MB/s, minb=528782KB/s,
maxb=572444KB/s, mint=7327msec, maxt=7932msec *
4/32768 -> WRITE: io=32768MB, aggrb=4082.8MB/s, minb=522589KB/s,
maxb=606990KB/s, mint=6910msec, maxt=8026msec
5/32768 -> WRITE: io=32768MB, aggrb=3985.5MB/s, minb=510131KB/s,
maxb=578046KB/s, mint=7256msec, maxt=8222msec
6/32768 -> WRITE: io=32768MB, aggrb=3937.2MB/s, minb=504062KB/s,
maxb=591914KB/s, mint=7086msec, maxt=8321msec
7/32768 -> WRITE: io=32768MB, aggrb=4012.3MB/s, minb=513567KB/s,
maxb=583028KB/s, mint=7194msec, maxt=8167msec
8/32768 -> WRITE: io=32768MB, aggrb=3944.2MB/s, minb=504851KB/s,
maxb=567257KB/s, mint=7394msec, maxt=8308msec
9/32768 -> WRITE: io=32768MB, aggrb=3930.1MB/s, minb=503155KB/s,
maxb=580687KB/s, mint=7223msec, maxt=8336msec
10/32768 -> WRITE: io=32768MB, aggrb=3965.2MB/s, minb=507539KB/s,
maxb=599443KB/s, mint=6997msec, maxt=8264msec
Analysis:
-----------
- as expected, the minimum number of stripe cache entries causes little
variation
- additional write threads improve performance significantly
- the best results were seen with 3 or 4 write threads, which correlates
well with the number of stripes
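To pull the peak per stripe_cache_size out of a sweep like the one above, something along these lines could be used. It is a sketch that assumes the "threads/cache -> WRITE: ... aggrb=...MB/s" layout shown in the results; the function names are illustrative.

```shell
#!/bin/sh
# extract_aggrb: read sweep output on stdin, print "threads cache MB/s"
# for each result line; continuation lines are ignored.
extract_aggrb() {
    sed -n 's|^\([0-9]*\)/\([0-9]*\) -> WRITE:.*aggrb=\([0-9.]*\)MB/s.*|\1 \2 \3|p'
}

# best_per_cache: print "cache threads MB/s" for the fastest run of each
# stripe_cache_size, sorted by cache size.
best_per_cache() {
    extract_aggrb | awk '
        {
            if (!($2 in best) || ($3 + 0) > best[$2]) {
                best[$2] = $3 + 0
                thr[$2] = $1
            }
        }
        END { for (c in best) printf "%s %s %.1f\n", c, thr[c], best[c] }
    ' | sort -n
}

# Three result lines taken from the table above.
best_per_cache <<'EOF'
0/256 -> WRITE: io=32768MB, aggrb=1296.4MB/s, minb=165927KB/s,
3/256 -> WRITE: io=32768MB, aggrb=4026.6MB/s, minb=515397KB/s,
4/256 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
EOF
```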
Did you provide your fio job(s) for comparison yet?
Regards,
Heinz
P.S.: write performance tested with the following script:
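#!/bin/bash
# Sweep stripe_cache_size and group_thread_cnt on one array and print
# fio's aggregate write bandwidth line for each combination.
MD=md0
for s in 256 512 1024 2048 4096 8192 16384 32768
do
    echo $s > /sys/block/$MD/md/stripe_cache_size
    # {0..10} is brace expansion, which needs bash rather than plain sh
    for t in {0..10}
    do
        echo $t > /sys/block/$MD/md/group_thread_cnt
        echo -n "$t/$s -> "
        fio --section=w-md0 fio_md0.job 2>&1 | grep "aggrb=" | sed 's/^ *//'
    done
done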
* Re: performance of raid5 on fast devices
2017-01-17 21:04 ` Heinz Mauelshagen
@ 2017-01-18 19:25 ` Jake Yao
2017-01-20 14:58 ` Coly Li
0 siblings, 1 reply; 11+ messages in thread
From: Jake Yao @ 2017-01-18 19:25 UTC (permalink / raw)
To: Heinz Mauelshagen; +Cc: Roman Mamedov, linux-raid
Interesting. I do not see similar behavior when changing
group_thread_cnt.
The raid5 array I have is the following:
md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 0/3 pages [0KB], 65536KB chunk
/dev/md125:
Version : 1.2
Creation Time : Thu Dec 15 20:11:46 2016
Raid Level : raid5
Array Size : 943325184 (899.63 GiB 965.96 GB)
Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Jan 18 16:24:52 2017
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 32K
Name : localhost:nvme (local to host localhost)
UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
Events : 108
Number Major Minor RaidDevice State
0 259 6 0 active sync /dev/nvme0n1p1
1 259 8 1 active sync /dev/nvme1n1p1
2 259 9 2 active sync /dev/nvme2n1p1
4 259 1 3 active sync /dev/nvme3n1p1
The fio config is:
[global]
ioengine=libaio
iodepth=64
bs=96K
direct=1
thread=1
time_based=1
runtime=20
numjobs=1
loops=1
group_reporting=1
exitall
[nvme_md_wrt]
rw=write
filename=/dev/md125
[nvme_single_wrt]
rw=write
filename=/dev/nvme1n1p2
Varying group_thread_cnt, I got the following:
0 -> WRITE: io=40643MB, aggrb=2031.1MB/s, minb=2031.1MB/s,
maxb=2031.1MB/s, mint=20002msec, maxt=20002msec
1 -> WRITE: io=43740MB, aggrb=2186.7MB/s, minb=2186.7MB/s,
maxb=2186.7MB/s, mint=20003msec, maxt=20003msec
2 -> WRITE: io=43805MB, aggrb=2189.1MB/s, minb=2189.1MB/s,
maxb=2189.1MB/s, mint=20003msec, maxt=20003msec
3 -> WRITE: io=43763MB, aggrb=2187.9MB/s, minb=2187.9MB/s,
maxb=2187.9MB/s, mint=20003msec, maxt=20003msec
4 -> WRITE: io=43767MB, aggrb=2188.2MB/s, minb=2188.2MB/s,
maxb=2188.2MB/s, mint=20002msec, maxt=20002msec
5 -> WRITE: io=43767MB, aggrb=2188.4MB/s, minb=2188.4MB/s,
maxb=2188.4MB/s, mint=20003msec, maxt=20003msec
6 -> WRITE: io=43776MB, aggrb=2188.5MB/s, minb=2188.5MB/s,
maxb=2188.5MB/s, mint=20003msec, maxt=20003msec
7 -> WRITE: io=43758MB, aggrb=2187.6MB/s, minb=2187.6MB/s,
maxb=2187.6MB/s, mint=20003msec, maxt=20003msec
8 -> WRITE: io=43766MB, aggrb=2187.1MB/s, minb=2187.1MB/s,
maxb=2187.1MB/s, mint=20003msec, maxt=20003msec
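To put these numbers in proportion, the ratio between two aggrb values can be computed as below. This is only a sketch; the function name is illustrative and the inputs are MB/s figures copied from the results in this thread.

```shell
#!/bin/sh
# speedup BASE NEW: print NEW/BASE to two decimals.
# awk is used because POSIX shell arithmetic is integer-only.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}

# This array: group_thread_cnt 0 vs 1.
echo "md125, 0 -> 1 threads: $(speedup 2031.1 2186.7)x"
# Heinz's array at stripe_cache_size=256: group_thread_cnt 0 vs 4.
echo "md0,   0 -> 4 threads: $(speedup 1296.4 4172.2)x"
```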
During the test run, the md125_raid5 kernel thread ran at close to 100%
CPU, and all the kworker threads at around 10%.
My system is a VM with 6 CPUs running on ESXi, with the NVMe drives passed through.
I am wondering why there is such a difference.
Thanks!
> maxb=560061KB/s, mint=7489msec, maxt=7858msec
> 5/4096 -> WRITE: io=32768MB, aggrb=4179.6MB/s, minb=534919KB/s,
> maxb=548490KB/s, mint=7647msec, maxt=7841msec
> 6/4096 -> WRITE: io=32768MB, aggrb=4183.4MB/s, minb=535465KB/s,
> maxb=549208KB/s, mint=7637msec, maxt=7833msec
> 7/4096 -> WRITE: io=32768MB, aggrb=4174.9MB/s, minb=534374KB/s,
> maxb=557530KB/s, mint=7523msec, maxt=7849msec
> 8/4096 -> WRITE: io=32768MB, aggrb=4178.6MB/s, minb=534851KB/s,
> maxb=570188KB/s, mint=7356msec, maxt=7842msec
> 9/4096 -> WRITE: io=32768MB, aggrb=4180.2MB/s, minb=535056KB/s,
> maxb=570110KB/s, mint=7357msec, maxt=7839msec
> 10/4096 -> WRITE: io=32768MB, aggrb=4183.9MB/s, minb=535534KB/s,
> maxb=574640KB/s, mint=7299msec, maxt=7832msec
>
> 0/8192 -> WRITE: io=32768MB, aggrb=1260.9MB/s, minb=161381KB/s,
> maxb=171511KB/s, mint=24455msec, maxt=25990msec
> 1/8192 -> WRITE: io=32768MB, aggrb=2368.5MB/s, minb=303166KB/s,
> maxb=320444KB/s, mint=13089msec, maxt=13835msec
> 2/8192 -> WRITE: io=32768MB, aggrb=3408.8MB/s, minb=436225KB/s,
> maxb=458544KB/s, mint=9147msec, maxt=9615msec
> 3/8192 -> WRITE: io=32768MB, aggrb=4219.5MB/s, minb=540085KB/s,
> maxb=564585KB/s, mint=7429msec, maxt=7766msec *
> 4/8192 -> WRITE: io=32768MB, aggrb=4208.6MB/s, minb=538698KB/s,
> maxb=570653KB/s, mint=7350msec, maxt=7786msec
> 5/8192 -> WRITE: io=32768MB, aggrb=4200.5MB/s, minb=537662KB/s,
> maxb=562013KB/s, mint=7463msec, maxt=7801msec
> 6/8192 -> WRITE: io=32768MB, aggrb=4189.3MB/s, minb=536218KB/s,
> maxb=585387KB/s, mint=7165msec, maxt=7822msec
> 7/8192 -> WRITE: io=32768MB, aggrb=4184.5MB/s, minb=535602KB/s,
> maxb=579323KB/s, mint=7240msec, maxt=7831msec
> 8/8192 -> WRITE: io=32768MB, aggrb=4186.6MB/s, minb=535876KB/s,
> maxb=572132KB/s, mint=7331msec, maxt=7827msec
> 9/8192 -> WRITE: io=32768MB, aggrb=4176.5MB/s, minb=534578KB/s,
> maxb=598246KB/s, mint=7011msec, maxt=7846msec
> 10/8192 -> WRITE: io=32768MB, aggrb=4184.1MB/s, minb=535671KB/s,
> maxb=580285KB/s, mint=7228msec, maxt=7830msec
>
> 0/16384 -> WRITE: io=32768MB, aggrb=1281.0MB/s, minb=163968KB/s,
> maxb=183542KB/s, mint=22852msec, maxt=25580msec
> 1/16384 -> WRITE: io=32768MB, aggrb=2451.8MB/s, minb=313827KB/s,
> maxb=337787KB/s, mint=12417msec, maxt=13365msec
> 2/16384 -> WRITE: io=32768MB, aggrb=3409.5MB/s, minb=436406KB/s,
> maxb=468532KB/s, mint=8952msec, maxt=9611msec
> 3/16384 -> WRITE: io=32768MB, aggrb=4192.5MB/s, minb=536630KB/s,
> maxb=566721KB/s, mint=7401msec, maxt=7816msec *
> 4/16384 -> WRITE: io=32768MB, aggrb=4172.2MB/s, minb=534034KB/s,
> maxb=581089KB/s, mint=7218msec, maxt=7854msec
> 5/16384 -> WRITE: io=32768MB, aggrb=4175.4MB/s, minb=534442KB/s,
> maxb=587108KB/s, mint=7144msec, maxt=7848msec
> 6/16384 -> WRITE: io=32768MB, aggrb=4188.2MB/s, minb=536081KB/s,
> maxb=585224KB/s, mint=7167msec, maxt=7824msec
> 7/16384 -> WRITE: io=32768MB, aggrb=4173.8MB/s, minb=534238KB/s,
> maxb=591330KB/s, mint=7093msec, maxt=7851msec
> 8/16384 -> WRITE: io=32768MB, aggrb=4163.2MB/s, minb=532880KB/s,
> maxb=590165KB/s, mint=7107msec, maxt=7871msec
> 9/16384 -> WRITE: io=32768MB, aggrb=4166.9MB/s, minb=533355KB/s,
> maxb=608664KB/s, mint=6891msec, maxt=7864msec
> 10/16384 -> WRITE: io=32768MB, aggrb=4157.9MB/s, minb=532204KB/s,
> maxb=594768KB/s, mint=7052msec, maxt=7881msec
>
> 0/32768 -> WRITE: io=32768MB, aggrb=1288.1MB/s, minb=164980KB/s,
> maxb=189026KB/s, mint=22189msec, maxt=25423msec
> 1/32768 -> WRITE: io=32768MB, aggrb=2443.6MB/s, minb=312774KB/s,
> maxb=348624KB/s, mint=12031msec, maxt=13410msec
> 2/32768 -> WRITE: io=32768MB, aggrb=3467.1MB/s, minb=443888KB/s,
> maxb=484722KB/s, mint=8653msec, maxt=9449msec
> 3/32768 -> WRITE: io=32768MB, aggrb=4131.2MB/s, minb=528782KB/s,
> maxb=572444KB/s, mint=7327msec, maxt=7932msec *
> 4/32768 -> WRITE: io=32768MB, aggrb=4082.8MB/s, minb=522589KB/s,
> maxb=606990KB/s, mint=6910msec, maxt=8026msec
> 5/32768 -> WRITE: io=32768MB, aggrb=3985.5MB/s, minb=510131KB/s,
> maxb=578046KB/s, mint=7256msec, maxt=8222msec
> 6/32768 -> WRITE: io=32768MB, aggrb=3937.2MB/s, minb=504062KB/s,
> maxb=591914KB/s, mint=7086msec, maxt=8321msec
> 7/32768 -> WRITE: io=32768MB, aggrb=4012.3MB/s, minb=513567KB/s,
> maxb=583028KB/s, mint=7194msec, maxt=8167msec
> 8/32768 -> WRITE: io=32768MB, aggrb=3944.2MB/s, minb=504851KB/s,
> maxb=567257KB/s, mint=7394msec, maxt=8308msec
> 9/32768 -> WRITE: io=32768MB, aggrb=3930.1MB/s, minb=503155KB/s,
> maxb=580687KB/s, mint=7223msec, maxt=8336msec
> 10/32768 -> WRITE: io=32768MB, aggrb=3965.2MB/s, minb=507539KB/s,
> maxb=599443KB/s, mint=6997msec, maxt=8264msec
>
>
> Analysis:
> -----------
> - the minimum number of stripe cache entries doesn't cause much variation,
> as expected
> - additional writing threads improve performance significantly
> - the best results are seen with 3 or 4 writing threads, which correlates
> well with the # of stripes
>
>
> Did you provide your fio job(s) for comparison yet?
>
> Regards,
> Heinz
>
> P.S.: write performance tested with the following script:
>
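> #!/bin/bash
> # bash is required for the {0..10} brace expansion used below
>
> MD=md0
>
> # sweep the stripe cache size
> for s in 256 512 1024 2048 4096 8192 16384 32768
> do
>     echo $s > /sys/block/$MD/md/stripe_cache_size
>
>     # sweep the number of raid5 worker threads (0 disables them)
>     for t in {0..10}
>     do
>         echo $t > /sys/block/$MD/md/group_thread_cnt
>         echo -n "$t/$s -> "
>         fio --section=w-md0 fio_md0.job 2>&1 | grep "aggrb=" | sed 's/^ *//'
>     done
> done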
>
>
>
>
> On 01/17/2017 04:28 PM, Jake Yao wrote:
>>
>> Thanks for the response.
>>
>> I am using fio for performance measurement.
>>
>> The chunk size of raid5 array is 32K, and the block size in fio is set
>> to 96K(3x chunk size) which is also the optimal_io_size, ioengine is
>> set to libaio with direct IO.
>>
>> Increasing stripe_cache_size does not help much, and it looks like the
>> write is limited by the single kernel thread as mentioned earlier.
>>
>>
>> On Tue, Jan 17, 2017 at 12:10 AM, Roman Mamedov <rm@romanrm.net> wrote:
>>>
>>> On Mon, 16 Jan 2017 21:35:21 -0500
>>> Jake Yao <jgyao1@gmail.com> wrote:
>>>
>>>> I have a raid5 array on 4 NVMe drives, and the performance on the
>>>> array is only marginally better than a single drive. Unlike a similar
>>>> raid5 array on 4 SAS SSD or HDD, the performance on array is 3x
>>>> better than a single drive, which is expected.
>>>>
>>>> It looks like when the single kernel thread associated with the raid
>>>> device running at 100%, the array performance hit its peak. This can
>>>> happen easily for fast devices like NVMe.
>>>>
>>>> This can reproduced by creating a raid5 with 4 ramdisks as well, and
>>>> comparing performance on the array and one ramdisk. Sometimes the
>>>> performance on the array is worse than a single ramdisk.
>>>>
>>>> The kernel version is 4.9.0-rc3 and mdadm is release 3.4, no write
>>>> journal is configured.
>>>>
>>>> Is this a known issue?
>>>
>>> How do you measure the performance?
>>>
>>> Sure it may be CPU-bound in the end, but also why not try the usual
>>> optimization tricks, such as:
>>>
>>> * increase your stripe_cache_size, it's not uncommon that this can speed up
>>>   linear writes by as much as several times;
>>>
>>> * if you meant reads, you could look into read-ahead settings for the array;
>>>
>>> * and in both cases, try experimenting with different stripe sizes (if you
>>>   were using 512K, try with 64K stripes).
>>>
>>> --
>>> With respect,
>>> Roman
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: performance of raid5 on fast devices
2017-01-18 19:25 ` Jake Yao
@ 2017-01-20 14:58 ` Coly Li
2017-01-23 22:20 ` Jake Yao
0 siblings, 1 reply; 11+ messages in thread
From: Coly Li @ 2017-01-20 14:58 UTC (permalink / raw)
To: Jake Yao; +Cc: Heinz Mauelshagen, Roman Mamedov, linux-raid
On 2017/1/19 3:25 AM, Jake Yao wrote:
> It is interesting. I do not see the similar behavior with the change
> of group_thread_cnt.
>
> The raid5 I have is following:
>
> md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
> 943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
> bitmap: 0/3 pages [0KB], 65536KB chunk
>
> /dev/md125:
> Version : 1.2
> Creation Time : Thu Dec 15 20:11:46 2016
> Raid Level : raid5
> Array Size : 943325184 (899.63 GiB 965.96 GB)
> Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
> Raid Devices : 4
> Total Devices : 4
> Persistence : Superblock is persistent
>
> Intent Bitmap : Internal
>
> Update Time : Wed Jan 18 16:24:52 2017
> State : clean
> Active Devices : 4
> Working Devices : 4
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 32K
>
> Name : localhost:nvme (local to host localhost)
> UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
> Events : 108
>
> Number Major Minor RaidDevice State
> 0 259 6 0 active sync /dev/nvme0n1p1
> 1 259 8 1 active sync /dev/nvme1n1p1
> 2 259 9 2 active sync /dev/nvme2n1p1
> 4 259 1 3 active sync /dev/nvme3n1p1
>
> The fio config is:
>
> [global]
> ioengine=libaio
> iodepth=64
> bs=96K
> direct=1
> thread=1
> time_based=1
> runtime=20
> numjobs=1
You only have 1 I/O thread; the bottleneck is here. Try numjobs=8.
> loops=1
> group_reporting=1
> exitall
[snip]
Coly
* Re: performance of raid5 on fast devices
2017-01-20 14:58 ` Coly Li
@ 2017-01-23 22:20 ` Jake Yao
2017-01-24 7:11 ` Coly Li
0 siblings, 1 reply; 11+ messages in thread
From: Jake Yao @ 2017-01-23 22:20 UTC (permalink / raw)
To: Coly Li; +Cc: Heinz Mauelshagen, Roman Mamedov, linux-raid
I ran the tests with multiple I/O threads, but that does not seem to
affect the overall performance.
This run used 8 I/O threads:
[global]
ioengine=libaio
iodepth=64
bs=192k
direct=1
thread=1
time_based=1
runtime=20
numjobs=8
loops=1
group_reporting=1
rwmixread=70
rwmixwrite=30
exitall
#
# end of global
#
[nvme_md_write]
rw=write
filename=/dev/md127
runtime=20
[nvme_drv_write]
rw=write
filename=/dev/nvme1n1p2
runtime=20
I got the following results for the NVMe-based raid5 and a single drive:
md thrd-cnt 0: write: io=27992MB, bw=1397.5MB/s, iops=7452, runt= 20031msec
md thrd-cnt 1: write: io=43065MB, bw=2148.6MB/s, iops=11458, runt= 20044msec
md thrd-cnt 2: write: io=43209MB, bw=2155.9MB/s, iops=11497, runt= 20043msec
md thrd-cnt 3: write: io=43163MB, bw=2153.9MB/s, iops=11487, runt= 20040msec
md thrd-cnt 4: write: io=43316MB, bw=2163.2MB/s, iops=11536, runt= 20024msec
md thrd-cnt 5: write: io=43390MB, bw=2164.7MB/s, iops=11544, runt= 20045msec
md thrd-cnt 6: write: io=43295MB, bw=2160.2MB/s, iops=11521, runt= 20042msec
single drive: write: io=36004MB, bw=1795.4MB/s, iops=9575, runt= 20054msec
It also shows little effect on the SSD-based raid5 and single
drive. The fio config is the same as above, with only the corresponding
device filenames changed. The results are as follows:
md thrd-cnt 0: write: io=13646MB, bw=696242KB/s, iops=3626, runt= 20070msec
md thrd-cnt 1: write: io=24519MB, bw=1221.5MB/s, iops=6514, runt= 20074msec
md thrd-cnt 2: write: io=24780MB, bw=1234.9MB/s, iops=6585, runt= 20068msec
md thrd-cnt 3: write: io=24890MB, bw=1240.2MB/s, iops=6613, runt= 20072msec
md thrd-cnt 4: write: io=24937MB, bw=1242.5MB/s, iops=6626, runt= 20071msec
md thrd-cnt 5: write: io=24948MB, bw=1242.9MB/s, iops=6628, runt= 20073msec
md thrd-cnt 6: write: io=24701MB, bw=1230.1MB/s, iops=6564, runt= 20068msec
single drive: write: io=8389.4MB, bw=428184KB/s, iops=2230, runt= 20063msec
In the SSD case, the raid5 array is about 3x faster than a single drive.
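For reference, the ratios behind these statements can be recomputed from the fio lines above. This is only a sketch: `speedup` and `kb_to_mb` are hypothetical helper names, the array figures are taken from the thrd-cnt 0 and thrd-cnt 5 rows, and fio's KB/s values are assumed to be 1024-based, which matches the io= and runt= values it reports.

```shell
#!/bin/sh
# speedup ARRAY_MBPS SINGLE_MBPS: print the array/single-drive bandwidth ratio.
# Hypothetical helper name, not part of fio or mdadm.
speedup() {
    awk -v a="$1" -v s="$2" 'BEGIN { printf "%.2f\n", a / s }'
}

# kb_to_mb KBPS: convert a fio KB/s figure to MB/s (assumes 1 MB = 1024 KB).
kb_to_mb() {
    awk -v k="$1" 'BEGIN { printf "%.1f\n", k / 1024 }'
}

# NVMe: thrd-cnt 0 and thrd-cnt 5 against the single drive
speedup 1397.5 1795.4
speedup 2164.7 1795.4

# SSD: thrd-cnt 5 against the single drive (reported as 428184KB/s)
speedup 1242.9 "$(kb_to_mb 428184)"
```

With the numbers above this prints 0.78, 1.21 and 2.97: the NVMe array is slower than one drive without worker threads and only about 1.2x faster with them, while the SSD array reaches roughly 3x.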
On Fri, Jan 20, 2017 at 9:58 AM, Coly Li <colyli@suse.de> wrote:
> On 2017/1/19 3:25 AM, Jake Yao wrote:
>> It is interesting. I do not see the similar behavior with the change
>> of group_thread_cnt.
>>
>> The raid5 I have is following:
>>
>> md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
>> 943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
>> bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> /dev/md125:
>> Version : 1.2
>> Creation Time : Thu Dec 15 20:11:46 2016
>> Raid Level : raid5
>> Array Size : 943325184 (899.63 GiB 965.96 GB)
>> Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
>> Raid Devices : 4
>> Total Devices : 4
>> Persistence : Superblock is persistent
>>
>> Intent Bitmap : Internal
>>
>> Update Time : Wed Jan 18 16:24:52 2017
>> State : clean
>> Active Devices : 4
>> Working Devices : 4
>> Failed Devices : 0
>> Spare Devices : 0
>>
>> Layout : left-symmetric
>> Chunk Size : 32K
>>
>> Name : localhost:nvme (local to host localhost)
>> UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
>> Events : 108
>>
>> Number Major Minor RaidDevice State
>> 0 259 6 0 active sync /dev/nvme0n1p1
>> 1 259 8 1 active sync /dev/nvme1n1p1
>> 2 259 9 2 active sync /dev/nvme2n1p1
>> 4 259 1 3 active sync /dev/nvme3n1p1
>>
>> The fio config is:
>>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> bs=96K
>> direct=1
>> thread=1
>> time_based=1
>> runtime=20
>> numjobs=1
>
> You only have 1 I/O thread; the bottleneck is here. Try numjobs=8.
>
>> loops=1
>> group_reporting=1
>> exitall
> [snip]
>
> Coly
* Re: performance of raid5 on fast devices
2017-01-23 22:20 ` Jake Yao
@ 2017-01-24 7:11 ` Coly Li
0 siblings, 0 replies; 11+ messages in thread
From: Coly Li @ 2017-01-24 7:11 UTC (permalink / raw)
To: Jake Yao; +Cc: Heinz Mauelshagen, Roman Mamedov, linux-raid
Hi Jake,
Hmm, is the hardware powerful enough? When I did similar testing, I
used a machine with 2x 10-core Xeon CPUs and 80GB of memory.
And could you please try bs=64K? I got good performance numbers with a
64KB block size.
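When comparing block sizes it may help to check them against the full-stripe width, since raid5 writes that do not cover a whole stripe generally need extra reads or parity work. The sketch below assumes the array under test has the 32K chunk and 4 devices posted earlier for md125 (the geometry of /dev/md127 was not posted), and `full_stripe_kb` and `is_full_stripe` are hypothetical helper names.

```shell
#!/bin/sh
# full_stripe_kb CHUNK_KB NDEVS: data payload of one raid5 stripe,
# i.e. one chunk per device minus the one chunk used for parity.
full_stripe_kb() {
    echo $(( $1 * ($2 - 1) ))
}

# is_full_stripe BS_KB CHUNK_KB NDEVS: succeed if BS is a whole number of stripes.
is_full_stripe() {
    [ $(( $1 % $(full_stripe_kb "$2" "$3") )) -eq 0 ]
}

# Assumed geometry: 32K chunk, 4 devices (taken from the md125 output).
for bs in 64 96 192; do
    if is_full_stripe "$bs" 32 4; then
        echo "bs=${bs}K: multiple of the $(full_stripe_kb 32 4)K full stripe"
    else
        echo "bs=${bs}K: partial stripe"
    fi
done
```

Under that assumption 96K and 192K are full-stripe multiples and 64K is not, so results for bs=64K on this array are not directly comparable with results from an array that has a different chunk size.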
Could you also have a look at the top output: are all the CPUs 100%
utilized, or are some CPUs still idle?
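If top is awkward to capture while fio runs, per-CPU utilization can also be derived from two snapshots of /proc/stat. This is a minimal sketch for Linux only: `cpu_busy_pct` is a hypothetical helper name, the 2 second window is arbitrary, and iowait is counted as idle time.

```shell
#!/bin/sh
# cpu_busy_pct SNAP1 SNAP2: print the per-CPU busy percentage between two
# saved copies of /proc/stat. Hypothetical helper, not part of any tool.
cpu_busy_pct() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            # fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                      # idle + iowait
            if (FNR == NR) {                    # first snapshot: remember
                t0[$1] = total; i0[$1] = idle
            } else if ($1 in t0) {              # second snapshot: report delta
                dt = total - t0[$1]
                if (dt > 0)
                    printf "%s %.0f%% busy\n", $1, 100 * (dt - (idle - i0[$1])) / dt
            }
        }
    ' "$1" "$2"
}

# Take the two snapshots while the fio job is running.
if [ -r /proc/stat ]; then
    tmp=$(mktemp -d)
    cat /proc/stat > "$tmp/a"; sleep 2; cat /proc/stat > "$tmp/b"
    cpu_busy_pct "$tmp/a" "$tmp/b"
    rm -rf "$tmp"
fi
```

One CPU pinned near 100% while the rest stay mostly idle would point at a single saturated thread rather than at the machine as a whole.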
Coly
On 2017/1/24 6:20 AM, Jake Yao wrote:
> I ran the tests with multiple I/O threads, but that does not seem to
> affect the overall performance.
>
> This run used 8 I/O threads:
>
> [global]
> ioengine=libaio
> iodepth=64
> bs=192k
> direct=1
> thread=1
> time_based=1
> runtime=20
> numjobs=8
> loops=1
> group_reporting=1
> rwmixread=70
> rwmixwrite=30
> exitall
> #
> # end of global
> #
> [nvme_md_write]
> rw=write
> filename=/dev/md127
> runtime=20
>
> [nvme_drv_write]
> rw=write
> filename=/dev/nvme1n1p2
> runtime=20
>
> I got the following results for the NVMe-based raid5 and a single drive:
>
> md thrd-cnt 0: write: io=27992MB, bw=1397.5MB/s, iops=7452, runt= 20031msec
> md thrd-cnt 1: write: io=43065MB, bw=2148.6MB/s, iops=11458, runt= 20044msec
> md thrd-cnt 2: write: io=43209MB, bw=2155.9MB/s, iops=11497, runt= 20043msec
> md thrd-cnt 3: write: io=43163MB, bw=2153.9MB/s, iops=11487, runt= 20040msec
> md thrd-cnt 4: write: io=43316MB, bw=2163.2MB/s, iops=11536, runt= 20024msec
> md thrd-cnt 5: write: io=43390MB, bw=2164.7MB/s, iops=11544, runt= 20045msec
> md thrd-cnt 6: write: io=43295MB, bw=2160.2MB/s, iops=11521, runt= 20042msec
> single drive: write: io=36004MB, bw=1795.4MB/s, iops=9575, runt= 20054msec
>
> It also shows little effect on the SSD-based raid5 and single
> drive. The fio config is the same as above, with only the corresponding
> device filenames changed. The results are as follows:
>
> md thrd-cnt 0: write: io=13646MB, bw=696242KB/s, iops=3626, runt= 20070msec
> md thrd-cnt 1: write: io=24519MB, bw=1221.5MB/s, iops=6514, runt= 20074msec
> md thrd-cnt 2: write: io=24780MB, bw=1234.9MB/s, iops=6585, runt= 20068msec
> md thrd-cnt 3: write: io=24890MB, bw=1240.2MB/s, iops=6613, runt= 20072msec
> md thrd-cnt 4: write: io=24937MB, bw=1242.5MB/s, iops=6626, runt= 20071msec
> md thrd-cnt 5: write: io=24948MB, bw=1242.9MB/s, iops=6628, runt= 20073msec
> md thrd-cnt 6: write: io=24701MB, bw=1230.1MB/s, iops=6564, runt= 20068msec
> single drive: write: io=8389.4MB, bw=428184KB/s, iops=2230, runt= 20063msec
>
> In the SSD case, the raid5 array is about 3x faster than a single drive.
>
> On Fri, Jan 20, 2017 at 9:58 AM, Coly Li <colyli@suse.de> wrote:
>> On 2017/1/19 3:25 AM, Jake Yao wrote:
>>> It is interesting. I do not see the similar behavior with the change
>>> of group_thread_cnt.
>>>
>>> The raid5 I have is following:
>>>
>>> md125 : active raid5 nvme0n1p1[0] nvme2n1p1[2] nvme1n1p1[1] nvme3n1p1[4]
>>> 943325184 blocks super 1.2 level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
>>> bitmap: 0/3 pages [0KB], 65536KB chunk
>>>
>>> /dev/md125:
>>> Version : 1.2
>>> Creation Time : Thu Dec 15 20:11:46 2016
>>> Raid Level : raid5
>>> Array Size : 943325184 (899.63 GiB 965.96 GB)
>>> Used Dev Size : 314441728 (299.88 GiB 321.99 GB)
>>> Raid Devices : 4
>>> Total Devices : 4
>>> Persistence : Superblock is persistent
>>>
>>> Intent Bitmap : Internal
>>>
>>> Update Time : Wed Jan 18 16:24:52 2017
>>> State : clean
>>> Active Devices : 4
>>> Working Devices : 4
>>> Failed Devices : 0
>>> Spare Devices : 0
>>>
>>> Layout : left-symmetric
>>> Chunk Size : 32K
>>>
>>> Name : localhost:nvme (local to host localhost)
>>> UUID : 477a94af:79f5a10a:0d513dc6:7f5e670d
>>> Events : 108
>>>
>>> Number Major Minor RaidDevice State
>>> 0 259 6 0 active sync /dev/nvme0n1p1
>>> 1 259 8 1 active sync /dev/nvme1n1p1
>>> 2 259 9 2 active sync /dev/nvme2n1p1
>>> 4 259 1 3 active sync /dev/nvme3n1p1
>>>
>>> The fio config is:
>>>
>>> [global]
>>> ioengine=libaio
>>> iodepth=64
>>> bs=96K
>>> direct=1
>>> thread=1
>>> time_based=1
>>> runtime=20
>>> numjobs=1
>>
>> You only have 1 I/O thread; the bottleneck is here. Try numjobs=8.
>>
>>> loops=1
>>> group_reporting=1
>>> exitall
>> [snip]
>>
>> Coly
Thread overview: 11+ messages
2017-01-17 2:35 performance of raid5 on fast devices Jake Yao
2017-01-17 3:10 ` Stan Hoeppner
2017-01-17 5:04 ` Coly Li
2017-01-17 15:22 ` Jake Yao
2017-01-17 5:10 ` Roman Mamedov
2017-01-17 15:28 ` Jake Yao
2017-01-17 21:04 ` Heinz Mauelshagen
2017-01-18 19:25 ` Jake Yao
2017-01-20 14:58 ` Coly Li
2017-01-23 22:20 ` Jake Yao
2017-01-24 7:11 ` Coly Li