* 4x lower IOPS: Linux MD vs indiv. devices - why?
@ 2017-01-23 16:26 Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
2017-01-23 18:18 ` Kudryavtsev, Andrey O
0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 16:26 UTC (permalink / raw)
To: fio
Hi,
I have a question regarding Linux software RAID (MD) as tested with fio - so
this is slightly off-topic, but I am hoping for expert advice or redirection
to a more appropriate place (if this is unwelcome here).
I have a box with this HW:
- 88 cores Xeon E7 (176 HTs) + 3TB RAM
- 8 x Intel P3608 4TB NVMe (which is logically 16 NVMe devices)
With random 4kB read load, I am able to max it out at 7 million IOPS -
but only if I run FIO on the _individual_ NVMe devices.
[global]
group_reporting
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120
[randread]
stonewall
rw=randread
numjobs=2560
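As a rough back-of-the-envelope check (a sketch; it assumes fio spreads the jobs evenly over the 16 devices), this job file keeps on the order of 160 synchronous reads in flight per device:

```python
# Offered load implied by the job file above (illustrative arithmetic only).
num_jobs = 2560      # numjobs in the [randread] section
num_devices = 16     # devices listed in filename=
per_device = num_jobs // num_devices
print(per_device)    # in-flight 4k reads per device, assuming an even spread
```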
When I create a stripe set over all devices:
sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
/dev/nvme0n1 \
/dev/nvme1n1 \
/dev/nvme2n1 \
/dev/nvme3n1 \
/dev/nvme4n1 \
/dev/nvme5n1 \
/dev/nvme6n1 \
/dev/nvme7n1 \
/dev/nvme8n1 \
/dev/nvme9n1 \
/dev/nvme10n1 \
/dev/nvme11n1 \
/dev/nvme12n1 \
/dev/nvme13n1 \
/dev/nvme14n1 \
/dev/nvme15n1
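For what it's worth, the stripe geometry implied by that command can be sanity-checked with a little arithmetic (a sketch; the mdadm line in the comment is only for a live system and is not run here):

```shell
# Stripe geometry implied by the mdadm command above.
CHUNK_KB=8                        # --chunk=8 (KiB)
DEVICES=16                        # --raid-devices=16
STRIPE_KB=$((CHUNK_KB * DEVICES)) # full stripe width across all members
echo "full stripe width: ${STRIPE_KB} KiB"
# On the live system, confirm with: mdadm --detail /dev/md1
```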
I only get 1.6 million IOPS. Detailed results are below.
Note: the array is created with a chunk size of 8K because this is for a
database workload. Here I tested with 4k block size, but the results are
similar (lower performance on MD) with 8k.
Any help or hints would be greatly appreciated!
Cheers,
/Tobias
7 million IOPS on raw, individual NVMe devices
==============================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2367 (f=29896):
[_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23
15:47:17 2017
read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
clat percentiles (usec):
| 1.00th=[ 114], 5.00th=[ 135], 10.00th=[ 149], 20.00th=[ 171],
| 30.00th=[ 191], 40.00th=[ 213], 50.00th=[ 239], 60.00th=[ 270],
| 70.00th=[ 314], 80.00th=[ 378], 90.00th=[ 556], 95.00th=[ 980],
| 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
| 99.99th=[ 8096]
lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
lat (usec) : 1000=1.79%
lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
cpu : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s
(28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec
Disk stats (read/write):
nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0,
in_queue=14802400, util=100.00%
nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0,
in_queue=15101276, util=100.00%
nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0,
in_queue=12053112, util=100.00%
nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0,
in_queue=11135004, util=100.00%
nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0,
in_queue=21079576, util=100.00%
nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0,
in_queue=19393024, util=100.00%
nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0,
in_queue=20140104, util=100.00%
nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0,
in_queue=21090048, util=100.00%
nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0,
in_queue=14929172, util=100.00%
nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0,
in_queue=13919288, util=100.00%
nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0,
in_queue=11390392, util=100.00%
nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0,
in_queue=20110288, util=100.00%
nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0,
in_queue=11683568, util=100.00%
nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0,
in_queue=16314628, util=100.00%
nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0,
in_queue=27659920, util=100.00%
nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0,
in_queue=17910636, util=100.00%
1.6 million IOPS on Linux MD over 16 NVMe devices
=================================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23
17:21:15 2017
read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
clat percentiles (usec):
| 1.00th=[ 78], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 89],
| 30.00th=[ 95], 40.00th=[ 102], 50.00th=[ 105], 60.00th=[ 108],
| 70.00th=[ 118], 80.00th=[ 133], 90.00th=[ 173], 95.00th=[ 221],
| 99.00th=[ 358], 99.50th=[ 506], 99.90th=[ 2192], 99.95th=[ 2608],
| 99.99th=[ 2960]
lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
lat (usec) : 1000=0.07%
lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s
(6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec
Disk stats (read/write):
md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0,
aggrin_queue=1247601, aggrutil=100.00%
nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0,
in_queue=1225896, util=100.00%
nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0,
in_queue=1191452, util=100.00%
nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0,
in_queue=1296728, util=100.00%
nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0,
in_queue=1239808, util=100.00%
nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0,
in_queue=1272916, util=100.00%
nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0,
in_queue=1178360, util=100.00%
nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0,
in_queue=1207808, util=100.00%
nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0,
in_queue=1258956, util=100.00%
nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0,
in_queue=1304536, util=100.00%
nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0,
in_queue=1281952, util=100.00%
nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0,
in_queue=1271820, util=100.00%
nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0,
in_queue=1224192, util=100.00%
nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0,
in_queue=1214240, util=100.00%
nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0,
in_queue=1242372, util=100.00%
nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0,
in_queue=1277600, util=100.00%
nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0,
in_queue=1272988, util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 17:52 ` Tobias Oberstein
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 17:52 UTC (permalink / raw)
To: Andrey Kuzmin, fio
On 23.01.2017 at 18:03, Andrey Kuzmin wrote:
> Why don't you just 'perf' your md run and find out where it spends (an
> awful lot if extra) time?
Good idea!
I ran with threads=1024 (to account for perf overhead). At that
concurrency, Linux MD delivers 25% lower IOPS and higher system load.
Please see here:
https://github.com/oberstet/scratchbox/tree/master/cruncher/sql19/linux-md-bottleneck
With higher concurrency, the discrepancy widens, up to 7 million vs 1.6
million IOPS.
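(For reference, the ratio behind the subject line, computed from the IOPS figures reported by the two runs:)

```python
# "4x lower": ratio of the two measured IOPS figures from the runs above.
raw_iops = 6_965_000   # IOPS=6965k on raw, individual NVMe devices
md_iops = 1_588_000    # IOPS=1588k on /dev/md1
print(round(raw_iops / md_iops, 1))
```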
I am not a kernel hacker.
What is osq_lock?
FWIW, this is a NUMA machine with 4 x E7 (88 cores / 176 HT) and 8 x
Intel P3608 NVMe.
Any hints or anything I should try / measure?
Thanks a lot for your tips and assistance!
Cheers,
/Tobias
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 18:18 ` Kudryavtsev, Andrey O
2017-01-23 18:53 ` Tobias Oberstein
1 sibling, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 18:18 UTC (permalink / raw)
To: Tobias Oberstein, fio
Hi Tobias,
MDRAID overhead is always there, but you can play with some tuning knobs.
I recommend the following:
1. Use many threads/jobs with a fairly high queue-depth configuration. The
highest IOPS for Intel P3xxx drives is achieved when you saturate them with
128 4k IOs in flight per drive. This can be done with 32 jobs at QD4, or 16
jobs at QD8, and so on. With MDRAID on top of that, you should multiply by
the number of drives in the array. So I think the current problem is that
you're simply not submitting enough IOs.
2. Changing the SSD's hardware sector size to 4k may also help, if you're
sure that your workload is always 4k-granular.
3. Finally, use the "imsm" MDRAID extensions and the latest mdadm build.
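Point 1 could translate into a job file roughly like this (a sketch only, not a config from the thread: with 16 drives in the array, 32 jobs at iodepth=64 keep 2048 IOs in flight, i.e. 128 per drive; the device path and section name are assumptions):

```ini
[global]
ioengine=libaio
direct=1
bs=4k
time_based=1
runtime=120
norandommap=1
randrepeat=0
group_reporting

[randread-md]
filename=/dev/md1
rw=randread
numjobs=32
iodepth=64
```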
See some other hints there:
http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
some config examples for NVMe are here:
https://github.com/01org/fiovisualizer/tree/master/Workloads
--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281
On 1/23/17, 8:26 AM, Tobias Oberstein <tobias.oberstein@gmail.com> wrote:
Hi,
I have a question rgd Linux software RAID (MD) as tested with FIO - so
this is slightly OT, but I am hoping for expert advice or redirection to
a more appropriate place (if this is unwelcome here).
I have a box with this HW:
- 88 cores Xeon E7 (176 HTs) + 3TB RAM
- 8 x Intel P3608 4TB NVMe (which is logicall 16 NVMes)
With random 4kB read load, I am able to max it out at 7 million IOPS -
but only if I run FIO on the _individual_ NVMe devices.
[global]
group_reporting
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120
[randread]
stonewall
rw=randread
numjobs=2560
When I create a stripe set over all devices:
sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
/dev/nvme0n1 \
/dev/nvme1n1 \
/dev/nvme2n1 \
/dev/nvme3n1 \
/dev/nvme4n1 \
/dev/nvme5n1 \
/dev/nvme6n1 \
/dev/nvme7n1 \
/dev/nvme8n1 \
/dev/nvme9n1 \
/dev/nvme10n1 \
/dev/nvme11n1 \
/dev/nvme12n1 \
/dev/nvme13n1 \
/dev/nvme14n1 \
/dev/nvme15n1
I only get 1.6 million IOPS. Detail results down below.
Note: the array is created with chunk size 8K because this is for
database workload. Here I tested with 4k block size, but the it's
similar (lower perf on MD) with 8k
Any helps or hints would be greatly appreciated!
Cheers,
/Tobias
7 million IOPS on raw, individual NVMe devices
==============================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2367 (f=29896):
[_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23
15:47:17 2017
read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
clat percentiles (usec):
| 1.00th=[ 114], 5.00th=[ 135], 10.00th=[ 149], 20.00th=[ 171],
| 30.00th=[ 191], 40.00th=[ 213], 50.00th=[ 239], 60.00th=[ 270],
| 70.00th=[ 314], 80.00th=[ 378], 90.00th=[ 556], 95.00th=[ 980],
| 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
| 99.99th=[ 8096]
lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
lat (usec) : 1000=1.79%
lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
cpu : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s
(28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec
Disk stats (read/write):
nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0,
in_queue=14802400, util=100.00%
nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0,
in_queue=15101276, util=100.00%
nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0,
in_queue=12053112, util=100.00%
nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0,
in_queue=11135004, util=100.00%
nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0,
in_queue=21079576, util=100.00%
nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0,
in_queue=19393024, util=100.00%
nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0,
in_queue=20140104, util=100.00%
nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0,
in_queue=21090048, util=100.00%
nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0,
in_queue=14929172, util=100.00%
nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0,
in_queue=13919288, util=100.00%
nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0,
in_queue=11390392, util=100.00%
nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0,
in_queue=20110288, util=100.00%
nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0,
in_queue=11683568, util=100.00%
nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0,
in_queue=16314628, util=100.00%
nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0,
in_queue=27659920, util=100.00%
nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0,
in_queue=17910636, util=100.00%
1.6 millions IOPS on Linux MD over 16 NVMe devices
==================================================
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23
17:21:15 2017
read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
clat percentiles (usec):
| 1.00th=[ 78], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 89],
| 30.00th=[ 95], 40.00th=[ 102], 50.00th=[ 105], 60.00th=[ 108],
| 70.00th=[ 118], 80.00th=[ 133], 90.00th=[ 173], 95.00th=[ 221],
| 99.00th=[ 358], 99.50th=[ 506], 99.90th=[ 2192], 99.95th=[ 2608],
| 99.99th=[ 2960]
lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
lat (usec) : 1000=0.07%
lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s
(6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec
Disk stats (read/write):
md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0,
aggrin_queue=1247601, aggrutil=100.00%
nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0,
in_queue=1225896, util=100.00%
nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0,
in_queue=1191452, util=100.00%
nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0,
in_queue=1296728, util=100.00%
nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0,
in_queue=1239808, util=100.00%
nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0,
in_queue=1272916, util=100.00%
nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0,
in_queue=1178360, util=100.00%
nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0,
in_queue=1207808, util=100.00%
nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0,
in_queue=1258956, util=100.00%
nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0,
in_queue=1304536, util=100.00%
nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0,
in_queue=1281952, util=100.00%
nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0,
in_queue=1271820, util=100.00%
nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0,
in_queue=1224192, util=100.00%
nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0,
in_queue=1214240, util=100.00%
nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0,
in_queue=1242372, util=100.00%
nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0,
in_queue=1277600, util=100.00%
nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0,
in_queue=1272988, util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
@ 2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:33 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
> You're just running a huge number of threads against the same md device and
> bottleneck on some internal lock. If you step back and set up, say, 256
Ah, alright. Shit.
> threads with ioengine=libaio, qd=128 (to match the in-flight I/O number),
> you'd likely see the locking impact reduced substantially.
The problem with using libaio and QD>1 is that it doesn't represent the
workload I am optimizing for.
The workload is PostgreSQL, which does all its I/O as regular
reads/writes, hence the use of ioengine=sync with large thread counts.
Note: we have an internal tool that is able to parallelize PostgreSQL
via database sessions.
--
I tried anyway. Here is what I get with engine=libaio (results down below):
A)
QD=128 and jobs=8 (same effective IO concurrency as previously = 1024)
iops=200184
The IOPS stay constant during the run (120s).
B)
QD=128 and jobs=16 (effective concurrency = 2048)
iops=1068.7K
But:
The IOPS slowly climb to over 5 mio, then collapse to around 20k, and
then climb again. Very strange.
C)
QD=128 and jobs=32 (effective concurrency = 4096)
FIO claims: iops=2135.9K
Which is still 3.5x lower than what I get with the sync engine and 2800
threads!
Plus: that strange behavior over run time .. IOPS go up to 10M:
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-29-13-ZEyCVcKZ.1485196199.png
and then collapse to 0 IOPS:
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-30-20-GEEEQR6f.1485196243.png
at which point the NVMes show no load (I am watching them in another
window).
===
libaio is nowhere near what I get with engine=sync and high job counts.
Mmh. Plus the strange behavior.
And as said, that doesn't represent my workload anyway.
I want to stay away from AIO ..
Cheers,
/Tobias
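For reference, the libaio job file used for runs A)-C) below would have looked roughly like this -- a sketch reconstructed from the parameters quoted above (the global options mirror the sync job from the first mail, with the MD device as target, and only numjobs varied between runs):

```ini
[global]
group_reporting
; target is the MD stripe set, per the md1 disk stats below
filename=/dev/md1
size=30G
ioengine=libaio
iodepth=128
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120

[randread]
stonewall
rw=randread
numjobs=8    ; run A); 16 for run B), 32 for run C)
```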
A)
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
...
fio-2.1.11
Starting 8 threads
Jobs: 1 (f=1): [_(2),r(1),_(5)] [38.3% done] [0KB/0KB/0KB /s] [0/0/0
iops] [eta 03m:23s]
randread: (groupid=0, jobs=8): err= 0: pid=1994: Mon Jan 23 19:23:23 2017
read : io=93837MB, bw=800739KB/s, iops=200184, runt=120001msec
slat (usec): min=0, max=4291, avg=39.28, stdev=76.95
clat (usec): min=2, max=22205, avg=5075.21, stdev=3646.18
lat (usec): min=5, max=22333, avg=5114.55, stdev=3674.10
clat percentiles (usec):
| 1.00th=[ 916], 5.00th=[ 1224], 10.00th=[ 1448], 20.00th=[ 1864],
| 30.00th=[ 2320], 40.00th=[ 2960], 50.00th=[ 3920], 60.00th=[ 5024],
| 70.00th=[ 6368], 80.00th=[ 8384], 90.00th=[10944], 95.00th=[12608],
| 99.00th=[14272], 99.50th=[15168], 99.90th=[16768], 99.95th=[17536],
| 99.99th=[18816]
bw (KB /s): min=33088, max=400688, per=12.35%, avg=98898.47,
stdev=76253.23
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.22%, 1000=1.48%
lat (msec) : 2=21.67%, 4=27.51%, 10=35.37%, 20=13.74%, 50=0.01%
cpu : usr=1.53%, sys=13.53%, ctx=7504182, majf=0, minf=1032
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
issued : total=r=24022368/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=93837MB, aggrb=800738KB/s, minb=800738KB/s,
maxb=800738KB/s, mint=120001msec, maxt=120001msec
Disk stats (read/write):
md1: ios=7485313/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=468407/0, aggrmerge=0/0, aggrticks=51834/0, aggrin_queue=51770,
aggrutil=35.00%
nvme15n1: ios=468133/0, merge=0/0, ticks=52628/0, in_queue=52532,
util=34.39%
nvme6n1: ios=468355/0, merge=0/0, ticks=48944/0, in_queue=48840,
util=32.34%
nvme9n1: ios=468561/0, merge=0/0, ticks=53924/0, in_queue=53956,
util=35.00%
nvme11n1: ios=468354/0, merge=0/0, ticks=53424/0, in_queue=53396,
util=34.70%
nvme2n1: ios=468418/0, merge=0/0, ticks=51536/0, in_queue=51496,
util=33.63%
nvme14n1: ios=468669/0, merge=0/0, ticks=51696/0, in_queue=51576,
util=33.84%
nvme5n1: ios=468526/0, merge=0/0, ticks=50004/0, in_queue=49928,
util=33.00%
nvme8n1: ios=468233/0, merge=0/0, ticks=52232/0, in_queue=52140,
util=33.82%
nvme10n1: ios=468501/0, merge=0/0, ticks=52532/0, in_queue=52416,
util=34.29%
nvme1n1: ios=468434/0, merge=0/0, ticks=53492/0, in_queue=53404,
util=34.58%
nvme13n1: ios=468544/0, merge=0/0, ticks=51876/0, in_queue=51860,
util=33.85%
nvme4n1: ios=468513/0, merge=0/0, ticks=51172/0, in_queue=51176,
util=33.30%
nvme7n1: ios=468245/0, merge=0/0, ticks=50564/0, in_queue=50484,
util=33.14%
nvme0n1: ios=468318/0, merge=0/0, ticks=49812/0, in_queue=49760,
util=32.67%
nvme12n1: ios=468279/0, merge=0/0, ticks=52416/0, in_queue=52344,
util=34.17%
nvme3n1: ios=468442/0, merge=0/0, ticks=53092/0, in_queue=53016,
util=34.37%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
B)
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
...
fio-2.1.11
Starting 16 threads
Jobs: 1 (f=1): [_(15),r(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]
[eta 00m:00s]
randread: (groupid=0, jobs=16): err= 0: pid=2141: Mon Jan 23 19:27:38 2017
read : io=500942MB, bw=4174.5MB/s, iops=1068.7K, runt=120001msec
slat (usec): min=0, max=3647, avg=11.07, stdev=37.60
clat (usec): min=2, max=19872, avg=1475.65, stdev=2510.83
lat (usec): min=4, max=19964, avg=1486.76, stdev=2530.31
clat percentiles (usec):
| 1.00th=[ 334], 5.00th=[ 346], 10.00th=[ 358], 20.00th=[ 362],
| 30.00th=[ 370], 40.00th=[ 378], 50.00th=[ 398], 60.00th=[ 494],
| 70.00th=[ 780], 80.00th=[ 1480], 90.00th=[ 4256], 95.00th=[ 8032],
| 99.00th=[12096], 99.50th=[12736], 99.90th=[14272], 99.95th=[14912],
| 99.99th=[16512]
bw (KB /s): min= 0, max=1512848, per=8.04%, avg=343481.50,
stdev=460791.59
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=60.27%, 750=8.95%, 1000=4.94%
lat (msec) : 2=9.33%, 4=5.98%, 10=7.89%, 20=2.63%
cpu : usr=3.19%, sys=44.95%, ctx=9452424, majf=0, minf=2064
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
issued : total=r=128241193/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=500942MB, aggrb=4174.5MB/s, minb=4174.5MB/s,
maxb=4174.5MB/s, mint=120001msec, maxt=120001msec
Disk stats (read/write):
md1: ios=9392258/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=588533/0, aggrmerge=0/0, aggrticks=63464/0, aggrin_queue=63476,
aggrutil=36.40%
nvme15n1: ios=588661/0, merge=0/0, ticks=66932/0, in_queue=66824,
util=36.40%
nvme6n1: ios=589278/0, merge=0/0, ticks=60768/0, in_queue=60600,
util=34.84%
nvme9n1: ios=588744/0, merge=0/0, ticks=64344/0, in_queue=64480,
util=35.85%
nvme11n1: ios=588005/0, merge=0/0, ticks=65636/0, in_queue=65828,
util=36.02%
nvme2n1: ios=588097/0, merge=0/0, ticks=62296/0, in_queue=62440,
util=35.00%
nvme14n1: ios=588451/0, merge=0/0, ticks=64480/0, in_queue=64408,
util=35.87%
nvme5n1: ios=588654/0, merge=0/0, ticks=60736/0, in_queue=60704,
util=34.66%
nvme8n1: ios=588843/0, merge=0/0, ticks=63980/0, in_queue=63928,
util=35.40%
nvme10n1: ios=588315/0, merge=0/0, ticks=62436/0, in_queue=62432,
util=35.15%
nvme1n1: ios=588327/0, merge=0/0, ticks=64432/0, in_queue=64564,
util=36.10%
nvme13n1: ios=588342/0, merge=0/0, ticks=65856/0, in_queue=65892,
util=36.06%
nvme4n1: ios=588343/0, merge=0/0, ticks=64528/0, in_queue=64752,
util=35.73%
nvme7n1: ios=589243/0, merge=0/0, ticks=63740/0, in_queue=63696,
util=35.34%
nvme0n1: ios=588499/0, merge=0/0, ticks=61308/0, in_queue=61268,
util=34.83%
nvme12n1: ios=588221/0, merge=0/0, ticks=62076/0, in_queue=61976,
util=35.19%
nvme3n1: ios=588512/0, merge=0/0, ticks=61880/0, in_queue=61824,
util=35.09%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
C)
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
...
fio-2.1.11
Starting 32 threads
Jobs: 1 (f=0): [_(24),r(1),_(7)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0
iops] [eta 00m:00s]
randread: (groupid=0, jobs=32): err= 0: pid=2263: Mon Jan 23 19:30:49 2017
read : io=977.76GB, bw=8343.4MB/s, iops=2135.9K, runt=120001msec
slat (usec): min=0, max=3372, avg= 7.30, stdev=27.48
clat (usec): min=1, max=21871, avg=997.26, stdev=1995.10
lat (usec): min=4, max=21982, avg=1004.60, stdev=2010.61
clat percentiles (usec):
| 1.00th=[ 374], 5.00th=[ 378], 10.00th=[ 378], 20.00th=[ 386],
| 30.00th=[ 390], 40.00th=[ 394], 50.00th=[ 394], 60.00th=[ 398],
| 70.00th=[ 406], 80.00th=[ 540], 90.00th=[ 1496], 95.00th=[ 5408],
| 99.00th=[10944], 99.50th=[12224], 99.90th=[14016], 99.95th=[14784],
| 99.99th=[16512]
bw (KB /s): min= 0, max=1353208, per=5.91%, avg=505187.96,
stdev=549388.79
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=78.69%, 750=5.80%, 1000=2.94%
lat (msec) : 2=3.84%, 4=2.66%, 10=4.52%, 20=1.56%, 50=0.01%
cpu : usr=3.09%, sys=68.19%, ctx=10916103, majf=0, minf=4128
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
issued : total=r=256309234/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=977.76GB, aggrb=8343.4MB/s, minb=8343.4MB/s,
maxb=8343.4MB/s, mint=120001msec, maxt=120001msec
Disk stats (read/write):
md1: ios=10762806/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=675866/0, aggrmerge=0/0, aggrticks=70332/0, aggrin_queue=70505,
aggrutil=28.65%
nvme15n1: ios=675832/0, merge=0/0, ticks=69604/0, in_queue=69648,
util=27.82%
nvme6n1: ios=676181/0, merge=0/0, ticks=75584/0, in_queue=75552,
util=28.65%
nvme9n1: ios=675762/0, merge=0/0, ticks=67916/0, in_queue=68236,
util=27.79%
nvme11n1: ios=675745/0, merge=0/0, ticks=68296/0, in_queue=68804,
util=27.66%
nvme2n1: ios=676036/0, merge=0/0, ticks=70904/0, in_queue=71240,
util=28.14%
nvme14n1: ios=675737/0, merge=0/0, ticks=71560/0, in_queue=71716,
util=28.13%
nvme5n1: ios=676592/0, merge=0/0, ticks=71832/0, in_queue=71976,
util=28.02%
nvme8n1: ios=675969/0, merge=0/0, ticks=69152/0, in_queue=69192,
util=27.63%
nvme10n1: ios=675607/0, merge=0/0, ticks=67600/0, in_queue=67668,
util=27.74%
nvme1n1: ios=675528/0, merge=0/0, ticks=72856/0, in_queue=73136,
util=28.48%
nvme13n1: ios=675189/0, merge=0/0, ticks=69736/0, in_queue=70084,
util=28.04%
nvme4n1: ios=676117/0, merge=0/0, ticks=68120/0, in_queue=68600,
util=27.88%
nvme7n1: ios=675726/0, merge=0/0, ticks=72004/0, in_queue=71960,
util=28.25%
nvme0n1: ios=676119/0, merge=0/0, ticks=71228/0, in_queue=71264,
util=28.12%
nvme12n1: ios=675837/0, merge=0/0, ticks=70320/0, in_queue=70368,
util=27.99%
nvme3n1: ios=675887/0, merge=0/0, ticks=68600/0, in_queue=68636,
util=27.95%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:18 ` Kudryavtsev, Andrey O
@ 2017-01-23 18:53 ` Tobias Oberstein
2017-01-23 19:06 ` Kudryavtsev, Andrey O
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:53 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio
Hi Andrey,
thanks for your tips!
Am 23.01.2017 um 19:18 schrieb Kudryavtsev, Andrey O:
> Hi Tobias,
> MDRAID overhead is always there, but you can play with some tuning knobs.
> I recommend following:
> 1. You must use many thread/job with quite high QD configuration. Highest IOPS for Intel P3xxx drives achieved if you saturate them with 128 *4k IO per drive. This can be done in 32 jobs and QD4 or 16J/8QD and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So, I think currently the problem, that you’re simply not submitting enough IOs.
I get nearly 7 mio random 4k IOPS with engine=sync and threads=2800 on
the 16 logical NVMe block devices (from 8 physical P3608 4TB).
The values I get with libaio are much lower (see my other reply).
My concrete problem is: I can't get these 7 mio IOPS through MD (striped
over all 16 NVMe logical devices) .. MD hits a wall at 1.6 mio IOPS.
Note: I also tried LVM striped volumes. Sluggish perf., much higher
system load.
> 2. changing a HW SSD sector size to 4k may also help if you’re sure that your workload is always 4k granular
Background: my workload is 100% 8kB and current results are here
https://github.com/oberstet/scratchbox/raw/master/cruncher/sql19/Performance%20Results%20-%20NVMe%20Scaling%20with%20IO%20Concurrency.pdf
The sector size on the NVMes currently is
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo isdct show -a
-intelssd 0 | grep SectorSize
SectorSize : 512
Do you recommend changing that in my case?
> 3. and finally using “imsm” MDRAID extensions and latest MDADM build.
What is imsm?
Is that "Intel Matrix Storage Manager"?
Is that fully open-source and in-tree kernel?
If not, I won't use it anyway, sorry, company policy.
We're running Debian 8 / Kernel 4.8 from backports (and soonish Debian 9).
> See some other hints there:
> http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
>
> some config examples for NVMe are here:
> https://github.com/01org/fiovisualizer/tree/master/Workloads
>
>
What's your platform?
E.g. on Windows, async I/O is awesome. On *nix .. not. At least in my
experience.
And then, my target workload (PostgreSQL) isn't doing AIO at all ..
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:53 ` Tobias Oberstein
@ 2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:06 UTC (permalink / raw)
To: Tobias Oberstein, fio
Hi Tobias,
Yes, “imsm” is in the generic release; you don’t need the latest or a special build if you want to stay compliant. It’s mainly a different layout of the RAID metadata.
Your findings match my expectations: at QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?
Most of the time I’m running CentOS 7, either with 3.10 or the latest kernel, depending on the scope of the testing.
Changing the sector size to 4k is easy and can really help; see the DCT manual, it’s there.
This can be relevant for you: https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
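For reference, IMSM arrays are created through an mdadm container rather than directly. A rough sketch (device names taken from the thread; whether the platform firmware accepts imsm metadata, and which chunk sizes it allows, are assumptions to verify before use):

```shell
# 1) create an IMSM container holding the 16 NVMe member devices
#    (imsm metadata instead of the default 1.2 superblock)
sudo mdadm --create /dev/md/imsm0 --metadata=imsm \
    --raid-devices=16 /dev/nvme{0..15}n1

# 2) create the RAID-0 volume inside that container
sudo mdadm --create /dev/md/vol0 --level=0 --chunk=8 \
    --raid-devices=16 /dev/md/imsm0
```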
--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281
On 1/23/17, 10:53 AM, "Tobias Oberstein" <tobias.oberstein@gmail.com> wrote:
[full quote of the previous message trimmed]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:33 ` Tobias Oberstein
@ 2017-01-23 19:10 ` Kudryavtsev, Andrey O
2017-01-23 19:26 ` Tobias Oberstein
2017-01-23 19:13 ` Sitsofe Wheeler
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2 siblings, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:10 UTC (permalink / raw)
To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio
Tobias,
I’d try 128 jobs, QD 32, and disabling the random map and latency measurements:
randrepeat=0
norandommap
disable_lat
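Folded into a job file, the suggestion would look roughly like this (a sketch; only the options that change relative to the earlier libaio runs are shown):

```ini
ioengine=libaio
iodepth=32
numjobs=128
randrepeat=0
norandommap=1
; skip total-latency bookkeeping; disable_clat/disable_slat also exist
disable_lat=1
```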
--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281
On 1/23/17, 10:33 AM, "fio-owner@vger.kernel.org on behalf of Tobias Oberstein" <fio-owner@vger.kernel.org on behalf of tobias.oberstein@gmail.com> wrote:
[full quote of the previous message trimmed]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
@ 2017-01-23 19:13 ` Sitsofe Wheeler
2017-01-23 19:40 ` Tobias Oberstein
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 19:13 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio
On 23 January 2017 at 18:33, Tobias Oberstein
<tobias.oberstein@gmail.com> wrote:
>
> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
> Plus the strange behavior.
Have you tried batching the I/Os and controlling how many you are
reaping at any one time? See
http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
for some of the options for controlling this...
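As a sketch, those batching knobs go into the job file like this (values are illustrative, not a tested recommendation, and the _min/_max variants require a reasonably recent fio):

```ini
iodepth=128
; submit up to 16 I/Os per io_submit() call instead of one at a time
iodepth_batch_submit=16
; reap between 16 and 32 completions per reap attempt
iodepth_batch_complete_min=16
iodepth_batch_complete_max=32
```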
--
Sitsofe | http://sucs.org/~sits/
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:10 ` Kudryavtsev, Andrey O
@ 2017-01-23 19:26 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:26 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, Andrey Kuzmin; +Cc: fio
Hi Andrey,
Am 23.01.2017 um 20:10 schrieb Kudryavtsev, Andrey O:
> Tobias,
> I’d try 128 jobs, QD 32 and disable random map and latency measurements
> randrepeat=0
> norandommap
I had those already set ..
> disable_ lat
>
This I hadn't set.
Using the settings you suggest on the MD over 16 NVMes, and after
increasing to
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat
/proc/sys/fs/aio-max-nr
1048576
I get iops=4082.2K, which is much closer to the 7 million IOPS I get with
engine=sync and jobs=2800.
Cheers,
/Tobias
PS: I am still working on your other hints .. so many tips. Thanks guys!
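As a sanity check (my own arithmetic, not from the thread): each libaio job can hold up to iodepth in-flight IOs, so the run needs roughly numjobs * iodepth kernel AIO contexts, well under the raised limit:

```shell
# Contexts needed by the 128-job / QD-32 libaio run:
numjobs=128
iodepth=32
needed=$((numjobs * iodepth))
echo "$needed"          # 4096, far below the raised limit
# Raising the system-wide limit (as done above):
# sudo sh -c 'echo 1048576 > /proc/sys/fs/aio-max-nr'
```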
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=32
...
fio-2.1.11
Starting 128 threads
Jobs: 127 (f=0): [r(51),E(1),r(76)] [3.5% done] [15018MB/0KB/0KB /s]
[3845K/0/0 iops] [eta 14m:11s]
randread: (groupid=0, jobs=128): err= 0: pid=5878: Mon Jan 23 20:25:01 2017
read : io=478427MB, bw=15946MB/s, iops=4082.2K, runt= 30003msec
slat (usec): min=1, max=47954, avg=29.39, stdev=34.90
clat (usec): min=37, max=49119, avg=972.35, stdev=673.40
clat percentiles (usec):
| 1.00th=[ 338], 5.00th=[ 446], 10.00th=[ 532], 20.00th=[ 660],
| 30.00th=[ 756], 40.00th=[ 836], 50.00th=[ 892], 60.00th=[ 956],
| 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1224], 95.00th=[ 1368],
| 99.00th=[ 4832], 99.50th=[ 5664], 99.90th=[ 6816], 99.95th=[ 7328],
| 99.99th=[ 8896]
bw (KB /s): min=14024, max=393664, per=0.78%, avg=127573.83,
stdev=51679.15
lat (usec) : 50=0.01%, 100=0.01%, 250=0.07%, 500=8.15%, 750=21.53%
lat (usec) : 1000=37.36%
lat (msec) : 2=29.83%, 4=1.53%, 10=1.53%, 20=0.01%, 50=0.01%
cpu : usr=5.34%, sys=94.48%, ctx=11411, majf=0, minf=4224
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=122477269/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: io=478427MB, aggrb=15946MB/s, minb=15946MB/s, maxb=15946MB/s,
mint=30003msec, maxt=30003msec
Disk stats (read/write):
md1: ios=121675684/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=7654829/0, aggrmerge=0/0, aggrticks=985171/0,
aggrin_queue=1037857, aggrutil=100.00%
nvme15n1: ios=7650998/0, merge=0/0, ticks=938492/0, in_queue=968336,
util=100.00%
nvme6n1: ios=7655891/0, merge=0/0, ticks=1044320/0, in_queue=1074048,
util=100.00%
nvme9n1: ios=7654289/0, merge=0/0, ticks=954912/0, in_queue=1043060,
util=100.00%
nvme11n1: ios=7656494/0, merge=0/0, ticks=955896/0, in_queue=1050748,
util=100.00%
nvme2n1: ios=7656190/0, merge=0/0, ticks=998112/0, in_queue=1090236,
util=100.00%
nvme14n1: ios=7655685/0, merge=0/0, ticks=956648/0, in_queue=982168,
util=100.00%
nvme5n1: ios=7652531/0, merge=0/0, ticks=1040592/0, in_queue=1068920,
util=100.00%
nvme8n1: ios=7652934/0, merge=0/0, ticks=969800/0, in_queue=994468,
util=100.00%
nvme10n1: ios=7655795/0, merge=0/0, ticks=949068/0, in_queue=975252,
util=100.00%
nvme1n1: ios=7652373/0, merge=0/0, ticks=955772/0, in_queue=1040828,
util=100.00%
nvme13n1: ios=7654611/0, merge=0/0, ticks=965664/0, in_queue=1053560,
util=100.00%
nvme4n1: ios=7655941/0, merge=0/0, ticks=1001460/0, in_queue=1113764,
util=100.00%
nvme7n1: ios=7652420/0, merge=0/0, ticks=991072/0, in_queue=1018248,
util=100.00%
nvme0n1: ios=7656124/0, merge=0/0, ticks=1051448/0, in_queue=1083992,
util=100.00%
nvme12n1: ios=7656450/0, merge=0/0, ticks=1031252/0,
in_queue=1064052, util=100.00%
nvme3n1: ios=7658543/0, merge=0/0, ticks=958228/0, in_queue=984040,
util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat
postgresql_storage_workload.fio
[global]
group_reporting
#filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
filename=/dev/md1
#filename=/data/test.dat
#filename=/dev/data/data
size=30G
#ioengine=sync
#iodepth=1
ioengine=libaio
iodepth=32
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
#bs=8k
bs=4k
#ramp_time=0
runtime=30
[randread]
stonewall
rw=randread
numjobs=128
#[randwrite]
#stonewall
#rw=randwrite
#numjobs=32
#[randreadwrite7030]
#stonewall
#rw=randrw
#rwmixread=70
#numjobs=256
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:13 ` Sitsofe Wheeler
@ 2017-01-23 19:40 ` Tobias Oberstein
2017-01-23 20:24 ` Sitsofe Wheeler
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:40 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio
Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
> On 23 January 2017 at 18:33, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>>
>> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
>> Plus the strange behavior.
>
> Have you tried batching the IOs and controlling how many you are
> reaping at any one time? See
> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
> for some of the options for controlling this...
>
Thanks! Nice.
For libaio, and with all the hints applied (no 4k sectors yet), I get
(4k randread)
Individual NVMes: iops=7350.4K
MD (RAID-0) over NVMes: iops=4112.8K
The going up and down of IOPS is gone.
It's becoming more apparent, I'd say, that there is an MD bottleneck though.
Cheers,
/Tobias
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat best_libaio.fio
# sudo sh -c 'echo "1048576" > /proc/sys/fs/aio-max-nr'
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=32
iodepth_batch_submit=8
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio
best_libaio.fio
randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=32
...
randread-md-over-nvmes: (g=1): rw=randread, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=32
...
fio-2.1.11
Starting 256 threads
Jobs: 128 (f=128): [_(128),r(128)] [7.9% done] [16173MB/0KB/0KB /s]
[4140K/0/0 iops] [eta 11m:51s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=6988: Mon
Jan 23 20:37:30 2017
read : io=861513MB, bw=28712MB/s, iops=7350.4K, runt= 30005msec
slat (usec): min=1, max=179194, avg= 9.61, stdev=166.67
clat (usec): min=8, max=174722, avg=543.86, stdev=736.75
clat percentiles (usec):
| 1.00th=[ 117], 5.00th=[ 139], 10.00th=[ 153], 20.00th=[ 175],
| 30.00th=[ 199], 40.00th=[ 223], 50.00th=[ 258], 60.00th=[ 302],
| 70.00th=[ 394], 80.00th=[ 636], 90.00th=[ 1480], 95.00th=[ 2192],
| 99.00th=[ 3408], 99.50th=[ 3856], 99.90th=[ 4960], 99.95th=[ 5536],
| 99.99th=[10048]
bw (KB /s): min=14992, max=432176, per=0.78%, avg=229721.98,
stdev=44902.57
lat (usec) : 10=0.01%, 50=0.01%, 100=0.10%, 250=48.21%, 500=27.38%
lat (usec) : 750=6.48%, 1000=3.18%
lat (msec) : 2=8.54%, 4=5.73%, 10=0.38%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=8.25%, sys=64.76%, ctx=57533651, majf=0, minf=4224
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=220547266/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=7138: Mon Jan
23 20:37:30 2017
read : io=482013MB, bw=16065MB/s, iops=4112.8K, runt= 30003msec
slat (usec): min=1, max=48048, avg=29.39, stdev=36.10
clat (usec): min=47, max=74459, avg=964.89, stdev=637.97
clat percentiles (usec):
| 1.00th=[ 454], 5.00th=[ 540], 10.00th=[ 604], 20.00th=[ 692],
| 30.00th=[ 764], 40.00th=[ 828], 50.00th=[ 876], 60.00th=[ 924],
| 70.00th=[ 980], 80.00th=[ 1064], 90.00th=[ 1176], 95.00th=[ 1320],
| 99.00th=[ 4768], 99.50th=[ 5536], 99.90th=[ 6432], 99.95th=[ 6752],
| 99.99th=[ 7968]
bw (KB /s): min=14512, max=350248, per=0.78%, avg=128572.72,
stdev=42938.35
lat (usec) : 50=0.01%, 100=0.01%, 250=0.03%, 500=2.69%, 750=24.84%
lat (usec) : 1000=45.08%
lat (msec) : 2=24.43%, 4=1.40%, 10=1.51%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=4.98%, sys=94.81%, ctx=12736, majf=0, minf=3328
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
issued : total=r=123395206/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: io=861513MB, aggrb=28712MB/s, minb=28712MB/s, maxb=28712MB/s,
mint=30005msec, maxt=30005msec
Run status group 1 (all jobs):
READ: io=482013MB, aggrb=16065MB/s, minb=16065MB/s, maxb=16065MB/s,
mint=30003msec, maxt=30003msec
Disk stats (read/write):
nvme0n1: ios=13713322/0, merge=0/0, ticks=2809744/0,
in_queue=2867236, util=98.51%
nvme1n1: ios=13713230/0, merge=0/0, ticks=11534416/0,
in_queue=12284600, util=99.60%
nvme2n1: ios=13713491/0, merge=0/0, ticks=9773908/0,
in_queue=10359404, util=99.80%
nvme3n1: ios=13713296/0, merge=0/0, ticks=6619552/0,
in_queue=6803384, util=99.49%
nvme4n1: ios=13713658/0, merge=0/0, ticks=6055532/0,
in_queue=6533236, util=100.00%
nvme5n1: ios=13713740/0, merge=0/0, ticks=2863528/0,
in_queue=2931544, util=99.89%
nvme6n1: ios=13713827/0, merge=0/0, ticks=2796528/0,
in_queue=2859208, util=99.72%
nvme7n1: ios=13713905/0, merge=0/0, ticks=2846160/0,
in_queue=2904800, util=99.74%
nvme8n1: ios=13713529/0, merge=0/0, ticks=7422588/0,
in_queue=7582496, util=100.00%
nvme9n1: ios=13713414/0, merge=0/0, ticks=13762972/0,
in_queue=14664088, util=100.00%
nvme10n1: ios=13714158/0, merge=0/0, ticks=6570356/0,
in_queue=6735324, util=100.00%
nvme11n1: ios=13714217/0, merge=0/0, ticks=4189764/0,
in_queue=4519824, util=100.00%
nvme12n1: ios=13714299/0, merge=0/0, ticks=7225476/0,
in_queue=7393668, util=100.00%
nvme13n1: ios=13714375/0, merge=0/0, ticks=4988804/0,
in_queue=5267536, util=100.00%
nvme14n1: ios=13714461/0, merge=0/0, ticks=7336928/0,
in_queue=7502260, util=100.00%
nvme15n1: ios=13713918/0, merge=0/0, ticks=11861500/0,
in_queue=12202492, util=100.00%
md1: ios=123098498/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
@ 2017-01-23 20:10 ` Tobias Oberstein
[not found] ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
[not found] ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:10 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
Hi Andrey,
Thanks again for your tips .. the psync thingy in particular. I need to
verify whether that applies to PostgreSQL, because it brings huge gains
compared to sync!
Here is the summary of my latest numbers:
1) engine=libaio
Individual NVMes:
iops=7350.4K
usr=8.25%, sys=64.76%, ctx=57533651
MD (RAID-0) over NVMes:
iops=4112.8K
usr=4.98%, sys=94.81%, ctx=12736
=> MD reaches 55% of perf compared to non-MD.
2) engine=sync
Individual NVMes:
IOPS=6657k
usr=0.56%, sys=4.43%, ctx=200588483
MD (RAID-0) over NVMes:
IOPS=1467k
usr=0.07%, sys=4.13%, ctx=46545978
=> MD reaches 22% of perf compared to non-MD.
3) engine=psync
Individual NVMes:
IOPS=7086k
usr=0.60%, sys=4.43%, ctx=214720330
MD (RAID-0) over NVMes:
IOPS=4154k
usr=0.46%, sys=5.81%, ctx=124737165
=> MD reaches 58% of perf compared to non-MD.
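For the record, the three ratios can be reproduced from the IOPS figures quoted above with a quick awk one-liner; rounding to the nearest percent gives 56/22/59 (the summary truncates to 55/58):

```shell
# MD IOPS as a percentage of raw-device IOPS, per engine:
awk 'BEGIN {
  printf "libaio: %.0f%%\n", 100 * 4112.8 / 7350.4;  # 56%
  printf "sync:   %.0f%%\n", 100 * 1467 / 6657;      # 22%
  printf "psync:  %.0f%%\n", 100 * 4154 / 7086;      # 59%
}'
```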
==================
Are the CPU load numbers reported by FIO reliable?
I mean, compare the load between libaio and sync/psync!
Cheers,
/Tobias
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat
best_sync_individual_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat
best_sync_md_over_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio best_sync_individual_nvmes.fio
randread-individual-nvmes: (g=0): rw=randread,
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2747 (f=28032):
[f(9),_(1),f(27),_(3),f(20),_(1),f(2),_(1),f(57),_(1),f(250),_(1),f(108),_(1),f(48),_(1),f(26),_(1),f(14),_(2),f(444),_(1),f(36),_(1),f(193),_(1),f(100),_(1),f(26),_(1),f(40),_(1),f(1),_(1),f(19),_(2),f(36),_(1),f(77),_(1),f(20),_(1),f(37),_(1),f(6),_(1),f(8),_(1),f(45),_(1),f(3),_(1),f(10),_(1),f(38),_(1),f(7),_(1),f(16),_(1),f(10),_(1),f(3),_(1),f(3),_(2),f(11),_(1),f(26),_(1),f(39),_(1),f(5),_(1),f(15),_(1),f(90),_(1),f(80),_(1),f(87),_(1),f(67),_(1),f(91),_(1),f(9),_(1),f(35),E(1),f(166),_(1),f(78),_(1),f(152),_(1),f(57)][100.0%][r=18.7GiB/s,w=0KiB/s][r=4885k,w=0
IOPS][eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=8021: Mon
Jan 23 20:51:43 2017
read: IOPS=6657k, BW=25.5GiB/s (27.3GB/s)(762GiB/30012msec)
clat (usec): min=31, max=35890, avg=403.07, stdev=587.78
clat percentiles (usec):
| 1.00th=[ 112], 5.00th=[ 131], 10.00th=[ 145], 20.00th=[ 167],
| 30.00th=[ 187], 40.00th=[ 211], 50.00th=[ 237], 60.00th=[ 270],
| 70.00th=[ 318], 80.00th=[ 406], 90.00th=[ 676], 95.00th=[ 1336],
| 99.00th=[ 3280], 99.50th=[ 4016], 99.90th=[ 5536], 99.95th=[ 6304],
| 99.99th=[ 9536]
lat (usec) : 50=0.01%, 100=0.18%, 250=54.00%, 500=31.18%, 750=5.73%
lat (usec) : 1000=2.24%
lat (msec) : 2=3.63%, 4=2.52%, 10=0.50%, 20=0.01%, 50=0.01%
cpu : usr=0.56%, sys=4.43%, ctx=200588483, majf=0, minf=2797
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=199803621,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=25.5GiB/s (27.3GB/s), 25.5GiB/s-25.5GiB/s
(27.3GB/s-27.3GB/s), io=762GiB (818GB), run=30012-30012msec
Disk stats (read/write):
nvme0n1: ios=12474932/0, merge=0/0, ticks=3440096/0,
in_queue=3545768, util=97.54%
nvme1n1: ios=12488816/0, merge=0/0, ticks=6811092/0,
in_queue=7420304, util=97.96%
nvme2n1: ios=12488737/0, merge=0/0, ticks=4947416/0,
in_queue=5379024, util=97.12%
nvme3n1: ios=12488626/0, merge=0/0, ticks=4578888/0,
in_queue=4696164, util=96.85%
nvme4n1: ios=12488514/0, merge=0/0, ticks=3848360/0,
in_queue=4189952, util=97.85%
nvme5n1: ios=12488384/0, merge=0/0, ticks=2872728/0,
in_queue=2946696, util=96.89%
nvme6n1: ios=12488271/0, merge=0/0, ticks=2480536/0,
in_queue=2544704, util=96.92%
nvme7n1: ios=12488165/0, merge=0/0, ticks=4038500/0,
in_queue=4154768, util=96.91%
nvme8n1: ios=12488052/0, merge=0/0, ticks=4553428/0,
in_queue=4675568, util=97.22%
nvme9n1: ios=12487937/0, merge=0/0, ticks=5487888/0,
in_queue=5956252, util=97.72%
nvme10n1: ios=12486833/0, merge=0/0, ticks=6234216/0,
in_queue=6402356, util=97.54%
nvme11n1: ios=12486699/0, merge=0/0, ticks=4646856/0,
in_queue=5042628, util=97.76%
nvme12n1: ios=12486586/0, merge=0/0, ticks=5331000/0,
in_queue=5478728, util=97.59%
nvme13n1: ios=12486467/0, merge=0/0, ticks=3464404/0,
in_queue=3715416, util=98.27%
nvme14n1: ios=12486358/0, merge=0/0, ticks=2576312/0,
in_queue=2641952, util=97.49%
nvme15n1: ios=12486251/0, merge=0/0, ticks=4135908/0,
in_queue=4270008, util=97.69%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio best_sync_md_over_nvmes.fio
randread-md-over-nvmes: (g=0): rw=randread,
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2800 (f=2800): [r(2800)][100.0%][r=5764MiB/s,w=0KiB/s][r=1476k,w=0
IOPS][eta 00m:00s]
randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=11137: Mon
Jan 23 20:52:30 2017
read: IOPS=1467k, BW=5732MiB/s (6011MB/s)(169GiB/30116msec)
clat (usec): min=27, max=33113, avg=124.27, stdev=112.85
clat percentiles (usec):
| 1.00th=[ 77], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 88],
| 30.00th=[ 93], 40.00th=[ 101], 50.00th=[ 104], 60.00th=[ 107],
| 70.00th=[ 115], 80.00th=[ 133], 90.00th=[ 177], 95.00th=[ 227],
| 99.00th=[ 370], 99.50th=[ 506], 99.90th=[ 2096], 99.95th=[ 2544],
| 99.99th=[ 2960]
lat (usec) : 50=0.04%, 100=36.72%, 250=60.00%, 500=2.73%, 750=0.22%
lat (usec) : 1000=0.07%
lat (msec) : 2=0.12%, 4=0.11%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.07%, sys=4.13%, ctx=46545978, majf=0, minf=2797
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=44193488,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=5732MiB/s (6011MB/s), 5732MiB/s-5732MiB/s
(6011MB/s-6011MB/s), io=169GiB (181GB), run=30116-30116msec
Disk stats (read/write):
md1: ios=44010950/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=2762093/0, aggrmerge=0/0, aggrticks=280663/0,
aggrin_queue=284837, aggrutil=99.12%
nvme15n1: ios=2766734/0, merge=0/0, ticks=264808/0, in_queue=267732,
util=98.68%
nvme6n1: ios=2761142/0, merge=0/0, ticks=288704/0, in_queue=291288,
util=98.76%
nvme9n1: ios=2759118/0, merge=0/0, ticks=275752/0, in_queue=282288,
util=98.95%
nvme11n1: ios=2762423/0, merge=0/0, ticks=264996/0, in_queue=271464,
util=98.91%
nvme2n1: ios=2764361/0, merge=0/0, ticks=281520/0, in_queue=288924,
util=99.12%
nvme14n1: ios=2760515/0, merge=0/0, ticks=264796/0, in_queue=266752,
util=98.61%
nvme5n1: ios=2761756/0, merge=0/0, ticks=280020/0, in_queue=282840,
util=98.92%
nvme8n1: ios=2763138/0, merge=0/0, ticks=279332/0, in_queue=280624,
util=98.53%
nvme10n1: ios=2764117/0, merge=0/0, ticks=291264/0, in_queue=293444,
util=98.67%
nvme1n1: ios=2761579/0, merge=0/0, ticks=275872/0, in_queue=282080,
util=98.90%
nvme13n1: ios=2759948/0, merge=0/0, ticks=280080/0, in_queue=286324,
util=99.05%
nvme4n1: ios=2763271/0, merge=0/0, ticks=279592/0, in_queue=287944,
util=98.96%
nvme7n1: ios=2759669/0, merge=0/0, ticks=280708/0, in_queue=284056,
util=98.88%
nvme0n1: ios=2761263/0, merge=0/0, ticks=296868/0, in_queue=300408,
util=98.78%
nvme12n1: ios=2763077/0, merge=0/0, ticks=288264/0, in_queue=290264,
util=98.71%
nvme3n1: ios=2761377/0, merge=0/0, ticks=298040/0, in_queue=300960,
util=98.74%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
=================
Changing engine to psync, leaving everything else unchanged:
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio best_sync_individual_nvmes.fio
randread-individual-nvmes: (g=0): rw=randread,
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2771 (f=40464):
[f(8),_(1),f(14),_(1),f(30),_(1),f(6),_(1),f(4),_(1),f(7),_(1),f(14),_(1),f(6),_(1),f(62),_(1),f(3),_(1),f(167),_(1),f(309),_(1),f(269),_(1),f(47),_(1),f(206),_(1),f(26),_(1),f(56),_(2),f(4),_(1),f(39),_(1),f(148),_(1),f(148),_(1),f(4),_(1),f(63),_(1),f(27),_(1),f(19),_(1),f(314),_(1),f(189),_(1),f(205),_(1),f(377)][100.0%][r=25.7GiB/s,w=0KiB/s][r=6726k,w=0
IOPS][eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=14753:
Mon Jan 23 20:58:45 2017
read: IOPS=7086k, BW=27.4GiB/s (29.3GB/s)(811GiB/30010msec)
clat (usec): min=34, max=57916, avg=381.14, stdev=524.36
clat percentiles (usec):
| 1.00th=[ 121], 5.00th=[ 145], 10.00th=[ 159], 20.00th=[ 185],
| 30.00th=[ 207], 40.00th=[ 229], 50.00th=[ 255], 60.00th=[ 286],
| 70.00th=[ 326], 80.00th=[ 394], 90.00th=[ 564], 95.00th=[ 988],
| 99.00th=[ 2928], 99.50th=[ 3632], 99.90th=[ 5344], 99.95th=[ 6688],
| 99.99th=[11200]
lat (usec) : 50=0.01%, 100=0.08%, 250=48.03%, 500=39.59%, 750=5.69%
lat (usec) : 1000=1.66%
lat (msec) : 2=2.69%, 4=1.91%, 10=0.32%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=0.60%, sys=4.43%, ctx=214720330, majf=0, minf=2797
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=212658246,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=27.4GiB/s (29.3GB/s), 27.4GiB/s-27.4GiB/s
(29.3GB/s-29.3GB/s), io=811GiB (871GB), run=30010-30010msec
Disk stats (read/write):
nvme0n1: ios=13204662/0, merge=0/0, ticks=5579056/0,
in_queue=5713604, util=97.16%
nvme1n1: ios=13292212/0, merge=0/0, ticks=3336164/0,
in_queue=3661216, util=97.52%
nvme2n1: ios=13292063/0, merge=0/0, ticks=3097888/0,
in_queue=3359552, util=97.09%
nvme3n1: ios=13291900/0, merge=0/0, ticks=2973176/0,
in_queue=3072764, util=96.31%
nvme4n1: ios=13291734/0, merge=0/0, ticks=4962684/0,
in_queue=5434620, util=97.02%
nvme5n1: ios=13291540/0, merge=0/0, ticks=7857284/0,
in_queue=8108332, util=96.75%
nvme6n1: ios=13291403/0, merge=0/0, ticks=3160292/0,
in_queue=3249508, util=96.46%
nvme7n1: ios=13291270/0, merge=0/0, ticks=5593256/0,
in_queue=5748080, util=96.42%
nvme8n1: ios=13291057/0, merge=0/0, ticks=3345216/0,
in_queue=3450892, util=96.81%
nvme9n1: ios=13290897/0, merge=0/0, ticks=3102344/0,
in_queue=3394168, util=97.38%
nvme10n1: ios=13290753/0, merge=0/0, ticks=3050116/0,
in_queue=3129208, util=96.74%
nvme11n1: ios=13290570/0, merge=0/0, ticks=6353996/0,
in_queue=6956272, util=97.59%
nvme12n1: ios=13290405/0, merge=0/0, ticks=3268144/0,
in_queue=3372100, util=97.04%
nvme13n1: ios=13290255/0, merge=0/0, ticks=3037220/0,
in_queue=3297944, util=97.78%
nvme14n1: ios=13290110/0, merge=0/0, ticks=8279264/0,
in_queue=8503324, util=97.47%
nvme15n1: ios=13289722/0, merge=0/0, ticks=3361284/0,
in_queue=3467660, util=97.22%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
/opt/fio/bin/fio best_sync_md_over_nvmes.fio
randread-md-over-nvmes: (g=0): rw=randread,
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2367 (f=2342):
[_(1),r(2),_(1),r(38),_(10),r(1),_(1),r(2),_(2),r(2),_(11),r(1),_(1),r(5),_(1),E(1),r(2),f(2),E(1),r(1),f(3),r(19),f(1),r(87),_(1),r(234),_(1),r(13),_(1),r(29),f(1),_(1),r(17),E(1),r(9),E(1),r(9),E(1),r(3),E(1),r(6),_(1),r(16),E(1),r(2),_(1),r(8),E(1),r(30),_(1),r(15),E(1),r(11),f(1),r(27),f(1),r(11),E(1),r(13),_(1),r(27),E(1),r(31),E(1),r(32),E(1),r(6),_(1),r(26),E(1),r(18),E(1),r(5),_(1),E(1),r(16),f(1),r(1),f(1),r(3),f(3),r(3),f(2),r(1),f(3),r(1),f(1),r(1),f(1),r(1),f(4),r(3),f(5),r(1),f(12),E(1),r(2),f(3),r(2),f(1),_(1),f(8),r(1),f(9),r(1),f(1),r(1),f(2),r(1),f(4),r(1),f(7),r(2),f(5),r(1),f(2),r(1),f(2),r(1),f(2),_(1),f(1),r(1),f(2),r(1),f(2),r(1),f(5),r(1),f(1),r(2),f(1),r(4),f(1),r(1),f(5),r(1),f(1),r(2),f(1),r(1),E(1),r(1),f(3),r(2),f(5),r(1),f(1),r(2),f(1),r(1),f(1),r(2),_(1),f(9),E(1),f(3),_(2),f(11),_(1),f(3),_(1),f(4),_(2),f(1),_(1),f(7),_(1),f(3),_(2),f(7),_(1),f(4),_(1),f(4),_(1),f(5),_(1),f(3),_(1),f(12),_(1),f(12),_(1),f(4),_(1),f(2),_(1),f(7),_(1),f(1),_(1),f(15),_(2),f(1),_(1),f(2),_(1),f(10),_(1),f(2),_(1),f(12),_(1),f(10),_(1),f(5),_(1),f(6),_(2),f(6),_(1),f(2),_(1),f(13),_(1),f(6),_(1),f(21),_(1),f(2),_(1),f(1),_(2),f(1),_(1),f(26),_(1),f(1),_(1),f(1),E(1),f(6),_(1),f(3),_(1),f(2),_(1),f(2),_(1),f(3),_(1),f(10),_(1),f(8),_(1),f(11),_(1),f(7),_(1),f(2),_(1),f(4),_(1),f(5),_(1),f(4),_(1),f(8),_(1),f(6),_(1),f(5),_(1),f(9),_(2),f(3),_(1),f(1),_(1),f(13),_(1),f(3),_(1),f(2),_(1),f(1),_(1),f(5),_(1),f(14),_(1),f(4),_(1),f(5),_(1),f(12),_(1),f(1),_(2),f(1),_(1),f(3),_(1),f(2),_(3),f(2),_(1),f(3),_(1),f(5),_(1),f(7),_(3),f(19),_(1),f(4),_(1),f(6),_(1),f(9),_(1),f(9),_(2),f(2),_(2),f(22),_(1),f(69),_(1),f(17),_(1),f(26),_(1),f(1),_(1),f(5),_(1),f(3),_(1),f(9),_(1),f(19),_(1),f(11),_(2),f(7),_(1),f(21),_(1),f(3),_(1),f(6),_(1),f(10),_(1),f(2),_(1),f(26),_(1),f(7),_(1),f(1),_(2),f(2),_(1),f(8),_(1),f(20),_(1),f(15),_(2),f(2),_(1),f(11),_(1),f(8),_(1),f(14),_(1),f(10),_(1),f(6),_(1),f(2),_(1),f(25),_(1),f(2),_(1),f(1),_(1),f(4),_(1),f(42),_(1),f(5),_(2),f(1
4),_(2),f(2),_(2),f(7),_(1),f(2),_(1),f(2),_(2),f(12),_(1),f(15),_(1),f(2),_(1),f(1),_(1),f(2),_(1),f(4),_(1),f(6),_(1),f(8),_(4),f(2),_(3),f(4),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(18),_(2),f(1),_(1),f(1),_(2),f(11),_(1),f(20),_(1),f(7),_(1),f(4),_(1),f(6),_(1),f(4),_(1),f(11),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(2),_(1),f(2),_(1),f(4),_(2),f(3),_(1),f(4),_(1),E(1),_(1),f(1),_(1),f(1),_(1),E(1),_(3),f(2),_(5),f(1),_(1),E(1),f(1),_(1),f(2),_(1),f(5),_(2),f(2),_(1),E(1),f(2),_(1),f(3),E(1),f(1),_(2),f(10),_(1),f(1),_(4),f(1),_(1),f(2),_(2),f(3),_(1),f(2),_(3),f(1),_(3),f(1),_(2),f(2),E(1),f(2),_(1),f(1),_(3),f(1),_(1),f(2),E(1),f(9),_(1),f(1),E(1),f(1),_(1),f(1),_(1),f(1),E(1),f(1),E(1),_(1),f(3),E(1),f(1),_(2),f(1),_(1),E(1),f(1),_(2),f(3),_(1),f(1),_(1),f(3),_(1),f(2),_(2),f(2),_(1),f(2),_(3),f(2),_(2),f(8),_(1),f(1),_(2),f(1),_(1),f(3),_(2),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(3),_(1),f(5),_(2),f(6),_(2),f(1),_(1),f(9),_(1),f(3),_(1),f(7),_(1),f(1),_(2),f(1),_(1),f(2),_(1),f(5),_(2),f(4),_(1),f(1),_(2),f(3),_(3),f(12),_(1),f(2),_(3),f(3),_(1),f(3),_(1),f(1),_(2),f(3),_(1),f(2),_(1),f(3),_(1),f(3),_(2),f(1),_(1),f(2),_(2),f(9),E(1),f(1),E(1),f(5),_(1),E(1),f(7),_(1),f(1),_(1),f(4),_(2),f(2),_(1),f(3),_(3),f(14),_(1),f(10),_(1),f(1),_(1),f(1),_(1),E(1),f(2),E(1),f(1),_(1),f(1),_(3),f(6),_(1),f(4),E(1),f(4),_(4),f(3),_(1),f(1),_(3),f(1),_(1),f(1),E(1),f(2),_(1),f(2),_(1),f(2),_(1),f(1),E(1),_(1),E(1),f(1),_(2),f(1),_(1),f(2),_(1),f(2),_(9),f(1),_(3),f(3),_(1),f(1),_(1),f(3),_(2),f(3),_(2),f(2),_(1),f(2),_(1),f(1),_(2),f(1),_(2),f(2)][0.5%][r=15.2GiB/s,w=0KiB/s][r=3960k,w=0
IOPS][eta 01h:38m:47s]
randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=17756: Mon
Jan 23 20:59:22 2017
read: IOPS=4154k, BW=15.9GiB/s (17.2GB/s)(476GiB/30015msec)
clat (usec): min=38, max=264790, avg=669.08, stdev=954.35
clat percentiles (usec):
| 1.00th=[ 149], 5.00th=[ 207], 10.00th=[ 262], 20.00th=[ 342],
| 30.00th=[ 410], 40.00th=[ 470], 50.00th=[ 532], 60.00th=[ 604],
| 70.00th=[ 684], 80.00th=[ 788], 90.00th=[ 956], 95.00th=[ 1160],
| 99.00th=[ 4512], 99.50th=[ 7392], 99.90th=[12480], 99.95th=[14400],
| 99.99th=[19072]
lat (usec) : 50=0.01%, 100=0.04%, 250=8.86%, 500=35.57%, 750=32.34%
lat (usec) : 1000=14.64%
lat (msec) : 2=6.53%, 4=0.91%, 10=0.89%, 20=0.22%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%
cpu : usr=0.46%, sys=5.81%, ctx=124737165, majf=0, minf=2797
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwt: total=124675330,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=15.9GiB/s (17.2GB/s), 15.9GiB/s-15.9GiB/s
(17.2GB/s-17.2GB/s), io=476GiB (511GB), run=30015-30015msec
Disk stats (read/write):
md1: ios=124675330/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=7792208/0, aggrmerge=0/0, aggrticks=1051705/0,
aggrin_queue=1120720, aggrutil=100.00%
nvme15n1: ios=7790429/0, merge=0/0, ticks=1048276/0,
in_queue=1090348, util=100.00%
nvme6n1: ios=7792474/0, merge=0/0, ticks=999284/0, in_queue=1035092,
util=100.00%
nvme9n1: ios=7792704/0, merge=0/0, ticks=1033208/0, in_queue=1151824,
util=100.00%
nvme11n1: ios=7792344/0, merge=0/0, ticks=1103896/0,
in_queue=1231748, util=100.00%
nvme2n1: ios=7791972/0, merge=0/0, ticks=1001928/0, in_queue=1121472,
util=100.00%
nvme14n1: ios=7795323/0, merge=0/0, ticks=1154676/0,
in_queue=1190940, util=100.00%
nvme5n1: ios=7784969/0, merge=0/0, ticks=1048052/0, in_queue=1081964,
util=100.00%
nvme8n1: ios=7792042/0, merge=0/0, ticks=1080976/0, in_queue=1112776,
util=100.00%
nvme10n1: ios=7786642/0, merge=0/0, ticks=1018484/0,
in_queue=1054712, util=100.00%
nvme1n1: ios=7793892/0, merge=0/0, ticks=1072588/0, in_queue=1194612,
util=100.00%
nvme13n1: ios=7792651/0, merge=0/0, ticks=1040368/0,
in_queue=1157356, util=100.00%
nvme4n1: ios=7794567/0, merge=0/0, ticks=1065096/0, in_queue=1198308,
util=100.00%
nvme7n1: ios=7794169/0, merge=0/0, ticks=1061900/0, in_queue=1104168,
util=100.00%
nvme0n1: ios=7794534/0, merge=0/0, ticks=1039064/0, in_queue=1071864,
util=100.00%
nvme12n1: ios=7796809/0, merge=0/0, ticks=1044664/0,
in_queue=1081852, util=100.00%
nvme3n1: ios=7789809/0, merge=0/0, ticks=1014828/0, in_queue=1052484,
util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
@ 2017-01-23 20:20 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:20 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
> Are the CPU load numbers reported by FIO reliable?
>
>
> Yes, they're quite solid, just keep in mind that cpu is being reported on a
> thread basis.
Ahhh =)
That explains this:
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-15-59-MEHOP3ZW.1485202585.png
Which is engine=psync on MD
and
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-19-56-9ieRvRZy.1485202817.png
which is engine=libaio on MD
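That per-thread reporting also makes the numbers easy to translate into machine-wide load. A back-of-the-envelope (my arithmetic, assuming the percentages are per thread as Andrey says, against the 176 hardware threads on this box):

```shell
# machine-wide load ~= per_thread_pct * threads / hw_threads
awk 'BEGIN {
  printf "libaio on MD (sys): %.0f%% of machine\n", 94.81 * 128 / 176;  # ~69%
  printf "psync  on MD (sys): %.0f%% of machine\n", 5.81 * 2800 / 176;  # ~92%
}'
```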
--
Ha. And I thought for a second the machine was going into "full magic
mode" ;)
Thanks,
Tobias
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:40 ` Tobias Oberstein
@ 2017-01-23 20:24 ` Sitsofe Wheeler
2017-01-23 21:22 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 20:24 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio
On 23 January 2017 at 19:40, Tobias Oberstein
<tobias.oberstein@gmail.com> wrote:
> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>
>> On 23 January 2017 at 18:33, Tobias Oberstein
>> <tobias.oberstein@gmail.com> wrote:
>>>
>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>> Mmh.
>>> Plus the strange behavior.
>>
>> Have you tried batching the IOs and controlling how many you are
>> reaping at any one time? See
>>
>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>> for some of the options for controlling this...
>
> Thanks! Nice.
>
> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
> randread)
>
> Individual NVMes: iops=7350.4K
> MD (RAID-0) over NVMes: iops=4112.8K
>
> The going up and down of IOPS is gone.
>
> It's becoming more apparent, I'd say, that there is an MD bottleneck though.
If you're "just" trying for higher IOPS you can also try gtod_reduce
(see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
). This subsumes things like disable_lat but you'll get fewer and less
accurate measurement stats back. With libaio userspace reap
(http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
) can sometimes nudge numbers up but at the cost of CPU.
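In job-file form, the two suggestions look like this (a sketch; both options are documented at the links above, and userspace_reap only applies to the libaio engine):

```ini
[global]
ioengine=libaio
gtod_reduce=1      ; cheaper timekeeping; disables most latency stats
userspace_reap=1   ; reap completions from userspace (needs iodepth > 1)
```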
--
Sitsofe | http://sucs.org/~sits/
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 20:24 ` Sitsofe Wheeler
@ 2017-01-23 21:22 ` Tobias Oberstein
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:22 UTC (permalink / raw)
To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio
Am 23.01.2017 um 21:24 schrieb Sitsofe Wheeler:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>
>>> Have you tried batching the IOs and controlling how much are you
>>> reaping at any one time? See
>>>
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent I'd say that there is an MD bottleneck though.
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
>
Using that option plus bumping to QD=64 and batch submit 16, I get
plain NVMes: iops=7415.9K
MD over NVMes: iops=4112.4K
These are staggering numbers for sure!
In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB IOPS.
Since we have 8 (physical) of these, the real-world measurement (7.4
million) is even above the datasheet figure (6.8 million).
I'd say: very good job, Intel =)
The price of course is the CPU load to reach these numbers .. we have
the 2nd largest Intel Xeon available
Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
and 4 of these .. and even that isn't enough to saturate these NVMe
beasts while still having room to do useful work (PostgreSQL).
So we're gonna be CPU bound .. again - this is the 2nd iteration of such
a box. The first one has 48 cores E7 v2 and 8 x P3700 2TB. Also CPU
bound on PostgreSQL anyway .. with 3TB RAM.
Cheers,
/Tobias
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon
Jan 23 22:12:30 2017
read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
cpu : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon
Jan 23 22:12:30 2017
read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
cpu : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
@ 2017-01-23 21:42 ` Andrey Kuzmin
2017-01-23 23:51 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-23 21:42 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: Jens Axboe, fio
[-- Attachment #1: Type: text/plain, Size: 4078 bytes --]
On Jan 24, 2017 00:22, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
Am 23.01.2017 um 21:24 schrieb Sitsofe Wheeler:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>
>> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>
>>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>
>>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>>
>>>
>>> Have you tried batching the IOs and controlling how much are you
>>> reaping at any one time? See
>>>
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>>
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent I'd say that there is an MD bottleneck
>> though.
>>
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
>
>
Using that option plus bumping to QD=64 and batch submit 16, I get
plain NVMes: iops=7415.9K
MD over NVMes: iops=4112.4K
These are staggering numbers for sure!
In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB IOPS.
Since we have 8 (physical) of these, the real-world measurement (7.4 million)
is even above the datasheet figure (6.8 million).
I'd say: very good job, Intel =)
The price of course is the CPU load to reach these numbers .. we have the
2nd largest Intel Xeon available
Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
and 4 of these .. and even that isn't enough to saturate these NVMe beasts
while still having room to do useful work (PostgreSQL).
The root cause behind the high cpu utilization is the IRQ load your eight
NVMe drives generate, although context switching across your 2048 threads
also adds a lot.
To cope with the unsustainable interrupt rate, you might want to give the
pvsync2 engine with the RWF_HIPRI option a shot, which turns on polling
mode in the block layer (Jens has been very much behind it, so he's the guy
in the know of the details).
Polling avoids interrupts at the price of somewhat inflated latency,
but reduces the cpu load noticeably, so it may turn out a good option for
your box specifically. Note that you'll need preadv2/pwritev2 syscall support
in your kernel.
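As a concrete sketch, a fio job exercising polled reads might look like this (a minimal example, assuming fio's pvsync2 engine and its hipri option, which sets RWF_HIPRI on each preadv2; the device name is illustrative):

```ini
[global]
ioengine=pvsync2
hipri=1          ; issue polled I/O via preadv2(..., RWF_HIPRI)
direct=1         ; polling requires O_DIRECT
bs=4k
rw=randread
runtime=30
time_based=1

[polled-randread]
filename=/dev/nvme0n1
numjobs=32
```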
Regards,
Andrey
So we're gonna be CPU bound .. again - this is the 2nd iteration of such a
box. The first one has 48 cores E7 v2 and 8 x P3700 2TB. Also CPU bound on
PostgreSQL anyway .. with 3TB RAM.
Cheers,
/Tobias
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon
Jan 23 22:12:30 2017
read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
cpu : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon Jan
23 22:12:30 2017
read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
cpu : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784
[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30
[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128
[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128
[-- Attachment #2: Type: text/html, Size: 6568 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
[not found] ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
@ 2017-01-23 21:49 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:49 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio
Hi Andrey,
> Thanks again for your tips .. the psync thingy in particular. I need to
> verify if that applies to PostgreSQL, because it brings huge gains compared
> to sync!
>
>
> That's easy to explain, it just does one syscall less per IO. It should
> indeed bring home a measurable gain as, with synchronous I/O, I believe
> you're cpu-limited.
Sadly, it seems PostgreSQL currently does lseek/read/write. (I'll double
check tomorrow running perf against an active PostgreSQL instance).
There was a patch discussed here using pread/pwrite when available:
https://www.postgresql.org/message-id/flat/CABUevEzZ%3DCGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q%40mail.gmail.com#CABUevEzZ=CGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q@mail.gmail.com
which ends with a comment by Tom Lane (PostgreSQL core developer)
"Well, my point remains that I see little value in messing with
long-established code if you can't demonstrate a benefit that's clearly
above the noise level."
=(
I will post the findings from our discussion here to the PG hackers
list. Maybe ...
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 21:42 ` Andrey Kuzmin
@ 2017-01-23 23:51 ` Tobias Oberstein
2017-01-24 8:21 ` Andrey Kuzmin
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 23:51 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: Jens Axboe, fio
> The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching across your 2048 threads
> also adds a lot.
Indeed, the ctx switches and interrupts are in the millions/sec.
With engine=sync and numjobs=2048, I have
ctx_sw: 8828446
inter: 5780374
It's astonishing that this is even possible.
> To cope with the unsustainable interrupt rate, you might want to give the
> pvsync2 engine with the RWF_HIPRI option a shot, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out a good option for
> your box specifically. Note that you'll need preadv2/pwritev2 syscall support
> in your kernel.
I have run an exhaustive number of 30 tests using the different engines,
including pvsync2 + hipri.
Please find everything here
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md
and in the containing folder there.
Using pvsync2 + hipri indeed changes the picture .. but not for the better =(
The machine completely bogs down and the IOPS doesn't get higher.
Sidenote: it would be nice if FIO logged the total CPU and interrupt rates ..
Here is a screenshot while running pvsync2+hipri
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png
--
My current preliminary conclusions on this box / workload:
- running psync is much better than sync
- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU
Cheers,
/Tobias
PS: I have a small time window left (days) until this box goes into
further setup for production (which means, I cannot scratch the storage
anymore) - if you have anything you want me to try, let me know. I do my
best to get it tested. The hardware is probably not mainstream ..
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 23:51 ` Tobias Oberstein
@ 2017-01-24 8:21 ` Andrey Kuzmin
2017-01-24 9:28 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24 8:21 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: fio, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 2341 bytes --]
On Jan 24, 2017 02:51, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching across your 2048 threads
> also adds a lot.
>
Indeed, the ctx switches and interrupts are in the millions/sec.
With engine=sync and numjobs=2048, I have
ctx_sw: 8828446
inter: 5780374
It's astonishing that this is even possible.
To cope with the unsustainable interrupt rate, you might want to give the
> pvsync2 engine with the RWF_HIPRI option a shot, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out a good option for
> your box specifically. Note that you'll need preadv2/pwritev2 syscall support
> in your kernel.
>
I have run an exhaustive number of 30 tests using the different engines,
including pvsync2 + hipri.
Please find everything here
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md
and in the containing folder there.
Using pvsync2 + hipri indeed changes the picture .. but not for the better =(
Surprising it didn't work for you since polling is very well suited for
your specific scenario.
The machine completely bogs down and the IOPS doesn't get higher.
Sidenote: it would be nice if FIO logged the total CPU and interrupt rates ..
Here is a screenshot while running pvsync2+hipri
http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png
--
My current preliminary conclusions on this box / workload:
- running psync is much better than sync
So you likely have a convincing case for Postgres guys to switch over to
pread/pwrite.
Regards,
Andrey
- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU
Cheers,
/Tobias
PS: I have a small time window left (days) until this box goes into further
setup for production (which means, I cannot scratch the storage anymore) -
if you have anything you want me to try, let me know. I do my best to get
it tested. The hardware is probably not mainstream ..
[-- Attachment #2: Type: text/html, Size: 4133 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 8:21 ` Andrey Kuzmin
@ 2017-01-24 9:28 ` Tobias Oberstein
2017-01-24 9:40 ` Andrey Kuzmin
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:28 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio, Jens Axboe
> My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.
I will approach them, but I want to make sure I did all my homework first.
One question that bugs me:
the difference in performance between sync and psync engines only
surfaces with MD, _not_ when running over individual devices.
---
I ran Linux perf with these results:
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md
---
md-nvmes-sync shows the "issue":
Overhead Command Shared Object Symbol
73.48% fio [kernel.kallsyms] [k] osq_lock
So while I think it would be good in general if PostgreSQL used
pread/pwrite instead of lseek/read/write when available, I am afraid
there might be a bottleneck in MD.
What do you think?
And if so, where should I raise this rgd MD? I have no clue where the
hackers of MD hang out ..
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 9:28 ` Tobias Oberstein
@ 2017-01-24 9:40 ` Andrey Kuzmin
2017-01-24 22:51 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24 9:40 UTC (permalink / raw)
To: Tobias Oberstein; +Cc: fio, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]
On Jan 24, 2017 12:28, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:
My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.
>
I will approach them, but I want to make sure I did all my homework first.
One question that bugs me:
the difference in performance between sync and psync engines only surfaces
with MD, _not_ when running over individual devices.
My guess is, with individual devices there's no cpu headroom for the
syscall savings to show up. Once the MD bottleneck kicks in, you're not bound
by cpu anymore and the difference between doing a single syscall vs. two shows up.
---
I ran Linux perf with these results:
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md
https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md
---
md-nvmes-sync shows the "issue":
Overhead Command Shared Object Symbol
73.48% fio [kernel.kallsyms] [k] osq_lock
So while I think it would be good in general if PostgreSQL used
pread/pwrite instead of lseek/read/write when available, I am afraid there
might be a bottleneck in MD.
What do you think?
And if so, where should I raise this rgd MD? I have no clue where the
hackers of MD hang out ..
Yup, I believe it makes sense to post to the md mail list.
Regards,
Andrey
Cheers,
/Tobias
[-- Attachment #2: Type: text/html, Size: 3616 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
@ 2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:46 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio
Hi Andrey,
Am 23.01.2017 um 20:06 schrieb Kudryavtsev, Andrey O:
> Hi Tobias,
> Yes, “imsm” is in the generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the RAID metadata.
>
> Your findings follow my expectations; for QD1 the sync engine does good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and this can really help; see the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>
I have gone through the whole manual, but I cannot find info about the
meaning of different LBAFormats.
The Oracle article above uses
LBAFormat=3
which I presume means 4k sector size.
The P3608 seems to support a value up to 6:
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all
-intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6
So is this the correct mapping for the value?
LBAFormat    Sector Size
0            512
1            1024
2            2048
3            4096
4            8192
5            16384
6            32768
In this case, I'd use
LBAFormat=4
to get 8k sectors, since my workload is purely 8k.
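As a cross-check against isdct, the supported LBA formats can also be listed with nvme-cli (a sketch, assuming nvme-cli is installed and nvme0n1 is one of the P3608 namespaces):

```shell
# Each "LBA Format" line of the human-readable namespace dump shows
# that format's data (sector) size and which one is in use.
sudo nvme id-ns /dev/nvme0n1 --human-readable | grep -i "LBA Format"
```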
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
@ 2017-01-24 9:55 ` Tobias Oberstein
2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 9:55 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio
Am 23.01.2017 um 20:06 schrieb Kudryavtsev, Andrey O:
> Hi Tobias,
> Yes, “imsm” is in the generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the RAID metadata.
>
> Your findings follow my expectations; for QD1 the sync engine does good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and this can really help; see the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>
It doesn't work =(
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start
-nvmeformat -intelssd 0 \
> LBAFormat=4 \
> SecureEraseSetting=0 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : NVMe command reported a problem.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all
-intelssd 0
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
AggregationThreshold : 0
AggregationTime : 0
ArbitrationBurst : 0
Bootloader : 8B1B0133
CoalescingDisable : 1
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
EndToEndDataProtCapabilities : 17
EnduranceAnalyzer : Media Workload Indicators have reset values. Run 60+
minute workload prior to running the endurance analyzer.
ErrorString :
Firmware : 8DV101F0
FirmwareUpdateAvailable : The selected Intel SSD contains current
firmware as of this tool release.
HighPriorityWeightArbitration : 0
IOCompletionQueuesRequested : 30
IOSubmissionQueuesRequested : 30
Index : 0
Intel : True
IntelGen3SATA : False
IntelNVMe : True
InterruptVector : 0
LBAFormat : 0
LatencyTrackingEnabled : False
LowPriorityWeightArbitration : 0
MaximumLBA : 3907029167
MediumPriorityWeightArbitration : 0
MetadataSetting : 0
ModelNumber : INTEL SSDPECME040T4
NVMeControllerID : 0
NVMeMajorVersion : 1
NVMeMinorVersion : 0
NVMePowerState : 0
NVMeTertiaryVersion : 0
NamespaceId : 1
NativeMaxLBA : 3907029167
NumErrorLogPageEntries : 63
NumLBAFormats : 6
OEM : Generic
PCILinkGenSpeed : 3
PCILinkWidth : 4
PowerGovernorMode : 0 40W for 8 Lane Slot power
Product : Fultondale X8
ProductFamily : Intel SSD DC P3608 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : 106
SectorSize : 512
SerialNumber : CVF8551400324P0DGN-1
TCGSupported : False
TempThreshold : 85
TimeLimitedErrorRecovery : 0
TrimSupported : True
VolatileWriteCacheEnabled : False
WriteAtomicityDisableNormal : 0
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct --version
Syntax Error: Invalid command. Error at or around '--version'.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.2
Description: Interact and configure Intel SSDs.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
@ 2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 10:03 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio
Am 23.01.2017 um 20:06 schrieb Kudryavtsev, Andrey O:
> Hi Tobias,
> Yes, “imsm” is in the generic release, you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the RAID metadata.
>
> Your findings follow my expectations; for QD1 the sync engine does good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running Centos7 either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and this can really help; see the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>
It doesn't work with LBAFormat=3 either:
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start
-nvmeformat -intelssd 0 \
> LBAFormat=3 \
> SecureEraseSetting=0 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : Interrupted system call
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all
-intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6
-----
And using exactly the same parameters as the article above:
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ time sudo isdct start
-nvmeformat -intelssd 0 \
> LBAFormat=3 \
> SecureEraseSetting=2 \
> ProtectionInformation=0 \
> MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...
- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -
Status : Interrupted system call
real 0m26.901s
user 0m0.048s
sys 0m0.032s
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$
----
I see the following in kernel log:
[417528.128501] nvme nvme0: I/O 0 QID 0 timeout, reset controller
[417786.440977] nvme nvme0: I/O 0 QID 0 timeout, reset controller
What should I do?
Thanks a lot,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-23 19:06 ` Kudryavtsev, Andrey O
` (2 preceding siblings ...)
2017-01-24 10:03 ` Tobias Oberstein
@ 2017-01-24 15:19 ` Tobias Oberstein
3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 15:19 UTC (permalink / raw)
To: Kudryavtsev, Andrey O, fio
Hi Andrey,
> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
After overcoming my issues with isdct, and reformatting the NVMes to 4k
sector size, success!
9.5 million IOPS =)
This is another 34% faster than before.
So: thanks a bunch for your tip!
Cheers,
/Tobias
Next steps:
- approach MD developers about bottlenecks there
- approach PostgreSQL about using pread/pwrite (instead of lseek/read/write)
randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 128 threads
Jobs: 128 (f=2048): [r(128)] [100.0% done] [37244MB/0KB/0KB /s]
[9534K/0/0 iops] [eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=25406: Tue
Jan 24 15:57:19 2017
read : io=1083.9GB, bw=36964MB/s, iops=9462.8K, runt= 30026msec
cpu : usr=9.00%, sys=77.01%, ctx=49252920, majf=0, minf=16512
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 9:40 ` Andrey Kuzmin
@ 2017-01-24 22:51 ` Tobias Oberstein
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 22:51 UTC (permalink / raw)
To: Andrey Kuzmin; +Cc: fio, Jens Axboe
> My current preliminary conclusions on this box / workload:
>>
>> - running psync is much better than sync
>>
>> So you likely have a convincing case for Postgres guys to switch over to
>> pread/pwrite.
I did raise it on the PG hackers mailing list, but I couldn't convince
them =(
Pity, since there even was a patch in the past (the change seems to be
easy, but was rejected).
They say I would need to come up with a real-world PostgreSQL database
workload that shows this effect is above the noise level.
And since PostgreSQL is such a CPU hog anyway, and since I don't have
time for a full research project, I leave it.
---
But, I did more FIO level benchmarking to compare the efficiency of
different IO methods:
Here are more numbers that quantify the differences of the IO method used.
ioengine       sync    psync   vsync   pvsync  pvsync2  pvsync2+hipri
iodepth           1        1       1        1        1        1
numjobs        1024     1024    1024     1024     1024     1024
concurrency    1024     1024    1024     1024     1024     1024
iops (k)       9171     9390    9196     9473     9527     9516
user            7.7      9.3     8.6      9.0      9.3      2.6
system         86.8     77.0    85.8     76.3     77.3     97.4
total          94.5     86.3    94.4     85.3     86.6    100.0
iops/system   105.7    121.9   107.2    124.2    123.2     97.7
As can be seen, the kIOPS normalized to system CPU load (last line) for
psync (pread/pwrite) is significantly higher than for sync
(lseek/read/write).
Now here is AIO:
ioengine      libaio   libaio   libaio
iodepth           32       32       32
numjobs          128       64       32
concurrency     4096     2048     1024
iops (k)      9485.6   9479.4   8718.1
user             6.7      3.4      2.4
system          59.2     30.0     16.7
total           65.9     33.4     19.1
iops/system    160.2    316.0    522.0
The highest kIOPS/system is reached at a concurrency of 1024.
However, during my tests, I get this in kernel log:
[459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for 22s!
[swapper/46:0]
[461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s!
[swapper/26:0]
[461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s!
[swapper/23:0]
A wild guess: these lockups are actually deadlocks. AIO seems to be
tricky for the kernel too.
Cheers,
/Tobias
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-24 22:51 ` Tobias Oberstein
@ 2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
2017-01-26 17:52 ` Tobias Oberstein
0 siblings, 1 reply; 27+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-01-25 16:23 UTC (permalink / raw)
To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio, Jens Axboe
> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Tobias Oberstein
> Sent: Tuesday, January 24, 2017 4:52 PM
> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
>
> However, during my tests, I get this in kernel log:
>
> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for
> 22s!
> [swapper/46:0]
> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for
> 22s!
> [swapper/26:0]
> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for
> 22s!
> [swapper/23:0]
>
> A wild guess: these lockups are actually deadlocks. AIO seems to be
> tricky for the kernel too.
>
Probably not deadlocks. One easy way to trigger those is to submit
IOs on one set of CPUs and expect a different set of CPUs to handle
the interrupts and completions. The latter CPUs can easily become
overwhelmed. The best remedy I've found is to require CPUs to handle
their own IOs, which self-throttles them from submitting more IOs
than they can handle.
The storage device driver needs to set up its hardware interrupts
that way. Then, rq_affinity=2 ensures the block layer completions
are handled on the submitting CPU.
You can add this to the kernel command line (e.g., in
/boot/grub/grub.conf) to squelch those checks:
nosoftlockup
Those prints themselves can induce more soft lockups if you have a
live serial port, because printing to the serial port is slow
and blocking.
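Applied to the devices in this thread, the rq_affinity change could be scripted as follows. This is only a sketch, assuming the /dev/nvme*n1 naming shown above: it prints the sysfs writes rather than performing them, so they can be reviewed and then piped to a root shell.

```shell
#!/bin/sh
# Sketch: emit the sysfs writes that force block-layer completions onto
# the submitting CPU (rq_affinity=2) for each named NVMe namespace.
# Review the output, then pipe it to a root shell to apply.
set -eu

gen_rq_affinity_cmds() {
    for dev in "$@"; do
        printf 'echo 2 > /sys/block/%s/queue/rq_affinity\n' "$dev"
    done
}

gen_rq_affinity_cmds nvme0n1 nvme1n1 nvme2n1 nvme3n1
```

For the nosoftlockup suggestion, the parameter would be appended to the kernel command line, e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub, and takes effect on the next boot.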
* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
@ 2017-01-26 17:52 ` Tobias Oberstein
0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-26 17:52 UTC (permalink / raw)
To: Elliott, Robert (Persistent Memory); +Cc: fio, Jens.Wilke@parcIT.de
Hi Robert,
On 25.01.2017 at 17:23, Elliott, Robert (Persistent Memory) wrote:
>
>
>> -----Original Message-----
>> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
>> Behalf Of Tobias Oberstein
>> Sent: Tuesday, January 24, 2017 4:52 PM
>> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
>> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
>> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
>>
>> However, during my tests, I get this in kernel log:
>>
>> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for
>> 22s!
>> [swapper/46:0]
>> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for
>> 22s!
>> [swapper/26:0]
>> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for
>> 22s!
>> [swapper/23:0]
>>
>> A wild guess: these lockups are actually deadlocks. AIO seems to be
>> tricky for the kernel too.
>>
>
> Probably not deadlocks. One easy way to trigger those is to submit
> IOs on one set of CPUs and expect a different set of CPUs to handle
> the interrupts and completions. The latter CPUs can easily become
> overwhelmed. The best remedy I've found is to require CPUs to handle
> their own IOs, which self-throttles them from submitting more IOs
> than they can handle.
>
> The storage device driver needs to set up its hardware interrupts
> that way. Then, rq_affinity=2 ensures the block layer completions
> are handled on the submitting CPU.
>
> You can add this to the kernel command line (e.g., in
> /boot/grub/grub.conf) to squelch those checks:
> nosoftlockup
>
> Those prints themselves can induce more soft lockups if you have a
> live serial port, because printing to the serial port is slow
> and blocking.
>
Thanks a lot for your tips!
Indeed, we currently have rq_affinity=1.
Are there any risks involved?
I mean, this is a complex box .. please see below.
Also: sadly, not every NUMA socket has exactly 2 NVMes (due to
mainboard / slot limitations). So wouldn't enforcing IO affinity be a
problem with this setup?
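For what it's worth, fio can express such per-node pinning directly in the jobfile; below is a hypothetical fragment (node and device assignments are illustrative only, not taken from this box - the real mapping can be read from /sys/block/nvmeXn1/device/numa_node):

```ini
; Hypothetical sketch: pin each per-device job to the NUMA node its NVMe
; hangs off. Node numbers below are illustrative, not this box's layout.
[global]
ioengine=libaio
direct=1
bs=4k
rw=randread
iodepth=32

[nvme0]
filename=/dev/nvme0n1
numa_cpu_nodes=0
numa_mem_policy=bind:0

[nvme4]
filename=/dev/nvme4n1
numa_cpu_nodes=1
numa_mem_policy=bind:1
```

An uneven NVMe-to-socket mapping doesn't prevent this kind of pinning; it only means some nodes carry more device jobs than others.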
Cheers,
/Tobias
PS: The mainboard is
https://www.supermicro.nl/products/motherboard/Xeon/C600/X10QBI.cfm
Yeah, I know, no offense - this particular piece isn't HPE ;)
The current settings / hardware:
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/rq_affinity
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/scheduler
none
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/optimal_io_size
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/iostats
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
128
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/hw_sector_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/physical_block_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/nomerges
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/io_poll
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/minimum_io_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/write_cache
write through
oberstet@svr-psql19:~$ cat /proc/cpuinfo | grep "Intel(R) Xeon(R) CPU
E7-8880 v4 @ 2.20GHz" | wc -l
176
oberstet@svr-psql19:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 88
89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
node 0 size: 773944 MB
node 0 free: 770949 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
42 43 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
126 127 128 129 130 131
node 1 size: 774137 MB
node 1 free: 762335 MB
node 2 cpus: 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 65 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
148 149 150 151 152 153
node 2 size: 774126 MB
node 2 free: 763220 MB
node 3 cpus: 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
86 87 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
170 171 172 173 174 175
node 3 size: 774136 MB
node 3 free: 770518 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
oberstet@svr-psql19:~$ find /sys/devices | egrep 'nvme[0-9][0-9]?$'
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:02.0/0000:0a:00.0/nvme/nvme3
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:01.0/0000:09:00.0/nvme/nvme2
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:01.0/0000:05:00.0/nvme/nvme0
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:02.0/0000:06:00.0/nvme/nvme1
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:02.0/0000:86:00.0/nvme/nvme9
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:01.0/0000:85:00.0/nvme/nvme8
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:01.0/0000:48:00.0/nvme/nvme6
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:02.0/0000:49:00.0/nvme/nvme7
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:02.0/0000:44:00.0/nvme/nvme5
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:01.0/0000:43:00.0/nvme/nvme4
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:02.0/0000:c8:00.0/nvme/nvme13
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:01.0/0000:c7:00.0/nvme/nvme12
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:01.0/0000:c3:00.0/nvme/nvme10
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:02.0/0000:c4:00.0/nvme/nvme11
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:02.0/0000:cc:00.0/nvme/nvme15
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:01.0/0000:cb:00.0/nvme/nvme14
oberstet@svr-psql19:~$ egrep -H '.*' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0/address:0000:01:00
/sys/bus/pci/slots/10/address:0000:c5:00
/sys/bus/pci/slots/11/address:0000:c9:00
/sys/bus/pci/slots/1/address:0000:03:00
/sys/bus/pci/slots/2/address:0000:07:00
/sys/bus/pci/slots/3/address:0000:46:00
/sys/bus/pci/slots/4/address:0000:41:00
/sys/bus/pci/slots/5/address:0000:45:00
/sys/bus/pci/slots/6/address:0000:81:00
/sys/bus/pci/slots/7/address:0000:82:00
/sys/bus/pci/slots/8/address:0000:c1:00
/sys/bus/pci/slots/9/address:0000:83:00
end of thread, other threads:[~2017-01-26 17:52 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
[not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
2017-01-23 17:52 ` Tobias Oberstein
[not found] ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
2017-01-23 18:33 ` Tobias Oberstein
2017-01-23 19:10 ` Kudryavtsev, Andrey O
2017-01-23 19:26 ` Tobias Oberstein
2017-01-23 19:13 ` Sitsofe Wheeler
2017-01-23 19:40 ` Tobias Oberstein
2017-01-23 20:24 ` Sitsofe Wheeler
2017-01-23 21:22 ` Tobias Oberstein
[not found] ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
2017-01-23 21:42 ` Andrey Kuzmin
2017-01-23 23:51 ` Tobias Oberstein
2017-01-24 8:21 ` Andrey Kuzmin
2017-01-24 9:28 ` Tobias Oberstein
2017-01-24 9:40 ` Andrey Kuzmin
2017-01-24 22:51 ` Tobias Oberstein
2017-01-25 16:23 ` Elliott, Robert (Persistent Memory)
2017-01-26 17:52 ` Tobias Oberstein
[not found] ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2017-01-23 20:10 ` Tobias Oberstein
[not found] ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
[not found] ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
2017-01-23 20:20 ` Tobias Oberstein
[not found] ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
[not found] ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
2017-01-23 21:49 ` Tobias Oberstein
2017-01-23 18:18 ` Kudryavtsev, Andrey O
2017-01-23 18:53 ` Tobias Oberstein
2017-01-23 19:06 ` Kudryavtsev, Andrey O
2017-01-24 9:46 ` Tobias Oberstein
2017-01-24 9:55 ` Tobias Oberstein
2017-01-24 10:03 ` Tobias Oberstein
2017-01-24 15:19 ` Tobias Oberstein