* 4x lower IOPS: Linux MD vs indiv. devices - why?
@ 2017-01-23 16:26 Tobias Oberstein
       [not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
  2017-01-23 18:18 ` Kudryavtsev, Andrey O
  0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 16:26 UTC (permalink / raw)
  To: fio

Hi,

I have a question regarding Linux software RAID (MD) as tested with FIO - 
so this is slightly OT, but I am hoping for expert advice or redirection 
to a more appropriate place (if this is unwelcome here).

I have a box with this HW:

- 88 cores Xeon E7 (176 HTs) + 3TB RAM
- 8 x Intel P3608 4TB NVMe (which is logically 16 NVMes)

With random 4kB read load, I am able to max it out at 7 million IOPS - 
but only if I run FIO on the _individual_ NVMe devices.

[global]
group_reporting
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120

[randread]
stonewall
rw=randread
numjobs=2560

When I create a stripe set over all devices:

sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
    /dev/nvme0n1 \
    /dev/nvme1n1 \
    /dev/nvme2n1 \
    /dev/nvme3n1 \
    /dev/nvme4n1 \
    /dev/nvme5n1 \
    /dev/nvme6n1 \
    /dev/nvme7n1 \
    /dev/nvme8n1 \
    /dev/nvme9n1 \
    /dev/nvme10n1 \
    /dev/nvme11n1 \
    /dev/nvme12n1 \
    /dev/nvme13n1 \
    /dev/nvme14n1 \
    /dev/nvme15n1

I only get 1.6 million IOPS. Detailed results are below.

Note: the array is created with an 8K chunk size because this is for a 
database workload. Here I tested with 4k block size, but it's similar 
(lower performance on MD) with 8k.
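
For reference, the resulting array geometry (chunk size, member count) 
can be double-checked with e.g.:

cat /proc/mdstat
sudo mdadm --detail /dev/md1 | grep -i chunk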

Any help or hints would be greatly appreciated!

Cheers,
/Tobias



7 million IOPS on raw, individual NVMe devices
==============================================

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, 
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2367 (f=29896): 
[_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0 
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23 
15:47:17 2017
    read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
     clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
      lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
     clat percentiles (usec):
      |  1.00th=[  114],  5.00th=[  135], 10.00th=[  149], 20.00th=[  171],
      | 30.00th=[  191], 40.00th=[  213], 50.00th=[  239], 60.00th=[  270],
      | 70.00th=[  314], 80.00th=[  378], 90.00th=[  556], 95.00th=[  980],
      | 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
      | 99.99th=[ 8096]
     lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
     lat (usec) : 1000=1.79%
     lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
   cpu          : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s 
(28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec

Disk stats (read/write):
   nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0, 
in_queue=14802400, util=100.00%
   nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0, 
in_queue=15101276, util=100.00%
   nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0, 
in_queue=12053112, util=100.00%
   nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0, 
in_queue=11135004, util=100.00%
   nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0, 
in_queue=21079576, util=100.00%
   nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0, 
in_queue=19393024, util=100.00%
   nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0, 
in_queue=20140104, util=100.00%
   nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0, 
in_queue=21090048, util=100.00%
   nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0, 
in_queue=14929172, util=100.00%
   nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0, 
in_queue=13919288, util=100.00%
   nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0, 
in_queue=11390392, util=100.00%
   nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0, 
in_queue=20110288, util=100.00%
   nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0, 
in_queue=11683568, util=100.00%
   nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0, 
in_queue=16314628, util=100.00%
   nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0, 
in_queue=27659920, util=100.00%
   nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0, 
in_queue=17910636, util=100.00%


1.6 million IOPS on Linux MD over 16 NVMe devices
=================================================

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, 
ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0 
IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23 
17:21:15 2017
    read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
     clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
      lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
     clat percentiles (usec):
      |  1.00th=[   78],  5.00th=[   84], 10.00th=[   86], 20.00th=[   89],
      | 30.00th=[   95], 40.00th=[  102], 50.00th=[  105], 60.00th=[  108],
      | 70.00th=[  118], 80.00th=[  133], 90.00th=[  173], 95.00th=[  221],
      | 99.00th=[  358], 99.50th=[  506], 99.90th=[ 2192], 99.95th=[ 2608],
      | 99.99th=[ 2960]
     lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
     lat (usec) : 1000=0.07%
     lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
   cpu          : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s 
(6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec

Disk stats (read/write):
     md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0, 
aggrin_queue=1247601, aggrutil=100.00%
   nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0, 
in_queue=1225896, util=100.00%
   nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0, 
in_queue=1191452, util=100.00%
   nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0, 
in_queue=1296728, util=100.00%
   nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0, 
in_queue=1239808, util=100.00%
   nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0, 
in_queue=1272916, util=100.00%
   nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0, 
in_queue=1178360, util=100.00%
   nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0, 
in_queue=1207808, util=100.00%
   nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0, 
in_queue=1258956, util=100.00%
   nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0, 
in_queue=1304536, util=100.00%
   nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0, 
in_queue=1281952, util=100.00%
   nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0, 
in_queue=1271820, util=100.00%
   nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0, 
in_queue=1224192, util=100.00%
   nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0, 
in_queue=1214240, util=100.00%
   nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0, 
in_queue=1242372, util=100.00%
   nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0, 
in_queue=1277600, util=100.00%
   nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0, 
in_queue=1272988, util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 17:52   ` Tobias Oberstein
       [not found]     ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 17:52 UTC (permalink / raw)
  To: Andrey Kuzmin, fio

On 23.01.2017 at 18:03, Andrey Kuzmin wrote:
 > Why don't you just 'perf' your md run and find out where it spends (an
 > awful lot of extra) time?

Good idea!

I ran with threads=1024 (to account for perf overhead). Already at that 
concurrency, Linux MD shows 25% lower IOPS and higher system load.

Please see here:

https://github.com/oberstet/scratchbox/tree/master/cruncher/sql19/linux-md-bottleneck
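
For reproducing the profile, a system-wide perf capture taken while the 
fio job is running is enough, e.g. (exact options may differ from what I 
used):

sudo perf record -a -g -- sleep 30    # sample all CPUs with call graphs while fio runs
sudo perf report --stdio              # see where kernel time goes (e.g. osq_lock)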

With higher concurrency, the discrepancy widens, up to 7 million vs 1.6 
million IOPS.

I am not a kernel hacker.

What is osq_lock?

FWIW, this is a NUMA machine with 4 x E7 (88 cores / 176 HT) and 8 x 
Intel P3608 NVMe.

Any hints or anything I should try / measure?

Thanks a lot for your tips and assistance!

Cheers,
/Tobias

>
> On Jan 23, 2017 19:28, "Tobias Oberstein" <tobias.oberstein@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a question rgd Linux software RAID (MD) as tested with FIO - so
>> this is slightly OT, but I am hoping for expert advice or redirection to a
>> more appropriate place (if this is unwelcome here).
>>
>> I have a box with this HW:
>>
>> - 88 cores Xeon E7 (176 HTs) + 3TB RAM
>> - 8 x Intel P3608 4TB NVMe (which is logicall 16 NVMes)
>>
>> With random 4kB read load, I am able to max it out at 7 million IOPS - but
>> only if I run FIO on the _individual_ NVMe devices.
>>
>> [global]
>> group_reporting
>> filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1
>> :/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/
>> nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/
>> nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
>> size=30G
>> ioengine=sync
>> iodepth=1
>> thread=1
>> direct=1
>> time_based=1
>> randrepeat=0
>> norandommap=1
>> bs=4k
>> runtime=120
>>
>> [randread]
>> stonewall
>> rw=randread
>> numjobs=2560
>>
>> When I create a stripe set over all devices:
>>
>> sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
>>    /dev/nvme0n1 \
>>    /dev/nvme1n1 \
>>    /dev/nvme2n1 \
>>    /dev/nvme3n1 \
>>    /dev/nvme4n1 \
>>    /dev/nvme5n1 \
>>    /dev/nvme6n1 \
>>    /dev/nvme7n1 \
>>    /dev/nvme8n1 \
>>    /dev/nvme9n1 \
>>    /dev/nvme10n1 \
>>    /dev/nvme11n1 \
>>    /dev/nvme12n1 \
>>    /dev/nvme13n1 \
>>    /dev/nvme14n1 \
>>    /dev/nvme15n1
>>
>> I only get 1.6 million IOPS. Detail results down below.
>>
>> Note: the array is created with chunk size 8K because this is for database
>> workload. Here I tested with 4k block size, but the it's similar (lower
>> perf on MD) with 8k
>>
>> Any helps or hints would be greatly appreciated!
>>
>> Cheers,
>> /Tobias
>>
>>
>>
>> 7 million IOPS on raw, individual NVMe devices
>> ==============================================
>>
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
>> /opt/fio/bin/fio postgresql_storage_workload.fio
>> randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
>> ioengine=sync, iodepth=1
>> ...
>> fio-2.17-17-g9cf1
>> Starting 2560 threads
>> Jobs: 2367 (f=29896): [_(2),f(3),_(2),f(11),_(2),f(2
>> ),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(
>> 1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(
>> 2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(
>> 1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_
>> (1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(
>> 11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(
>> 1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(
>> 134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30)
>> ,_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),
>> f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(
>> 1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(
>> 15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(
>> 1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_
>> (1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_
>> (1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(
>> 22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_
>> (1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(
>> 11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1)
>> ,f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),
>> _(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(
>> 45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1)
>> ,f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),
>> _(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(
>> 18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=
>> 21.1GiB/s,w=0KiB/s][r=5751k,w=0 IOPS][eta 00m:00s]
>> randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23 15:47:17
>> 2017
>>    read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
>>     clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
>>      lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
>>     clat percentiles (usec):
>>      |  1.00th=[  114],  5.00th=[  135], 10.00th=[  149], 20.00th=[  171],
>>      | 30.00th=[  191], 40.00th=[  213], 50.00th=[  239], 60.00th=[  270],
>>      | 70.00th=[  314], 80.00th=[  378], 90.00th=[  556], 95.00th=[  980],
>>      | 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
>>      | 99.99th=[ 8096]
>>     lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
>>     lat (usec) : 1000=1.79%
>>     lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
>>   cpu          : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> =64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>>      issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>    READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s (28.6GB/s-28.6GB/s),
>> io=3189GiB (3424GB), run=120007-120007msec
>>
>> Disk stats (read/write):
>>   nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0, in_queue=14802400,
>> util=100.00%
>>   nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0, in_queue=15101276,
>> util=100.00%
>>   nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0, in_queue=12053112,
>> util=100.00%
>>   nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0, in_queue=11135004,
>> util=100.00%
>>   nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0, in_queue=21079576,
>> util=100.00%
>>   nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0, in_queue=19393024,
>> util=100.00%
>>   nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0, in_queue=20140104,
>> util=100.00%
>>   nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0, in_queue=21090048,
>> util=100.00%
>>   nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0, in_queue=14929172,
>> util=100.00%
>>   nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0, in_queue=13919288,
>> util=100.00%
>>   nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0,
>> in_queue=11390392, util=100.00%
>>   nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0,
>> in_queue=20110288, util=100.00%
>>   nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0,
>> in_queue=11683568, util=100.00%
>>   nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0,
>> in_queue=16314628, util=100.00%
>>   nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0,
>> in_queue=27659920, util=100.00%
>>   nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0,
>> in_queue=17910636, util=100.00%
>>
>>
>> 1.6 millions IOPS on Linux MD over 16 NVMe devices
>> ==================================================
>>
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo
>> /opt/fio/bin/fio postgresql_storage_workload.fio
>> randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B,
>> ioengine=sync, iodepth=1
>> ...
>> fio-2.17-17-g9cf1
>> Starting 2560 threads
>> Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0
>> IOPS][eta 00m:00s]
>> randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23 17:21:15
>> 2017
>>    read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
>>     clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
>>      lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
>>     clat percentiles (usec):
>>      |  1.00th=[   78],  5.00th=[   84], 10.00th=[   86], 20.00th=[   89],
>>      | 30.00th=[   95], 40.00th=[  102], 50.00th=[  105], 60.00th=[  108],
>>      | 70.00th=[  118], 80.00th=[  133], 90.00th=[  173], 95.00th=[  221],
>>      | 99.00th=[  358], 99.50th=[  506], 99.90th=[ 2192], 99.95th=[ 2608],
>>      | 99.99th=[ 2960]
>>     lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
>>     lat (usec) : 1000=0.07%
>>     lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
>>   cpu          : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> =64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>>      issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>    READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s (6505MB/s-6505MB/s),
>> io=728GiB (781GB), run=120098-120098msec
>>
>> Disk stats (read/write):
>>     md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
>> aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0,
>> aggrin_queue=1247601, aggrutil=100.00%
>>   nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0, in_queue=1225896,
>> util=100.00%
>>   nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0, in_queue=1191452,
>> util=100.00%
>>   nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0, in_queue=1296728,
>> util=100.00%
>>   nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0, in_queue=1239808,
>> util=100.00%
>>   nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0, in_queue=1272916,
>> util=100.00%
>>   nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0, in_queue=1178360,
>> util=100.00%
>>   nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0, in_queue=1207808,
>> util=100.00%
>>   nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0, in_queue=1258956,
>> util=100.00%
>>   nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0, in_queue=1304536,
>> util=100.00%
>>   nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0, in_queue=1281952,
>> util=100.00%
>>   nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0, in_queue=1271820,
>> util=100.00%
>>   nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0, in_queue=1224192,
>> util=100.00%
>>   nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0, in_queue=1214240,
>> util=100.00%
>>   nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0, in_queue=1242372,
>> util=100.00%
>>   nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0, in_queue=1277600,
>> util=100.00%
>>   nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0, in_queue=1272988,
>> util=100.00%
>> oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
>>
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
       [not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
@ 2017-01-23 18:18 ` Kudryavtsev, Andrey O
  2017-01-23 18:53   ` Tobias Oberstein
  1 sibling, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 18:18 UTC (permalink / raw)
  To: Tobias Oberstein, fio

Hi Tobias, 
MDRAID overhead is always there, but you can play with some tuning knobs. 
I recommend the following: 
1. Use many threads/jobs with a fairly high QD configuration. The highest IOPS on Intel P3xxx drives are achieved when you saturate them with 128 4k IOs in flight per drive. That can be done with 32 jobs at QD4, 16 jobs at QD8, and so on; with MDRAID on top of that, multiply by the number of drives in the array (see the sketch below). So I think the current problem is simply that you’re not submitting enough IOs. 
2. Changing the SSD’s hardware sector size to 4k may also help, if you’re sure your workload is always 4k granular. 
3. Finally, use the “imsm” MDRAID extensions and the latest mdadm build. 
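
As a rough, untested sketch of point 1 against the md device (one of the 
combinations above: QD4 with 32 jobs per drive, times 16 drives):

[global]
filename=/dev/md1
ioengine=libaio
iodepth=4
direct=1
bs=4k
group_reporting
time_based=1
runtime=120

[randread-md]
rw=randread
numjobs=512      # 32 jobs per drive x 16 drives; at QD4 this is 128 in-flight IOs per drive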

See some other hints there:
http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
 
some config examples for NVMe are here:
https://github.com/01org/fiovisualizer/tree/master/Workloads


-- 
Andrey Kudryavtsev, 

SSD Solution Architect
Intel Corp. 
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 8:26 AM, "fio-owner@vger.kernel.org on behalf of Tobias Oberstein" <fio-owner@vger.kernel.org on behalf of tobias.oberstein@gmail.com> wrote:

    Hi,
    
    I have a question rgd Linux software RAID (MD) as tested with FIO - so 
    this is slightly OT, but I am hoping for expert advice or redirection to 
    a more appropriate place (if this is unwelcome here).
    
    I have a box with this HW:
    
    - 88 cores Xeon E7 (176 HTs) + 3TB RAM
    - 8 x Intel P3608 4TB NVMe (which is logicall 16 NVMes)
    
    With random 4kB read load, I am able to max it out at 7 million IOPS - 
    but only if I run FIO on the _individual_ NVMe devices.
    
    [global]
    group_reporting
    filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
    size=30G
    ioengine=sync
    iodepth=1
    thread=1
    direct=1
    time_based=1
    randrepeat=0
    norandommap=1
    bs=4k
    runtime=120
    
    [randread]
    stonewall
    rw=randread
    numjobs=2560
    
    When I create a stripe set over all devices:
    
    sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
        /dev/nvme0n1 \
        /dev/nvme1n1 \
        /dev/nvme2n1 \
        /dev/nvme3n1 \
        /dev/nvme4n1 \
        /dev/nvme5n1 \
        /dev/nvme6n1 \
        /dev/nvme7n1 \
        /dev/nvme8n1 \
        /dev/nvme9n1 \
        /dev/nvme10n1 \
        /dev/nvme11n1 \
        /dev/nvme12n1 \
        /dev/nvme13n1 \
        /dev/nvme14n1 \
        /dev/nvme15n1
    
    I only get 1.6 million IOPS. Detail results down below.
    
    Note: the array is created with chunk size 8K because this is for 
    database workload. Here I tested with 4k block size, but the it's 
    similar (lower perf on MD) with 8k
    
    Any helps or hints would be greatly appreciated!
    
    Cheers,
    /Tobias
    
    
    
    7 million IOPS on raw, individual NVMe devices
    ==============================================
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
    /opt/fio/bin/fio postgresql_storage_workload.fio
    randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, 
    ioengine=sync, iodepth=1
    ...
    fio-2.17-17-g9cf1
    Starting 2560 threads
    Jobs: 2367 (f=29896): 
    [_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0 
    IOPS][eta 00m:00s]
    randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23 
    15:47:17 2017
        read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
         clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
          lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
         clat percentiles (usec):
          |  1.00th=[  114],  5.00th=[  135], 10.00th=[  149], 20.00th=[  171],
          | 30.00th=[  191], 40.00th=[  213], 50.00th=[  239], 60.00th=[  270],
          | 70.00th=[  314], 80.00th=[  378], 90.00th=[  556], 95.00th=[  980],
          | 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
          | 99.99th=[ 8096]
         lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
         lat (usec) : 1000=1.79%
         lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
       cpu          : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
       IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
     >=64=0.0%
          submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
          latency   : target=0, window=0, percentile=100.00%, depth=1
    
    Run status group 0 (all jobs):
        READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s 
    (28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec
    
    Disk stats (read/write):
       nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0, 
    in_queue=14802400, util=100.00%
       nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0, 
    in_queue=15101276, util=100.00%
       nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0, 
    in_queue=12053112, util=100.00%
       nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0, 
    in_queue=11135004, util=100.00%
       nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0, 
    in_queue=21079576, util=100.00%
       nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0, 
    in_queue=19393024, util=100.00%
       nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0, 
    in_queue=20140104, util=100.00%
       nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0, 
    in_queue=21090048, util=100.00%
       nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0, 
    in_queue=14929172, util=100.00%
       nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0, 
    in_queue=13919288, util=100.00%
       nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0, 
    in_queue=11390392, util=100.00%
       nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0, 
    in_queue=20110288, util=100.00%
       nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0, 
    in_queue=11683568, util=100.00%
       nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0, 
    in_queue=16314628, util=100.00%
       nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0, 
    in_queue=27659920, util=100.00%
       nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0, 
    in_queue=17910636, util=100.00%
    
    
    1.6 millions IOPS on Linux MD over 16 NVMe devices
    ==================================================
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
    /opt/fio/bin/fio postgresql_storage_workload.fio
    randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, 
    ioengine=sync, iodepth=1
    ...
    fio-2.17-17-g9cf1
    Starting 2560 threads
    Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0 
    IOPS][eta 00m:00s]
    randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23 
    17:21:15 2017
        read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
         clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
          lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
         clat percentiles (usec):
          |  1.00th=[   78],  5.00th=[   84], 10.00th=[   86], 20.00th=[   89],
          | 30.00th=[   95], 40.00th=[  102], 50.00th=[  105], 60.00th=[  108],
          | 70.00th=[  118], 80.00th=[  133], 90.00th=[  173], 95.00th=[  221],
          | 99.00th=[  358], 99.50th=[  506], 99.90th=[ 2192], 99.95th=[ 2608],
          | 99.99th=[ 2960]
         lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
         lat (usec) : 1000=0.07%
         lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
       cpu          : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
       IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
     >=64=0.0%
          submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
          latency   : target=0, window=0, percentile=100.00%, depth=1
    
    Run status group 0 (all jobs):
        READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s 
    (6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec
    
    Disk stats (read/write):
         md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
    aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0, 
    aggrin_queue=1247601, aggrutil=100.00%
       nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0, 
    in_queue=1225896, util=100.00%
       nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0, 
    in_queue=1191452, util=100.00%
       nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0, 
    in_queue=1296728, util=100.00%
       nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0, 
    in_queue=1239808, util=100.00%
       nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0, 
    in_queue=1272916, util=100.00%
       nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0, 
    in_queue=1178360, util=100.00%
       nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0, 
    in_queue=1207808, util=100.00%
       nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0, 
    in_queue=1258956, util=100.00%
       nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0, 
    in_queue=1304536, util=100.00%
       nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0, 
    in_queue=1281952, util=100.00%
       nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0, 
    in_queue=1271820, util=100.00%
       nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0, 
    in_queue=1224192, util=100.00%
       nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0, 
    in_queue=1214240, util=100.00%
       nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0, 
    in_queue=1242372, util=100.00%
       nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0, 
    in_queue=1277600, util=100.00%
       nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0, 
    in_queue=1272988, util=100.00%
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found]     ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
@ 2017-01-23 18:33       ` Tobias Oberstein
  2017-01-23 19:10         ` Kudryavtsev, Andrey O
                           ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:33 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio

> You're just running a huge number of threads against the same md device and
> bottleneck on some internal lock. If you step back and set up, say, 256

Ah, alright. Shit.

> threads with ioengine=libaio, qd=128 (to match the in-flight I/O number),
> you'd likely see the locking impact reduced substantially.

The problem with using libaio and QD>1 is that it doesn't represent the 
workload I am optimizing for.

The workload is PostgreSQL, which does all its IO as regular 
reads/writes, hence the use of ioengine=sync with large thread counts.

Note: we have an internal tool that is able to parallelize PostgreSQL 
via database sessions.

--

I tried anyway. Here is what I get with engine=libaio (results down below):
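
The effective parameters, reconstructed from the outputs below, were 
roughly:

[global]
filename=/dev/md1
ioengine=libaio
iodepth=128
thread=1
direct=1          # assumed, as in the sync job file above
bs=4k
time_based=1
runtime=120

[randread]
rw=randread
numjobs=8         # 16 for B), 32 for C)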

A)
QD=128 and jobs=8 (same effective IO concurrency as previously = 1024)

iops=200184

The IOPS stay constant during the run (120s).

B)
QD=128 and jobs=16 (effective concurrency = 2048)

iops=1068.7K

But, but:

The IOPS slowly go up to over 5 million, then collapse to around 20k, and 
then go up again. Very strange.

C)
QD=128 and jobs=32 (effective concurrency = 4096)

FIO claims: iops=2135.9K

Which is still 3.5x lower than what I get with the sync engine and 2800 
threads!

Plus: that strange behavior over run time .. IOPS go up to 10M:

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-29-13-ZEyCVcKZ.1485196199.png

and then the collapse to 0 IOPS:

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-30-20-GEEEQR6f.1485196243.png

at which point the NVMes don't show any load (I am watching them in 
another window).

===

libaio is nowhere near what I get with engine=sync and high job counts. 
Mmh. Plus the strange behavior.

And as said, that doesn't represent my workload anyway.

I want to stay away from AIO ..

Cheers,
/Tobias


A)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=128
...
fio-2.1.11
Starting 8 threads
Jobs: 1 (f=1): [_(2),r(1),_(5)] [38.3% done] [0KB/0KB/0KB /s] [0/0/0 
iops] [eta 03m:23s]
randread: (groupid=0, jobs=8): err= 0: pid=1994: Mon Jan 23 19:23:23 2017
   read : io=93837MB, bw=800739KB/s, iops=200184, runt=120001msec
     slat (usec): min=0, max=4291, avg=39.28, stdev=76.95
     clat (usec): min=2, max=22205, avg=5075.21, stdev=3646.18
      lat (usec): min=5, max=22333, avg=5114.55, stdev=3674.10
     clat percentiles (usec):
      |  1.00th=[  916],  5.00th=[ 1224], 10.00th=[ 1448], 20.00th=[ 1864],
      | 30.00th=[ 2320], 40.00th=[ 2960], 50.00th=[ 3920], 60.00th=[ 5024],
      | 70.00th=[ 6368], 80.00th=[ 8384], 90.00th=[10944], 95.00th=[12608],
      | 99.00th=[14272], 99.50th=[15168], 99.90th=[16768], 99.95th=[17536],
      | 99.99th=[18816]
     bw (KB  /s): min=33088, max=400688, per=12.35%, avg=98898.47, 
stdev=76253.23
     lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
     lat (usec) : 250=0.01%, 500=0.01%, 750=0.22%, 1000=1.48%
     lat (msec) : 2=21.67%, 4=27.51%, 10=35.37%, 20=13.74%, 50=0.01%
   cpu          : usr=1.53%, sys=13.53%, ctx=7504182, majf=0, minf=1032
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
 >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.1%
      issued    : total=r=24022368/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
    READ: io=93837MB, aggrb=800738KB/s, minb=800738KB/s, 
maxb=800738KB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
     md1: ios=7485313/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=468407/0, aggrmerge=0/0, aggrticks=51834/0, aggrin_queue=51770, 
aggrutil=35.00%
   nvme15n1: ios=468133/0, merge=0/0, ticks=52628/0, in_queue=52532, 
util=34.39%
   nvme6n1: ios=468355/0, merge=0/0, ticks=48944/0, in_queue=48840, 
util=32.34%
   nvme9n1: ios=468561/0, merge=0/0, ticks=53924/0, in_queue=53956, 
util=35.00%
   nvme11n1: ios=468354/0, merge=0/0, ticks=53424/0, in_queue=53396, 
util=34.70%
   nvme2n1: ios=468418/0, merge=0/0, ticks=51536/0, in_queue=51496, 
util=33.63%
   nvme14n1: ios=468669/0, merge=0/0, ticks=51696/0, in_queue=51576, 
util=33.84%
   nvme5n1: ios=468526/0, merge=0/0, ticks=50004/0, in_queue=49928, 
util=33.00%
   nvme8n1: ios=468233/0, merge=0/0, ticks=52232/0, in_queue=52140, 
util=33.82%
   nvme10n1: ios=468501/0, merge=0/0, ticks=52532/0, in_queue=52416, 
util=34.29%
   nvme1n1: ios=468434/0, merge=0/0, ticks=53492/0, in_queue=53404, 
util=34.58%
   nvme13n1: ios=468544/0, merge=0/0, ticks=51876/0, in_queue=51860, 
util=33.85%
   nvme4n1: ios=468513/0, merge=0/0, ticks=51172/0, in_queue=51176, 
util=33.30%
   nvme7n1: ios=468245/0, merge=0/0, ticks=50564/0, in_queue=50484, 
util=33.14%
   nvme0n1: ios=468318/0, merge=0/0, ticks=49812/0, in_queue=49760, 
util=32.67%
   nvme12n1: ios=468279/0, merge=0/0, ticks=52416/0, in_queue=52344, 
util=34.17%
   nvme3n1: ios=468442/0, merge=0/0, ticks=53092/0, in_queue=53016, 
util=34.37%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


B)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=128
...
fio-2.1.11
Starting 16 threads
Jobs: 1 (f=1): [_(15),r(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] 
[eta 00m:00s]
randread: (groupid=0, jobs=16): err= 0: pid=2141: Mon Jan 23 19:27:38 2017
   read : io=500942MB, bw=4174.5MB/s, iops=1068.7K, runt=120001msec
     slat (usec): min=0, max=3647, avg=11.07, stdev=37.60
     clat (usec): min=2, max=19872, avg=1475.65, stdev=2510.83
      lat (usec): min=4, max=19964, avg=1486.76, stdev=2530.31
     clat percentiles (usec):
      |  1.00th=[  334],  5.00th=[  346], 10.00th=[  358], 20.00th=[  362],
      | 30.00th=[  370], 40.00th=[  378], 50.00th=[  398], 60.00th=[  494],
      | 70.00th=[  780], 80.00th=[ 1480], 90.00th=[ 4256], 95.00th=[ 8032],
      | 99.00th=[12096], 99.50th=[12736], 99.90th=[14272], 99.95th=[14912],
      | 99.99th=[16512]
     bw (KB  /s): min=    0, max=1512848, per=8.04%, avg=343481.50, 
stdev=460791.59
     lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
     lat (usec) : 250=0.01%, 500=60.27%, 750=8.95%, 1000=4.94%
     lat (msec) : 2=9.33%, 4=5.98%, 10=7.89%, 20=2.63%
   cpu          : usr=3.19%, sys=44.95%, ctx=9452424, majf=0, minf=2064
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
 >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.1%
      issued    : total=r=128241193/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
    READ: io=500942MB, aggrb=4174.5MB/s, minb=4174.5MB/s, 
maxb=4174.5MB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
     md1: ios=9392258/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=588533/0, aggrmerge=0/0, aggrticks=63464/0, aggrin_queue=63476, 
aggrutil=36.40%
   nvme15n1: ios=588661/0, merge=0/0, ticks=66932/0, in_queue=66824, 
util=36.40%
   nvme6n1: ios=589278/0, merge=0/0, ticks=60768/0, in_queue=60600, 
util=34.84%
   nvme9n1: ios=588744/0, merge=0/0, ticks=64344/0, in_queue=64480, 
util=35.85%
   nvme11n1: ios=588005/0, merge=0/0, ticks=65636/0, in_queue=65828, 
util=36.02%
   nvme2n1: ios=588097/0, merge=0/0, ticks=62296/0, in_queue=62440, 
util=35.00%
   nvme14n1: ios=588451/0, merge=0/0, ticks=64480/0, in_queue=64408, 
util=35.87%
   nvme5n1: ios=588654/0, merge=0/0, ticks=60736/0, in_queue=60704, 
util=34.66%
   nvme8n1: ios=588843/0, merge=0/0, ticks=63980/0, in_queue=63928, 
util=35.40%
   nvme10n1: ios=588315/0, merge=0/0, ticks=62436/0, in_queue=62432, 
util=35.15%
   nvme1n1: ios=588327/0, merge=0/0, ticks=64432/0, in_queue=64564, 
util=36.10%
   nvme13n1: ios=588342/0, merge=0/0, ticks=65856/0, in_queue=65892, 
util=36.06%
   nvme4n1: ios=588343/0, merge=0/0, ticks=64528/0, in_queue=64752, 
util=35.73%
   nvme7n1: ios=589243/0, merge=0/0, ticks=63740/0, in_queue=63696, 
util=35.34%
   nvme0n1: ios=588499/0, merge=0/0, ticks=61308/0, in_queue=61268, 
util=34.83%
   nvme12n1: ios=588221/0, merge=0/0, ticks=62076/0, in_queue=61976, 
util=35.19%
   nvme3n1: ios=588512/0, merge=0/0, ticks=61880/0, in_queue=61824, 
util=35.09%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


C)

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=128
...
fio-2.1.11
Starting 32 threads
Jobs: 1 (f=0): [_(24),r(1),_(7)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 
iops] [eta 00m:00s]
randread: (groupid=0, jobs=32): err= 0: pid=2263: Mon Jan 23 19:30:49 2017
   read : io=977.76GB, bw=8343.4MB/s, iops=2135.9K, runt=120001msec
     slat (usec): min=0, max=3372, avg= 7.30, stdev=27.48
     clat (usec): min=1, max=21871, avg=997.26, stdev=1995.10
      lat (usec): min=4, max=21982, avg=1004.60, stdev=2010.61
     clat percentiles (usec):
      |  1.00th=[  374],  5.00th=[  378], 10.00th=[  378], 20.00th=[  386],
      | 30.00th=[  390], 40.00th=[  394], 50.00th=[  394], 60.00th=[  398],
      | 70.00th=[  406], 80.00th=[  540], 90.00th=[ 1496], 95.00th=[ 5408],
      | 99.00th=[10944], 99.50th=[12224], 99.90th=[14016], 99.95th=[14784],
      | 99.99th=[16512]
     bw (KB  /s): min=    0, max=1353208, per=5.91%, avg=505187.96, 
stdev=549388.79
     lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (usec) : 100=0.01%, 250=0.01%, 500=78.69%, 750=5.80%, 1000=2.94%
     lat (msec) : 2=3.84%, 4=2.66%, 10=4.52%, 20=1.56%, 50=0.01%
   cpu          : usr=3.09%, sys=68.19%, ctx=10916103, majf=0, minf=4128
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
 >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.1%
      issued    : total=r=256309234/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
    READ: io=977.76GB, aggrb=8343.4MB/s, minb=8343.4MB/s, 
maxb=8343.4MB/s, mint=120001msec, maxt=120001msec

Disk stats (read/write):
     md1: ios=10762806/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=675866/0, aggrmerge=0/0, aggrticks=70332/0, aggrin_queue=70505, 
aggrutil=28.65%
   nvme15n1: ios=675832/0, merge=0/0, ticks=69604/0, in_queue=69648, 
util=27.82%
   nvme6n1: ios=676181/0, merge=0/0, ticks=75584/0, in_queue=75552, 
util=28.65%
   nvme9n1: ios=675762/0, merge=0/0, ticks=67916/0, in_queue=68236, 
util=27.79%
   nvme11n1: ios=675745/0, merge=0/0, ticks=68296/0, in_queue=68804, 
util=27.66%
   nvme2n1: ios=676036/0, merge=0/0, ticks=70904/0, in_queue=71240, 
util=28.14%
   nvme14n1: ios=675737/0, merge=0/0, ticks=71560/0, in_queue=71716, 
util=28.13%
   nvme5n1: ios=676592/0, merge=0/0, ticks=71832/0, in_queue=71976, 
util=28.02%
   nvme8n1: ios=675969/0, merge=0/0, ticks=69152/0, in_queue=69192, 
util=27.63%
   nvme10n1: ios=675607/0, merge=0/0, ticks=67600/0, in_queue=67668, 
util=27.74%
   nvme1n1: ios=675528/0, merge=0/0, ticks=72856/0, in_queue=73136, 
util=28.48%
   nvme13n1: ios=675189/0, merge=0/0, ticks=69736/0, in_queue=70084, 
util=28.04%
   nvme4n1: ios=676117/0, merge=0/0, ticks=68120/0, in_queue=68600, 
util=27.88%
   nvme7n1: ios=675726/0, merge=0/0, ticks=72004/0, in_queue=71960, 
util=28.25%
   nvme0n1: ios=676119/0, merge=0/0, ticks=71228/0, in_queue=71264, 
util=28.12%
   nvme12n1: ios=675837/0, merge=0/0, ticks=70320/0, in_queue=70368, 
util=27.99%
   nvme3n1: ios=675887/0, merge=0/0, ticks=68600/0, in_queue=68636, 
util=27.95%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 18:18 ` Kudryavtsev, Andrey O
@ 2017-01-23 18:53   ` Tobias Oberstein
  2017-01-23 19:06     ` Kudryavtsev, Andrey O
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 18:53 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, fio

Hi Andrey,

thanks for your tips!

On 23.01.2017 at 19:18, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> MDRAID overhead is always there, but you can play with some tuning knobs.
> I recommend following:
> 1. You must use many thread/job with quite high QD configuration. Highest IOPS for Intel P3xxx drives achieved if you saturate them with 128 *4k IO per drive. This can be done in 32 jobs and QD4 or 16J/8QD and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So, I think currently the problem, that you’re simply not submitting enough IOs.

I get nearly 7 million random 4k IOPS with engine=sync and threads=2800 
on the 16 logical NVMe block devices (from 8 physical P3608 4TB).

The values I get with libaio are much lower (see my other reply).

My concrete problem is: I can't get these 7 million IOPS through MD 
(striped over all 16 NVMe logical devices) .. MD hits a wall at 1.6 million.

Note: I also tried LVM striped volumes: sluggish performance, and much 
higher system load.
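
For reference, the striped-LV variant looks roughly like this (volume 
names and size here are placeholders):

sudo pvcreate /dev/nvme{0..15}n1
sudo vgcreate vg_nvme /dev/nvme{0..15}n1
sudo lvcreate -i 16 -I 8 -L 1T -n lv_stripe vg_nvme   # 16 stripes, 8 KiB stripe size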

> 2. changing a HW SSD sector size to 4k may also help if you’re sure that your workload is always 4k granular

Background: my workload is 100% 8kB and current results are here

https://github.com/oberstet/scratchbox/raw/master/cruncher/sql19/Performance%20Results%20-%20NVMe%20Scaling%20with%20IO%20Concurrency.pdf

The sector size on the NVMes currently is

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo isdct show -a 
-intelssd 0 | grep SectorSize
SectorSize : 512

Do you recommend changing that in my case?

> 3. and finally using “imsm” MDRAID extensions and latest MDADM build.

What is imsm?

Is that "Intel Matrix Storage Array"?

Is that fully open-source and in-tree kernel?

If not, I won't use it anyway, sorry, company policy.

We're running Debian 8 / Kernel 4.8 from backports (and soonish Debian 9).

> See some other hints there:
> http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
>
> some config examples for NVMe are here:
> https://github.com/01org/fiovisualizer/tree/master/Workloads
>
>

What's your platform?

E.g. on Windows, async IO is awesome. On *nix .. not so much. At least in 
my experience.

And then, my target workload (PostgreSQL) isn't doing AIO at all ..

Cheers,
/Tobias


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 18:53   ` Tobias Oberstein
@ 2017-01-23 19:06     ` Kudryavtsev, Andrey O
  2017-01-24  9:46       ` Tobias Oberstein
                         ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:06 UTC (permalink / raw)
  To: Tobias Oberstein, fio

Hi Tobias, 
Yes, “imsm” is in the generic release; you don’t need to go to the latest or a special build if you want to stay compliant. It’s mainly a different layout of the RAID metadata. 
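
For reference, an imsm setup is a two-step create (a container, then a 
volume inside it); the device names here are illustrative, and whether 
the platform actually supports imsm metadata on these NVMe devices is 
something to verify first:

sudo mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=16 /dev/nvme{0..15}n1
sudo mdadm --create /dev/md/vol0 --level=0 --raid-devices=16 /dev/md/imsm0   # chunk choices may be more restricted than with native metadata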

Your findings match my expectations; at QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?
Most of the time I’m running CentOS 7, either with the 3.10 or the latest kernel, depending on the scope of the testing. 

Changing the sector size to 4k is easy and can really help; see the DCT manual, it’s there. 
This may be relevant for you: https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
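
With stock nvme-cli the equivalent is roughly the following (destructive, 
it wipes the namespace, and the 4k LBA format index varies by model, so 
check id-ns first):

sudo nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"   # find the index whose data size is 4096 bytes
sudo nvme format /dev/nvme0n1 --lbaf=3 --ses=0        # example index only; use the 4096-byte one reported above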


-- 
Andrey Kudryavtsev, 

SSD Solution Architect
Intel Corp. 
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 10:53 AM, "Tobias Oberstein" <tobias.oberstein@gmail.com> wrote:

    Hi Andrey,
    
    thanks for your tips!
    
    On 23.01.2017 at 19:18, Kudryavtsev, Andrey O wrote:
    > Hi Tobias,
    > MDRAID overhead is always there, but you can play with some tuning knobs.
    > I recommend following:
    > 1. You must use many thread/job with quite high QD configuration. Highest IOPS for Intel P3xxx drives achieved if you saturate them with 128 *4k IO per drive. This can be done in 32 jobs and QD4 or 16J/8QD and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So, I think currently the problem, that you’re simply not submitting enough IOs.
    
    I get nearly 7 mio random 4k IOPS with engine=sync and threads=2800 on 
    the 16 logical NVMe block devices (from 8 physical P3608 4TB).
    
    The values I get with libaio are much lower (see my other reply).
    
    My concrete problem is: I can't get these 7 mio IOPS through MD (striped 
    over all 16 NVMe logical devices) .. MD hits a wall at 1.6 mio
    
    Note: I also tried LVM striped volumes. Sluggish perf., much higher 
    system load.
    
    > 2. changing a HW SSD sector size to 4k may also help if you’re sure that your workload is always 4k granular
    
    Background: my workload is 100% 8kB and current results are here
    
    https://github.com/oberstet/scratchbox/raw/master/cruncher/sql19/Performance%20Results%20-%20NVMe%20Scaling%20with%20IO%20Concurrency.pdf
    
    The sector size on the NVMes currently is
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo isdct show -a 
    -intelssd 0 | grep SectorSize
    SectorSize : 512
    
    Do you recommend changing that in my case?
    
    > 3. and finally using “imsm” MDRAID extensions and latest MDADM build.
    
    What is imsm?
    
    Is that "Intel Matrix Storage Array"?
    
    Is that fully open-source and in-tree kernel?
    
    If not, I won't use it anyway, sorry, company policy.
    
    We're running Debian 8 / Kernel 4.8 from backports (and soonish Debian 9).
    
    > See some other hints there:
    > http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives
    >
    > some config examples for NVMe are here:
    > https://github.com/01org/fiovisualizer/tree/master/Workloads
    >
    >
    
    What's your platform?
    
    Eg on Windows, async IO is awesome. On *nix .. not. At least in my 
    experience.
    
    And then, my target workload (PostgreSQL) isn't doing AIO at all ..
    
    Cheers,
    /Tobias
    
    


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 18:33       ` Tobias Oberstein
@ 2017-01-23 19:10         ` Kudryavtsev, Andrey O
  2017-01-23 19:26           ` Tobias Oberstein
  2017-01-23 19:13         ` Sitsofe Wheeler
       [not found]         ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
  2 siblings, 1 reply; 27+ messages in thread
From: Kudryavtsev, Andrey O @ 2017-01-23 19:10 UTC (permalink / raw)
  To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio

Tobias,
I’d try 128 jobs, QD 32, and disabling the random map and latency 
measurements:
       randrepeat=0
       norandommap
       disable_lat
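
Spelled out as a job file, roughly (untested sketch):

[global]
filename=/dev/md1
ioengine=libaio
iodepth=32
direct=1
bs=4k
randrepeat=0
norandommap
disable_lat=1
group_reporting
time_based=1
runtime=120

[randread-md]
rw=randread
numjobs=128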

-- 
Andrey Kudryavtsev, 

SSD Solution Architect
Intel Corp. 
inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 10:33 AM, "fio-owner@vger.kernel.org on behalf of Tobias Oberstein" <fio-owner@vger.kernel.org on behalf of tobias.oberstein@gmail.com> wrote:

    > You're just running a huge number of threads against the same md device and
    > bottleneck on some internal lock. If you step back and set up, say, 256
    
    Ah, alright. Shit.
    
    > threads with ioengine=libaio, qd=128 (to match the in-flight I/O number),
    > you'd likely see the locking impact reduced substantially.
    
    The problem with using libaio and QD>1 is: that doesn't represent the 
    workload I am optimizing for.
    
    The workload is PostgreSQL, and that is doing all it's IO as regular 
    read/writes, and hence the use of ioengine=sync with large thread counts.
    
    Note: we have an internal tool that is able to parallelize PostgreSQL 
    via database sessions.
    
    --
    
    I tried anyway. Here is what I get with engine=libaio (results down below):
    
    A)
    QD=128 and jobs=8 (same effective IO concurrency as previously = 1024)
    
    iops=200184
    
    The IOPS stay constant during the run (120s).
    
    B)
    QD=128 and jobs=16 (effective concurrency = 2048)
    
    iops=1068.7K
    
    But, but:
    
    The IOPS slowly go up to over 5 mio, then collapses to like 20k, and 
    then go up again. Very strange.
    
    C)
    QD=128 and jobs=32 (effective concurrency = 4096)
    
    FIO claims: iops=2135.9K
    
    Which is still 3.5x lower than what I get with the sync engine and 2800 
    threads!
    
    Plus: that strange behavior over run time .. IOPS go up to 10M:
    
    http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-29-13-ZEyCVcKZ.1485196199.png
    
    and the collapse to 0 IOPS
    
    http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_19-30-20-GEEEQR6f.1485196243.png
    
    at which the NVMes don't show any load (I am watching them in another 
    window).
    
    ===
    
    libaio is nowhere near what I get with engine=sync and high job counts. 
    Mmh. Plus the strange behavior.
    
    And as said, that doesn't represent my workload anyways.
    
    I want to stay away from AIO ..
    
    Cheers,
    /Tobias
    
    
    A)
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
    postgresql_storage_workload.fio
    randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
    iodepth=128
    ...
    fio-2.1.11
    Starting 8 threads
    Jobs: 1 (f=1): [_(2),r(1),_(5)] [38.3% done] [0KB/0KB/0KB /s] [0/0/0 
    iops] [eta 03m:23s]
    randread: (groupid=0, jobs=8): err= 0: pid=1994: Mon Jan 23 19:23:23 2017
       read : io=93837MB, bw=800739KB/s, iops=200184, runt=120001msec
         slat (usec): min=0, max=4291, avg=39.28, stdev=76.95
         clat (usec): min=2, max=22205, avg=5075.21, stdev=3646.18
          lat (usec): min=5, max=22333, avg=5114.55, stdev=3674.10
         clat percentiles (usec):
          |  1.00th=[  916],  5.00th=[ 1224], 10.00th=[ 1448], 20.00th=[ 1864],
          | 30.00th=[ 2320], 40.00th=[ 2960], 50.00th=[ 3920], 60.00th=[ 5024],
          | 70.00th=[ 6368], 80.00th=[ 8384], 90.00th=[10944], 95.00th=[12608],
          | 99.00th=[14272], 99.50th=[15168], 99.90th=[16768], 99.95th=[17536],
          | 99.99th=[18816]
         bw (KB  /s): min=33088, max=400688, per=12.35%, avg=98898.47, 
    stdev=76253.23
         lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
         lat (usec) : 250=0.01%, 500=0.01%, 750=0.22%, 1000=1.48%
         lat (msec) : 2=21.67%, 4=27.51%, 10=35.37%, 20=13.74%, 50=0.01%
       cpu          : usr=1.53%, sys=13.53%, ctx=7504182, majf=0, minf=1032
       IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
     >=64=100.0%
          submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.1%
          issued    : total=r=24022368/w=0/d=0, short=r=0/w=0/d=0
          latency   : target=0, window=0, percentile=100.00%, depth=128
    
    Run status group 0 (all jobs):
        READ: io=93837MB, aggrb=800738KB/s, minb=800738KB/s, 
    maxb=800738KB/s, mint=120001msec, maxt=120001msec
    
    Disk stats (read/write):
         md1: ios=7485313/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
    aggrios=468407/0, aggrmerge=0/0, aggrticks=51834/0, aggrin_queue=51770, 
    aggrutil=35.00%
       nvme15n1: ios=468133/0, merge=0/0, ticks=52628/0, in_queue=52532, 
    util=34.39%
       nvme6n1: ios=468355/0, merge=0/0, ticks=48944/0, in_queue=48840, 
    util=32.34%
       nvme9n1: ios=468561/0, merge=0/0, ticks=53924/0, in_queue=53956, 
    util=35.00%
       nvme11n1: ios=468354/0, merge=0/0, ticks=53424/0, in_queue=53396, 
    util=34.70%
       nvme2n1: ios=468418/0, merge=0/0, ticks=51536/0, in_queue=51496, 
    util=33.63%
       nvme14n1: ios=468669/0, merge=0/0, ticks=51696/0, in_queue=51576, 
    util=33.84%
       nvme5n1: ios=468526/0, merge=0/0, ticks=50004/0, in_queue=49928, 
    util=33.00%
       nvme8n1: ios=468233/0, merge=0/0, ticks=52232/0, in_queue=52140, 
    util=33.82%
       nvme10n1: ios=468501/0, merge=0/0, ticks=52532/0, in_queue=52416, 
    util=34.29%
       nvme1n1: ios=468434/0, merge=0/0, ticks=53492/0, in_queue=53404, 
    util=34.58%
       nvme13n1: ios=468544/0, merge=0/0, ticks=51876/0, in_queue=51860, 
    util=33.85%
       nvme4n1: ios=468513/0, merge=0/0, ticks=51172/0, in_queue=51176, 
    util=33.30%
       nvme7n1: ios=468245/0, merge=0/0, ticks=50564/0, in_queue=50484, 
    util=33.14%
       nvme0n1: ios=468318/0, merge=0/0, ticks=49812/0, in_queue=49760, 
    util=32.67%
       nvme12n1: ios=468279/0, merge=0/0, ticks=52416/0, in_queue=52344, 
    util=34.17%
       nvme3n1: ios=468442/0, merge=0/0, ticks=53092/0, in_queue=53016, 
    util=34.37%
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
    
    
    B)
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
    postgresql_storage_workload.fio
    randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
    iodepth=128
    ...
    fio-2.1.11
    Starting 16 threads
    Jobs: 1 (f=1): [_(15),r(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] 
    [eta 00m:00s]
    randread: (groupid=0, jobs=16): err= 0: pid=2141: Mon Jan 23 19:27:38 2017
       read : io=500942MB, bw=4174.5MB/s, iops=1068.7K, runt=120001msec
         slat (usec): min=0, max=3647, avg=11.07, stdev=37.60
         clat (usec): min=2, max=19872, avg=1475.65, stdev=2510.83
          lat (usec): min=4, max=19964, avg=1486.76, stdev=2530.31
         clat percentiles (usec):
          |  1.00th=[  334],  5.00th=[  346], 10.00th=[  358], 20.00th=[  362],
          | 30.00th=[  370], 40.00th=[  378], 50.00th=[  398], 60.00th=[  494],
          | 70.00th=[  780], 80.00th=[ 1480], 90.00th=[ 4256], 95.00th=[ 8032],
          | 99.00th=[12096], 99.50th=[12736], 99.90th=[14272], 99.95th=[14912],
          | 99.99th=[16512]
         bw (KB  /s): min=    0, max=1512848, per=8.04%, avg=343481.50, 
    stdev=460791.59
         lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
         lat (usec) : 250=0.01%, 500=60.27%, 750=8.95%, 1000=4.94%
         lat (msec) : 2=9.33%, 4=5.98%, 10=7.89%, 20=2.63%
       cpu          : usr=3.19%, sys=44.95%, ctx=9452424, majf=0, minf=2064
       IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
     >=64=100.0%
          submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.1%
          issued    : total=r=128241193/w=0/d=0, short=r=0/w=0/d=0
          latency   : target=0, window=0, percentile=100.00%, depth=128
    
    Run status group 0 (all jobs):
        READ: io=500942MB, aggrb=4174.5MB/s, minb=4174.5MB/s, 
    maxb=4174.5MB/s, mint=120001msec, maxt=120001msec
    
    Disk stats (read/write):
         md1: ios=9392258/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
    aggrios=588533/0, aggrmerge=0/0, aggrticks=63464/0, aggrin_queue=63476, 
    aggrutil=36.40%
       nvme15n1: ios=588661/0, merge=0/0, ticks=66932/0, in_queue=66824, 
    util=36.40%
       nvme6n1: ios=589278/0, merge=0/0, ticks=60768/0, in_queue=60600, 
    util=34.84%
       nvme9n1: ios=588744/0, merge=0/0, ticks=64344/0, in_queue=64480, 
    util=35.85%
       nvme11n1: ios=588005/0, merge=0/0, ticks=65636/0, in_queue=65828, 
    util=36.02%
       nvme2n1: ios=588097/0, merge=0/0, ticks=62296/0, in_queue=62440, 
    util=35.00%
       nvme14n1: ios=588451/0, merge=0/0, ticks=64480/0, in_queue=64408, 
    util=35.87%
       nvme5n1: ios=588654/0, merge=0/0, ticks=60736/0, in_queue=60704, 
    util=34.66%
       nvme8n1: ios=588843/0, merge=0/0, ticks=63980/0, in_queue=63928, 
    util=35.40%
       nvme10n1: ios=588315/0, merge=0/0, ticks=62436/0, in_queue=62432, 
    util=35.15%
       nvme1n1: ios=588327/0, merge=0/0, ticks=64432/0, in_queue=64564, 
    util=36.10%
       nvme13n1: ios=588342/0, merge=0/0, ticks=65856/0, in_queue=65892, 
    util=36.06%
       nvme4n1: ios=588343/0, merge=0/0, ticks=64528/0, in_queue=64752, 
    util=35.73%
       nvme7n1: ios=589243/0, merge=0/0, ticks=63740/0, in_queue=63696, 
    util=35.34%
       nvme0n1: ios=588499/0, merge=0/0, ticks=61308/0, in_queue=61268, 
    util=34.83%
       nvme12n1: ios=588221/0, merge=0/0, ticks=62076/0, in_queue=61976, 
    util=35.19%
       nvme3n1: ios=588512/0, merge=0/0, ticks=61880/0, in_queue=61824, 
    util=35.09%
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
    
    
    C)
    
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
    postgresql_storage_workload.fio
    randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
    iodepth=128
    ...
    fio-2.1.11
    Starting 32 threads
    Jobs: 1 (f=0): [_(24),r(1),_(7)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 
    iops] [eta 00m:00s]
    randread: (groupid=0, jobs=32): err= 0: pid=2263: Mon Jan 23 19:30:49 2017
       read : io=977.76GB, bw=8343.4MB/s, iops=2135.9K, runt=120001msec
         slat (usec): min=0, max=3372, avg= 7.30, stdev=27.48
         clat (usec): min=1, max=21871, avg=997.26, stdev=1995.10
          lat (usec): min=4, max=21982, avg=1004.60, stdev=2010.61
         clat percentiles (usec):
          |  1.00th=[  374],  5.00th=[  378], 10.00th=[  378], 20.00th=[  386],
          | 30.00th=[  390], 40.00th=[  394], 50.00th=[  394], 60.00th=[  398],
          | 70.00th=[  406], 80.00th=[  540], 90.00th=[ 1496], 95.00th=[ 5408],
          | 99.00th=[10944], 99.50th=[12224], 99.90th=[14016], 99.95th=[14784],
          | 99.99th=[16512]
         bw (KB  /s): min=    0, max=1353208, per=5.91%, avg=505187.96, 
    stdev=549388.79
         lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
         lat (usec) : 100=0.01%, 250=0.01%, 500=78.69%, 750=5.80%, 1000=2.94%
         lat (msec) : 2=3.84%, 4=2.66%, 10=4.52%, 20=1.56%, 50=0.01%
       cpu          : usr=3.09%, sys=68.19%, ctx=10916103, majf=0, minf=4128
       IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
     >=64=100.0%
          submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.0%
          complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
     >=64=0.1%
          issued    : total=r=256309234/w=0/d=0, short=r=0/w=0/d=0
          latency   : target=0, window=0, percentile=100.00%, depth=128
    
    Run status group 0 (all jobs):
        READ: io=977.76GB, aggrb=8343.4MB/s, minb=8343.4MB/s, 
    maxb=8343.4MB/s, mint=120001msec, maxt=120001msec
    
    Disk stats (read/write):
         md1: ios=10762806/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
    aggrios=675866/0, aggrmerge=0/0, aggrticks=70332/0, aggrin_queue=70505, 
    aggrutil=28.65%
       nvme15n1: ios=675832/0, merge=0/0, ticks=69604/0, in_queue=69648, 
    util=27.82%
       nvme6n1: ios=676181/0, merge=0/0, ticks=75584/0, in_queue=75552, 
    util=28.65%
       nvme9n1: ios=675762/0, merge=0/0, ticks=67916/0, in_queue=68236, 
    util=27.79%
       nvme11n1: ios=675745/0, merge=0/0, ticks=68296/0, in_queue=68804, 
    util=27.66%
       nvme2n1: ios=676036/0, merge=0/0, ticks=70904/0, in_queue=71240, 
    util=28.14%
       nvme14n1: ios=675737/0, merge=0/0, ticks=71560/0, in_queue=71716, 
    util=28.13%
       nvme5n1: ios=676592/0, merge=0/0, ticks=71832/0, in_queue=71976, 
    util=28.02%
       nvme8n1: ios=675969/0, merge=0/0, ticks=69152/0, in_queue=69192, 
    util=27.63%
       nvme10n1: ios=675607/0, merge=0/0, ticks=67600/0, in_queue=67668, 
    util=27.74%
       nvme1n1: ios=675528/0, merge=0/0, ticks=72856/0, in_queue=73136, 
    util=28.48%
       nvme13n1: ios=675189/0, merge=0/0, ticks=69736/0, in_queue=70084, 
    util=28.04%
       nvme4n1: ios=676117/0, merge=0/0, ticks=68120/0, in_queue=68600, 
    util=27.88%
       nvme7n1: ios=675726/0, merge=0/0, ticks=72004/0, in_queue=71960, 
    util=28.25%
       nvme0n1: ios=676119/0, merge=0/0, ticks=71228/0, in_queue=71264, 
    util=28.12%
       nvme12n1: ios=675837/0, merge=0/0, ticks=70320/0, in_queue=70368, 
    util=27.99%
       nvme3n1: ios=675887/0, merge=0/0, ticks=68600/0, in_queue=68636, 
    util=27.95%
    oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
    
    


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 18:33       ` Tobias Oberstein
  2017-01-23 19:10         ` Kudryavtsev, Andrey O
@ 2017-01-23 19:13         ` Sitsofe Wheeler
  2017-01-23 19:40           ` Tobias Oberstein
       [not found]         ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
  2 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 19:13 UTC (permalink / raw)
  To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio

On 23 January 2017 at 18:33, Tobias Oberstein
<tobias.oberstein@gmail.com> wrote:
>
> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
> Plus the strange behavior.

Have you tried batching the IOs and controlling how many you are
reaping at any one time? See
http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
for some of the options for controlling this...
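
For example, something along these lines in the [global] section might be
a starting point (the exact values are only guesses to tune from):

iodepth=64
iodepth_batch_submit=16
iodepth_batch_complete=16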

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:10         ` Kudryavtsev, Andrey O
@ 2017-01-23 19:26           ` Tobias Oberstein
  0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:26 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, Andrey Kuzmin; +Cc: fio

Hi Andrey,

On 23.01.2017 at 20:10, Kudryavtsev, Andrey O wrote:
> Tobias,
> I’d try 128 jobs, QD 32 and disable random map and latency measurements
>        randrepeat=0
>        norandommap

I had those already set ..

>        disable_lat
>

This I hadn't set.

Using the settings you suggest on the MD over 16 NVMes, and after 
increasing to

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat 
/proc/sys/fs/aio-max-nr
1048576

I get iops=4082.2K, which is much closer to the 7 million IOPS I get with 
engine=sync and jobs=2800.
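
(Raising that limit is a one-liner, e.g.

sudo sh -c 'echo "1048576" > /proc/sys/fs/aio-max-nr'

or the fs.aio-max-nr sysctl, e.g. in /etc/sysctl.conf to make it persistent.)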

Cheers,
/Tobias

PS: I am still working on your other hints .. so many tips. Thanks guys!




oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=32
...
fio-2.1.11
Starting 128 threads
Jobs: 127 (f=0): [r(51),E(1),r(76)] [3.5% done] [15018MB/0KB/0KB /s] 
[3845K/0/0 iops] [eta 14m:11s]
randread: (groupid=0, jobs=128): err= 0: pid=5878: Mon Jan 23 20:25:01 2017
   read : io=478427MB, bw=15946MB/s, iops=4082.2K, runt= 30003msec
     slat (usec): min=1, max=47954, avg=29.39, stdev=34.90
     clat (usec): min=37, max=49119, avg=972.35, stdev=673.40
     clat percentiles (usec):
      |  1.00th=[  338],  5.00th=[  446], 10.00th=[  532], 20.00th=[  660],
      | 30.00th=[  756], 40.00th=[  836], 50.00th=[  892], 60.00th=[  956],
      | 70.00th=[ 1020], 80.00th=[ 1112], 90.00th=[ 1224], 95.00th=[ 1368],
      | 99.00th=[ 4832], 99.50th=[ 5664], 99.90th=[ 6816], 99.95th=[ 7328],
      | 99.99th=[ 8896]
     bw (KB  /s): min=14024, max=393664, per=0.78%, avg=127573.83, 
stdev=51679.15
     lat (usec) : 50=0.01%, 100=0.01%, 250=0.07%, 500=8.15%, 750=21.53%
     lat (usec) : 1000=37.36%
     lat (msec) : 2=29.83%, 4=1.53%, 10=1.53%, 20=0.01%, 50=0.01%
   cpu          : usr=5.34%, sys=94.48%, ctx=11411, majf=0, minf=4224
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=122477269/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
    READ: io=478427MB, aggrb=15946MB/s, minb=15946MB/s, maxb=15946MB/s, 
mint=30003msec, maxt=30003msec

Disk stats (read/write):
     md1: ios=121675684/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=7654829/0, aggrmerge=0/0, aggrticks=985171/0, 
aggrin_queue=1037857, aggrutil=100.00%
   nvme15n1: ios=7650998/0, merge=0/0, ticks=938492/0, in_queue=968336, 
util=100.00%
   nvme6n1: ios=7655891/0, merge=0/0, ticks=1044320/0, in_queue=1074048, 
util=100.00%
   nvme9n1: ios=7654289/0, merge=0/0, ticks=954912/0, in_queue=1043060, 
util=100.00%
   nvme11n1: ios=7656494/0, merge=0/0, ticks=955896/0, in_queue=1050748, 
util=100.00%
   nvme2n1: ios=7656190/0, merge=0/0, ticks=998112/0, in_queue=1090236, 
util=100.00%
   nvme14n1: ios=7655685/0, merge=0/0, ticks=956648/0, in_queue=982168, 
util=100.00%
   nvme5n1: ios=7652531/0, merge=0/0, ticks=1040592/0, in_queue=1068920, 
util=100.00%
   nvme8n1: ios=7652934/0, merge=0/0, ticks=969800/0, in_queue=994468, 
util=100.00%
   nvme10n1: ios=7655795/0, merge=0/0, ticks=949068/0, in_queue=975252, 
util=100.00%
   nvme1n1: ios=7652373/0, merge=0/0, ticks=955772/0, in_queue=1040828, 
util=100.00%
   nvme13n1: ios=7654611/0, merge=0/0, ticks=965664/0, in_queue=1053560, 
util=100.00%
   nvme4n1: ios=7655941/0, merge=0/0, ticks=1001460/0, in_queue=1113764, 
util=100.00%
   nvme7n1: ios=7652420/0, merge=0/0, ticks=991072/0, in_queue=1018248, 
util=100.00%
   nvme0n1: ios=7656124/0, merge=0/0, ticks=1051448/0, in_queue=1083992, 
util=100.00%
   nvme12n1: ios=7656450/0, merge=0/0, ticks=1031252/0, 
in_queue=1064052, util=100.00%
   nvme3n1: ios=7658543/0, merge=0/0, ticks=958228/0, in_queue=984040, 
util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat 
postgresql_storage_workload.fio
[global]
group_reporting
#filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
filename=/dev/md1
#filename=/data/test.dat
#filename=/dev/data/data
size=30G
#ioengine=sync
#iodepth=1
ioengine=libaio
iodepth=32
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
#bs=8k
bs=4k
#ramp_time=0
runtime=30

[randread]
stonewall
rw=randread
numjobs=128

#[randwrite]
#stonewall
#rw=randwrite
#numjobs=32

#[randreadwrite7030]
#stonewall
#rw=randrw
#rwmixread=70
#numjobs=256

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:13         ` Sitsofe Wheeler
@ 2017-01-23 19:40           ` Tobias Oberstein
  2017-01-23 20:24             ` Sitsofe Wheeler
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 19:40 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio

On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:
> On 23 January 2017 at 18:33, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>>
>> libaio is nowhere near what I get with engine=sync and high job counts. Mmh.
>> Plus the strange behavior.
>
> Have you tried batching the IOs and controlling how much are you
> reaping at any one time? See
> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
> for some of the options for controlling this...
>

Thanks! Nice.

For libaio, and with all the hints applied (no 4k sectors yet), I get 
(4k randread)

Individual NVMes: iops=7350.4K
MD (RAID-0) over NVMes: iops=4112.8K

The IOPS no longer swing up and down.

It's becoming more apparent, I'd say, that there is an MD bottleneck though.

Cheers,
/Tobias


oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat best_libaio.fio
# sudo sh -c 'echo "1048576" > /proc/sys/fs/aio-max-nr'

[global]
group_reporting
size=30G
ioengine=libaio
iodepth=32
iodepth_batch_submit=8
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30

[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128

[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128


oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo fio 
best_libaio.fio
randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
ioengine=libaio, iodepth=32
...
randread-md-over-nvmes: (g=1): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
ioengine=libaio, iodepth=32
...
fio-2.1.11
Starting 256 threads
Jobs: 128 (f=128): [_(128),r(128)] [7.9% done] [16173MB/0KB/0KB /s] 
[4140K/0/0 iops] [eta 11m:51s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=6988: Mon 
Jan 23 20:37:30 2017
   read : io=861513MB, bw=28712MB/s, iops=7350.4K, runt= 30005msec
     slat (usec): min=1, max=179194, avg= 9.61, stdev=166.67
     clat (usec): min=8, max=174722, avg=543.86, stdev=736.75
     clat percentiles (usec):
      |  1.00th=[  117],  5.00th=[  139], 10.00th=[  153], 20.00th=[  175],
      | 30.00th=[  199], 40.00th=[  223], 50.00th=[  258], 60.00th=[  302],
      | 70.00th=[  394], 80.00th=[  636], 90.00th=[ 1480], 95.00th=[ 2192],
      | 99.00th=[ 3408], 99.50th=[ 3856], 99.90th=[ 4960], 99.95th=[ 5536],
      | 99.99th=[10048]
     bw (KB  /s): min=14992, max=432176, per=0.78%, avg=229721.98, 
stdev=44902.57
     lat (usec) : 10=0.01%, 50=0.01%, 100=0.10%, 250=48.21%, 500=27.38%
     lat (usec) : 750=6.48%, 1000=3.18%
     lat (msec) : 2=8.54%, 4=5.73%, 10=0.38%, 20=0.01%, 50=0.01%
     lat (msec) : 100=0.01%, 250=0.01%
   cpu          : usr=8.25%, sys=64.76%, ctx=57533651, majf=0, minf=4224
   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=220547266/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32
randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=7138: Mon Jan 
23 20:37:30 2017
   read : io=482013MB, bw=16065MB/s, iops=4112.8K, runt= 30003msec
     slat (usec): min=1, max=48048, avg=29.39, stdev=36.10
     clat (usec): min=47, max=74459, avg=964.89, stdev=637.97
     clat percentiles (usec):
      |  1.00th=[  454],  5.00th=[  540], 10.00th=[  604], 20.00th=[  692],
      | 30.00th=[  764], 40.00th=[  828], 50.00th=[  876], 60.00th=[  924],
      | 70.00th=[  980], 80.00th=[ 1064], 90.00th=[ 1176], 95.00th=[ 1320],
      | 99.00th=[ 4768], 99.50th=[ 5536], 99.90th=[ 6432], 99.95th=[ 6752],
      | 99.99th=[ 7968]
     bw (KB  /s): min=14512, max=350248, per=0.78%, avg=128572.72, 
stdev=42938.35
     lat (usec) : 50=0.01%, 100=0.01%, 250=0.03%, 500=2.69%, 750=24.84%
     lat (usec) : 1000=45.08%
     lat (msec) : 2=24.43%, 4=1.40%, 10=1.51%, 20=0.01%, 50=0.01%
     lat (msec) : 100=0.01%
   cpu          : usr=4.98%, sys=94.81%, ctx=12736, majf=0, minf=3328
   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.1%, 16=0.1%, 32=100.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=123395206/w=0/d=0, short=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
    READ: io=861513MB, aggrb=28712MB/s, minb=28712MB/s, maxb=28712MB/s, 
mint=30005msec, maxt=30005msec

Run status group 1 (all jobs):
    READ: io=482013MB, aggrb=16065MB/s, minb=16065MB/s, maxb=16065MB/s, 
mint=30003msec, maxt=30003msec

Disk stats (read/write):
   nvme0n1: ios=13713322/0, merge=0/0, ticks=2809744/0, 
in_queue=2867236, util=98.51%
   nvme1n1: ios=13713230/0, merge=0/0, ticks=11534416/0, 
in_queue=12284600, util=99.60%
   nvme2n1: ios=13713491/0, merge=0/0, ticks=9773908/0, 
in_queue=10359404, util=99.80%
   nvme3n1: ios=13713296/0, merge=0/0, ticks=6619552/0, 
in_queue=6803384, util=99.49%
   nvme4n1: ios=13713658/0, merge=0/0, ticks=6055532/0, 
in_queue=6533236, util=100.00%
   nvme5n1: ios=13713740/0, merge=0/0, ticks=2863528/0, 
in_queue=2931544, util=99.89%
   nvme6n1: ios=13713827/0, merge=0/0, ticks=2796528/0, 
in_queue=2859208, util=99.72%
   nvme7n1: ios=13713905/0, merge=0/0, ticks=2846160/0, 
in_queue=2904800, util=99.74%
   nvme8n1: ios=13713529/0, merge=0/0, ticks=7422588/0, 
in_queue=7582496, util=100.00%
   nvme9n1: ios=13713414/0, merge=0/0, ticks=13762972/0, 
in_queue=14664088, util=100.00%
   nvme10n1: ios=13714158/0, merge=0/0, ticks=6570356/0, 
in_queue=6735324, util=100.00%
   nvme11n1: ios=13714217/0, merge=0/0, ticks=4189764/0, 
in_queue=4519824, util=100.00%
   nvme12n1: ios=13714299/0, merge=0/0, ticks=7225476/0, 
in_queue=7393668, util=100.00%
   nvme13n1: ios=13714375/0, merge=0/0, ticks=4988804/0, 
in_queue=5267536, util=100.00%
   nvme14n1: ios=13714461/0, merge=0/0, ticks=7336928/0, 
in_queue=7502260, util=100.00%
   nvme15n1: ios=13713918/0, merge=0/0, ticks=11861500/0, 
in_queue=12202492, util=100.00%
   md1: ios=123098498/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found]         ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
@ 2017-01-23 20:10           ` Tobias Oberstein
       [not found]             ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
       [not found]             ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
  0 siblings, 2 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:10 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio

Hi Andrey,

Thanks again for your tips .. the psync thingy in particular. I need to 
verify if that applies to PostgreSQL, because it brings huge gains 
compared to sync!

Here is the summary of my latest numbers:

1) engine=libaio

Individual NVMes:

   iops=7350.4K
   usr=8.25%, sys=64.76%, ctx=57533651

MD (RAID-0) over NVMes:

   iops=4112.8K
   usr=4.98%, sys=94.81%, ctx=12736

=> MD reaches 55% of perf compared to non-MD.


2) engine=sync

Individual NVMes:

    IOPS=6657k
    usr=0.56%, sys=4.43%, ctx=200588483

MD (RAID-0) over NVMes:

    IOPS=1467k
    usr=0.07%, sys=4.13%, ctx=46545978

=> MD reaches 22% of perf compared to non-MD.


3) engine=psync

Individual NVMes:

    IOPS=7086k
    usr=0.60%, sys=4.43%, ctx=214720330

MD (RAID-0) over NVMes:

    IOPS=4154k
    usr=0.46%, sys=5.81%, ctx=124737165

=> MD reaches 58% of perf compared to non-MD.

==================

Are the CPU load numbers reported by FIO reliable?

I mean, compare the load between libaio and sync/psync!

Cheers,
/Tobias


oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat 
best_sync_individual_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30

[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ cat 
best_sync_md_over_nvmes.fio
[global]
group_reporting
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
bs=4k
runtime=30

[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=2800
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio best_sync_individual_nvmes.fio
randread-individual-nvmes: (g=0): rw=randread, 
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2747 (f=28032): 
[f(9),_(1),f(27),_(3),f(20),_(1),f(2),_(1),f(57),_(1),f(250),_(1),f(108),_(1),f(48),_(1),f(26),_(1),f(14),_(2),f(444),_(1),f(36),_(1),f(193),_(1),f(100),_(1),f(26),_(1),f(40),_(1),f(1),_(1),f(19),_(2),f(36),_(1),f(77),_(1),f(20),_(1),f(37),_(1),f(6),_(1),f(8),_(1),f(45),_(1),f(3),_(1),f(10),_(1),f(38),_(1),f(7),_(1),f(16),_(1),f(10),_(1),f(3),_(1),f(3),_(2),f(11),_(1),f(26),_(1),f(39),_(1),f(5),_(1),f(15),_(1),f(90),_(1),f(80),_(1),f(87),_(1),f(67),_(1),f(91),_(1),f(9),_(1),f(35),E(1),f(166),_(1),f(78),_(1),f(152),_(1),f(57)][100.0%][r=18.7GiB/s,w=0KiB/s][r=4885k,w=0 
IOPS][eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=8021: Mon 
Jan 23 20:51:43 2017
    read: IOPS=6657k, BW=25.5GiB/s (27.3GB/s)(762GiB/30012msec)
     clat (usec): min=31, max=35890, avg=403.07, stdev=587.78
     clat percentiles (usec):
      |  1.00th=[  112],  5.00th=[  131], 10.00th=[  145], 20.00th=[  167],
      | 30.00th=[  187], 40.00th=[  211], 50.00th=[  237], 60.00th=[  270],
      | 70.00th=[  318], 80.00th=[  406], 90.00th=[  676], 95.00th=[ 1336],
      | 99.00th=[ 3280], 99.50th=[ 4016], 99.90th=[ 5536], 99.95th=[ 6304],
      | 99.99th=[ 9536]
     lat (usec) : 50=0.01%, 100=0.18%, 250=54.00%, 500=31.18%, 750=5.73%
     lat (usec) : 1000=2.24%
     lat (msec) : 2=3.63%, 4=2.52%, 10=0.50%, 20=0.01%, 50=0.01%
   cpu          : usr=0.56%, sys=4.43%, ctx=200588483, majf=0, minf=2797
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=199803621,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=25.5GiB/s (27.3GB/s), 25.5GiB/s-25.5GiB/s 
(27.3GB/s-27.3GB/s), io=762GiB (818GB), run=30012-30012msec

Disk stats (read/write):
   nvme0n1: ios=12474932/0, merge=0/0, ticks=3440096/0, 
in_queue=3545768, util=97.54%
   nvme1n1: ios=12488816/0, merge=0/0, ticks=6811092/0, 
in_queue=7420304, util=97.96%
   nvme2n1: ios=12488737/0, merge=0/0, ticks=4947416/0, 
in_queue=5379024, util=97.12%
   nvme3n1: ios=12488626/0, merge=0/0, ticks=4578888/0, 
in_queue=4696164, util=96.85%
   nvme4n1: ios=12488514/0, merge=0/0, ticks=3848360/0, 
in_queue=4189952, util=97.85%
   nvme5n1: ios=12488384/0, merge=0/0, ticks=2872728/0, 
in_queue=2946696, util=96.89%
   nvme6n1: ios=12488271/0, merge=0/0, ticks=2480536/0, 
in_queue=2544704, util=96.92%
   nvme7n1: ios=12488165/0, merge=0/0, ticks=4038500/0, 
in_queue=4154768, util=96.91%
   nvme8n1: ios=12488052/0, merge=0/0, ticks=4553428/0, 
in_queue=4675568, util=97.22%
   nvme9n1: ios=12487937/0, merge=0/0, ticks=5487888/0, 
in_queue=5956252, util=97.72%
   nvme10n1: ios=12486833/0, merge=0/0, ticks=6234216/0, 
in_queue=6402356, util=97.54%
   nvme11n1: ios=12486699/0, merge=0/0, ticks=4646856/0, 
in_queue=5042628, util=97.76%
   nvme12n1: ios=12486586/0, merge=0/0, ticks=5331000/0, 
in_queue=5478728, util=97.59%
   nvme13n1: ios=12486467/0, merge=0/0, ticks=3464404/0, 
in_queue=3715416, util=98.27%
   nvme14n1: ios=12486358/0, merge=0/0, ticks=2576312/0, 
in_queue=2641952, util=97.49%
   nvme15n1: ios=12486251/0, merge=0/0, ticks=4135908/0, 
in_queue=4270008, util=97.69%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio best_sync_md_over_nvmes.fio
randread-md-over-nvmes: (g=0): rw=randread, 
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2800 (f=2800): [r(2800)][100.0%][r=5764MiB/s,w=0KiB/s][r=1476k,w=0 
IOPS][eta 00m:00s]
randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=11137: Mon 
Jan 23 20:52:30 2017
    read: IOPS=1467k, BW=5732MiB/s (6011MB/s)(169GiB/30116msec)
     clat (usec): min=27, max=33113, avg=124.27, stdev=112.85
     clat percentiles (usec):
      |  1.00th=[   77],  5.00th=[   84], 10.00th=[   86], 20.00th=[   88],
      | 30.00th=[   93], 40.00th=[  101], 50.00th=[  104], 60.00th=[  107],
      | 70.00th=[  115], 80.00th=[  133], 90.00th=[  177], 95.00th=[  227],
      | 99.00th=[  370], 99.50th=[  506], 99.90th=[ 2096], 99.95th=[ 2544],
      | 99.99th=[ 2960]
     lat (usec) : 50=0.04%, 100=36.72%, 250=60.00%, 500=2.73%, 750=0.22%
     lat (usec) : 1000=0.07%
     lat (msec) : 2=0.12%, 4=0.11%, 10=0.01%, 20=0.01%, 50=0.01%
   cpu          : usr=0.07%, sys=4.13%, ctx=46545978, majf=0, minf=2797
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=44193488,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=5732MiB/s (6011MB/s), 5732MiB/s-5732MiB/s 
(6011MB/s-6011MB/s), io=169GiB (181GB), run=30116-30116msec

Disk stats (read/write):
     md1: ios=44010950/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=2762093/0, aggrmerge=0/0, aggrticks=280663/0, 
aggrin_queue=284837, aggrutil=99.12%
   nvme15n1: ios=2766734/0, merge=0/0, ticks=264808/0, in_queue=267732, 
util=98.68%
   nvme6n1: ios=2761142/0, merge=0/0, ticks=288704/0, in_queue=291288, 
util=98.76%
   nvme9n1: ios=2759118/0, merge=0/0, ticks=275752/0, in_queue=282288, 
util=98.95%
   nvme11n1: ios=2762423/0, merge=0/0, ticks=264996/0, in_queue=271464, 
util=98.91%
   nvme2n1: ios=2764361/0, merge=0/0, ticks=281520/0, in_queue=288924, 
util=99.12%
   nvme14n1: ios=2760515/0, merge=0/0, ticks=264796/0, in_queue=266752, 
util=98.61%
   nvme5n1: ios=2761756/0, merge=0/0, ticks=280020/0, in_queue=282840, 
util=98.92%
   nvme8n1: ios=2763138/0, merge=0/0, ticks=279332/0, in_queue=280624, 
util=98.53%
   nvme10n1: ios=2764117/0, merge=0/0, ticks=291264/0, in_queue=293444, 
util=98.67%
   nvme1n1: ios=2761579/0, merge=0/0, ticks=275872/0, in_queue=282080, 
util=98.90%
   nvme13n1: ios=2759948/0, merge=0/0, ticks=280080/0, in_queue=286324, 
util=99.05%
   nvme4n1: ios=2763271/0, merge=0/0, ticks=279592/0, in_queue=287944, 
util=98.96%
   nvme7n1: ios=2759669/0, merge=0/0, ticks=280708/0, in_queue=284056, 
util=98.88%
   nvme0n1: ios=2761263/0, merge=0/0, ticks=296868/0, in_queue=300408, 
util=98.78%
   nvme12n1: ios=2763077/0, merge=0/0, ticks=288264/0, in_queue=290264, 
util=98.71%
   nvme3n1: ios=2761377/0, merge=0/0, ticks=298040/0, in_queue=300960, 
util=98.74%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


=================

Changing engine to psync, leaving everything else unchanged:


oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio best_sync_individual_nvmes.fio
randread-individual-nvmes: (g=0): rw=randread, 
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2771 (f=40464): 
[f(8),_(1),f(14),_(1),f(30),_(1),f(6),_(1),f(4),_(1),f(7),_(1),f(14),_(1),f(6),_(1),f(62),_(1),f(3),_(1),f(167),_(1),f(309),_(1),f(269),_(1),f(47),_(1),f(206),_(1),f(26),_(1),f(56),_(2),f(4),_(1),f(39),_(1),f(148),_(1),f(148),_(1),f(4),_(1),f(63),_(1),f(27),_(1),f(19),_(1),f(314),_(1),f(189),_(1),f(205),_(1),f(377)][100.0%][r=25.7GiB/s,w=0KiB/s][r=6726k,w=0 
IOPS][eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=2800): err= 0: pid=14753: 
Mon Jan 23 20:58:45 2017
    read: IOPS=7086k, BW=27.4GiB/s (29.3GB/s)(811GiB/30010msec)
     clat (usec): min=34, max=57916, avg=381.14, stdev=524.36
     clat percentiles (usec):
      |  1.00th=[  121],  5.00th=[  145], 10.00th=[  159], 20.00th=[  185],
      | 30.00th=[  207], 40.00th=[  229], 50.00th=[  255], 60.00th=[  286],
      | 70.00th=[  326], 80.00th=[  394], 90.00th=[  564], 95.00th=[  988],
      | 99.00th=[ 2928], 99.50th=[ 3632], 99.90th=[ 5344], 99.95th=[ 6688],
      | 99.99th=[11200]
     lat (usec) : 50=0.01%, 100=0.08%, 250=48.03%, 500=39.59%, 750=5.69%
     lat (usec) : 1000=1.66%
     lat (msec) : 2=2.69%, 4=1.91%, 10=0.32%, 20=0.01%, 50=0.01%
     lat (msec) : 100=0.01%
   cpu          : usr=0.60%, sys=4.43%, ctx=214720330, majf=0, minf=2797
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=212658246,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=27.4GiB/s (29.3GB/s), 27.4GiB/s-27.4GiB/s 
(29.3GB/s-29.3GB/s), io=811GiB (871GB), run=30010-30010msec

Disk stats (read/write):
   nvme0n1: ios=13204662/0, merge=0/0, ticks=5579056/0, 
in_queue=5713604, util=97.16%
   nvme1n1: ios=13292212/0, merge=0/0, ticks=3336164/0, 
in_queue=3661216, util=97.52%
   nvme2n1: ios=13292063/0, merge=0/0, ticks=3097888/0, 
in_queue=3359552, util=97.09%
   nvme3n1: ios=13291900/0, merge=0/0, ticks=2973176/0, 
in_queue=3072764, util=96.31%
   nvme4n1: ios=13291734/0, merge=0/0, ticks=4962684/0, 
in_queue=5434620, util=97.02%
   nvme5n1: ios=13291540/0, merge=0/0, ticks=7857284/0, 
in_queue=8108332, util=96.75%
   nvme6n1: ios=13291403/0, merge=0/0, ticks=3160292/0, 
in_queue=3249508, util=96.46%
   nvme7n1: ios=13291270/0, merge=0/0, ticks=5593256/0, 
in_queue=5748080, util=96.42%
   nvme8n1: ios=13291057/0, merge=0/0, ticks=3345216/0, 
in_queue=3450892, util=96.81%
   nvme9n1: ios=13290897/0, merge=0/0, ticks=3102344/0, 
in_queue=3394168, util=97.38%
   nvme10n1: ios=13290753/0, merge=0/0, ticks=3050116/0, 
in_queue=3129208, util=96.74%
   nvme11n1: ios=13290570/0, merge=0/0, ticks=6353996/0, 
in_queue=6956272, util=97.59%
   nvme12n1: ios=13290405/0, merge=0/0, ticks=3268144/0, 
in_queue=3372100, util=97.04%
   nvme13n1: ios=13290255/0, merge=0/0, ticks=3037220/0, 
in_queue=3297944, util=97.78%
   nvme14n1: ios=13290110/0, merge=0/0, ticks=8279264/0, 
in_queue=8503324, util=97.47%
   nvme15n1: ios=13289722/0, merge=0/0, ticks=3361284/0, 
in_queue=3467660, util=97.22%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo 
/opt/fio/bin/fio best_sync_md_over_nvmes.fio
randread-md-over-nvmes: (g=0): rw=randread, 
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=psync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2800 threads
Jobs: 2367 (f=2342): 
[_(1),r(2),_(1),r(38),_(10),r(1),_(1),r(2),_(2),r(2),_(11),r(1),_(1),r(5),_(1),E(1),r(2),f(2),E(1),r(1),f(3),r(19),f(1),r(87),_(1),r(234),_(1),r(13),_(1),r(29),f(1),_(1),r(17),E(1),r(9),E(1),r(9),E(1),r(3),E(1),r(6),_(1),r(16),E(1),r(2),_(1),r(8),E(1),r(30),_(1),r(15),E(1),r(11),f(1),r(27),f(1),r(11),E(1),r(13),_(1),r(27),E(1),r(31),E(1),r(32),E(1),r(6),_(1),r(26),E(1),r(18),E(1),r(5),_(1),E(1),r(16),f(1),r(1),f(1),r(3),f(3),r(3),f(2),r(1),f(3),r(1),f(1),r(1),f(1),r(1),f(4),r(3),f(5),r(1),f(12),E(1),r(2),f(3),r(2),f(1),_(1),f(8),r(1),f(9),r(1),f(1),r(1),f(2),r(1),f(4),r(1),f(7),r(2),f(5),r(1),f(2),r(1),f(2),r(1),f(2),_(1),f(1),r(1),f(2),r(1),f(2),r(1),f(5),r(1),f(1),r(2),f(1),r(4),f(1),r(1),f(5),r(1),f(1),r(2),f(1),r(1),E(1),r(1),f(3),r(2),f(5),r(1),f(1),r(2),f(1),r(1),f(1),r(2),_(1),f(9),E(1),f(3),_(2),f(11),_(1),f(3),_(1),f(4),_(2),f(1),_(1),f(7),_(1),f(3),_(2),f(7),_(1),f(4),_(1),f(4),_(1),f(5),_(1),f(3),_(1),f(12),_(1),f(12),_(1),f(4),_(1),f(2),_(1),f(7),_(1),f(1),_(1),f(15),_(2),f(1),_(1),f(2),_(1),f(10),_(1),f(2),_(1),f(12),_(1),f(10),_(1),f(5),_(1),f(6),_(2),f(6),_(1),f(2),_(1),f(13),_(1),f(6),_(1),f(21),_(1),f(2),_(1),f(1),_(2),f(1),_(1),f(26),_(1),f(1),_(1),f(1),E(1),f(6),_(1),f(3),_(1),f(2),_(1),f(2),_(1),f(3),_(1),f(10),_(1),f(8),_(1),f(11),_(1),f(7),_(1),f(2),_(1),f(4),_(1),f(5),_(1),f(4),_(1),f(8),_(1),f(6),_(1),f(5),_(1),f(9),_(2),f(3),_(1),f(1),_(1),f(13),_(1),f(3),_(1),f(2),_(1),f(1),_(1),f(5),_(1),f(14),_(1),f(4),_(1),f(5),_(1),f(12),_(1),f(1),_(2),f(1),_(1),f(3),_(1),f(2),_(3),f(2),_(1),f(3),_(1),f(5),_(1),f(7),_(3),f(19),_(1),f(4),_(1),f(6),_(1),f(9),_(1),f(9),_(2),f(2),_(2),f(22),_(1),f(69),_(1),f(17),_(1),f(26),_(1),f(1),_(1),f(5),_(1),f(3),_(1),f(9),_(1),f(19),_(1),f(11),_(2),f(7),_(1),f(21),_(1),f(3),_(1),f(6),_(1),f(10),_(1),f(2),_(1),f(26),_(1),f(7),_(1),f(1),_(2),f(2),_(1),f(8),_(1),f(20),_(1),f(15),_(2),f(2),_(1),f(11),_(1),f(8),_(1),f(14),_(1),f(10),_(1),f(6),_(1),f(2),_(1),f(25),_(1),f(2),_(1),f(1),_(1),f(4),_(1),f(42),_(1),f(5),_(2),f(14),_(2),f(2),_(2),f(7),_(1),f(2),_(1),f(2),_(2),f(12),_(1),f(15),_(1),f(2),_(1),f(1),_(1),f(2),_(1),f(4),_(1),f(6),_(1),f(8),_(4),f(2),_(3),f(4),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(18),_(2),f(1),_(1),f(1),_(2),f(11),_(1),f(20),_(1),f(7),_(1),f(4),_(1),f(6),_(1),f(4),_(1),f(11),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(2),_(1),f(2),_(1),f(4),_(2),f(3),_(1),f(4),_(1),E(1),_(1),f(1),_(1),f(1),_(1),E(1),_(3),f(2),_(5),f(1),_(1),E(1),f(1),_(1),f(2),_(1),f(5),_(2),f(2),_(1),E(1),f(2),_(1),f(3),E(1),f(1),_(2),f(10),_(1),f(1),_(4),f(1),_(1),f(2),_(2),f(3),_(1),f(2),_(3),f(1),_(3),f(1),_(2),f(2),E(1),f(2),_(1),f(1),_(3),f(1),_(1),f(2),E(1),f(9),_(1),f(1),E(1),f(1),_(1),f(1),_(1),f(1),E(1),f(1),E(1),_(1),f(3),E(1),f(1),_(2),f(1),_(1),E(1),f(1),_(2),f(3),_(1),f(1),_(1),f(3),_(1),f(2),_(2),f(2),_(1),f(2),_(3),f(2),_(2),f(8),_(1),f(1),_(2),f(1),_(1),f(3),_(2),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(1),_(1),f(3),_(1),f(5),_(2),f(6),_(2),f(1),_(1),f(9),_(1),f(3),_(1),f(7),_(1),f(1),_(2),f(1),_(1),f(2),_(1),f(5),_(2),f(4),_(1),f(1),_(2),f(3),_(3),f(12),_(1),f(2),_(3),f(3),_(1),f(3),_(1),f(1),_(2),f(3),_(1),f(2),_(1),f(3),_(1),f(3),_(2),f(1),_(1),f(2),_(2),f(9),E(1),f(1),E(1),f(5),_(1),E(1),f(7),_(1),f(1),_(1),f(4),_(2),f(2),_(1),f(3),_(3),f(14),_(1),f(10),_(1),f(1),_(1),f(1),_(1),E(1),f(2),E(1),f(1),_(1),f(1),_(3),f(6),_(1),f(4),E(1),f(4),_(4),f(3),_(1),f(1),_(3),f(1),_(1),f(1),E(1),f(2),_(1),f(2),_(1),f(2),_(1),f(1),E(1),_(1),E(1),f(1),_(2),f(1),_(1),f(2),_(1),f(2),_(9),f(1),_(3),f(3),_(1),f(1),_(1),f(3),_(2),f(3),_(2),f(2),_(1),f(2),_(
1),f(1),_(2),f(1),_(2),f(2)][0.5%][r=15.2GiB/s,w=0KiB/s][r=3960k,w=0 
IOPS][eta 01h:38m:47s]
randread-md-over-nvmes: (groupid=0, jobs=2800): err= 0: pid=17756: Mon 
Jan 23 20:59:22 2017
    read: IOPS=4154k, BW=15.9GiB/s (17.2GB/s)(476GiB/30015msec)
     clat (usec): min=38, max=264790, avg=669.08, stdev=954.35
     clat percentiles (usec):
      |  1.00th=[  149],  5.00th=[  207], 10.00th=[  262], 20.00th=[  342],
      | 30.00th=[  410], 40.00th=[  470], 50.00th=[  532], 60.00th=[  604],
      | 70.00th=[  684], 80.00th=[  788], 90.00th=[  956], 95.00th=[ 1160],
      | 99.00th=[ 4512], 99.50th=[ 7392], 99.90th=[12480], 99.95th=[14400],
      | 99.99th=[19072]
     lat (usec) : 50=0.01%, 100=0.04%, 250=8.86%, 500=35.57%, 750=32.34%
     lat (usec) : 1000=14.64%
     lat (msec) : 2=6.53%, 4=0.91%, 10=0.89%, 20=0.22%, 50=0.01%
     lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%
   cpu          : usr=0.46%, sys=5.81%, ctx=124737165, majf=0, minf=2797
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      issued rwt: total=124675330,0,0, short=0,0,0, dropped=0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=15.9GiB/s (17.2GB/s), 15.9GiB/s-15.9GiB/s 
(17.2GB/s-17.2GB/s), io=476GiB (511GB), run=30015-30015msec

Disk stats (read/write):
     md1: ios=124675330/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=7792208/0, aggrmerge=0/0, aggrticks=1051705/0, 
aggrin_queue=1120720, aggrutil=100.00%
   nvme15n1: ios=7790429/0, merge=0/0, ticks=1048276/0, 
in_queue=1090348, util=100.00%
   nvme6n1: ios=7792474/0, merge=0/0, ticks=999284/0, in_queue=1035092, 
util=100.00%
   nvme9n1: ios=7792704/0, merge=0/0, ticks=1033208/0, in_queue=1151824, 
util=100.00%
   nvme11n1: ios=7792344/0, merge=0/0, ticks=1103896/0, 
in_queue=1231748, util=100.00%
   nvme2n1: ios=7791972/0, merge=0/0, ticks=1001928/0, in_queue=1121472, 
util=100.00%
   nvme14n1: ios=7795323/0, merge=0/0, ticks=1154676/0, 
in_queue=1190940, util=100.00%
   nvme5n1: ios=7784969/0, merge=0/0, ticks=1048052/0, in_queue=1081964, 
util=100.00%
   nvme8n1: ios=7792042/0, merge=0/0, ticks=1080976/0, in_queue=1112776, 
util=100.00%
   nvme10n1: ios=7786642/0, merge=0/0, ticks=1018484/0, 
in_queue=1054712, util=100.00%
   nvme1n1: ios=7793892/0, merge=0/0, ticks=1072588/0, in_queue=1194612, 
util=100.00%
   nvme13n1: ios=7792651/0, merge=0/0, ticks=1040368/0, 
in_queue=1157356, util=100.00%
   nvme4n1: ios=7794567/0, merge=0/0, ticks=1065096/0, in_queue=1198308, 
util=100.00%
   nvme7n1: ios=7794169/0, merge=0/0, ticks=1061900/0, in_queue=1104168, 
util=100.00%
   nvme0n1: ios=7794534/0, merge=0/0, ticks=1039064/0, in_queue=1071864, 
util=100.00%
   nvme12n1: ios=7796809/0, merge=0/0, ticks=1044664/0, 
in_queue=1081852, util=100.00%
   nvme3n1: ios=7789809/0, merge=0/0, ticks=1014828/0, in_queue=1052484, 
util=100.00%
oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found]               ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
@ 2017-01-23 20:20                 ` Tobias Oberstein
  0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 20:20 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio

> Are the CPU load numbers reported by FIO reliable?
>
>
> Yes, they're quite solid, just keep in mind that cpu is being reported on a
> thread basis.


Ahhh =)

That explains this:

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-15-59-MEHOP3ZW.1485202585.png

which is engine=psync on MD

and

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_21-19-56-9ieRvRZy.1485202817.png

which is engine=libaio on MD

--

Ha. And I thought for a second the machine is now going into "full magic 
mode" ;)

Thanks,
Tobias


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:40           ` Tobias Oberstein
@ 2017-01-23 20:24             ` Sitsofe Wheeler
  2017-01-23 21:22               ` Tobias Oberstein
  0 siblings, 1 reply; 27+ messages in thread
From: Sitsofe Wheeler @ 2017-01-23 20:24 UTC (permalink / raw)
  To: Tobias Oberstein; +Cc: Andrey Kuzmin, fio

On 23 January 2017 at 19:40, Tobias Oberstein
<tobias.oberstein@gmail.com> wrote:
> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>
>> On 23 January 2017 at 18:33, Tobias Oberstein
>> <tobias.oberstein@gmail.com> wrote:
>>>
>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>> Mmh.
>>> Plus the strange behavior.
>>
>> Have you tried batching the IOs and controlling how much are you
>> reaping at any one time? See
>>
>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>> for some of the options for controlling this...
>
> Thanks! Nice.
>
> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
> randread)
>
> Individual NVMes: iops=7350.4K
> MD (RAID-0) over NVMes: iops=4112.8K
>
> The going up and down of IOPS is gone.
>
> It's becoming more apparent I'd say, that there is an MD bottleneck though.

If you're "just" trying for higher IOPS you can also try gtod_reduce
(see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
). This subsumes things like disable_lat but you'll get fewer and less
accurate measurement stats back. With libaio, userspace reap
(http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
) can sometimes nudge numbers up, but at the cost of CPU.
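
E.g., just sketching the relevant options (the values are only a starting
point, and userspace_reap is specific to the libaio engine):

ioengine=libaio
iodepth=64
iodepth_batch_submit=16
gtod_reduce=1
userspace_reap=1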

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 20:24             ` Sitsofe Wheeler
@ 2017-01-23 21:22               ` Tobias Oberstein
       [not found]                 ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:22 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Andrey Kuzmin, fio

On 23.01.2017 at 21:24, Sitsofe Wheeler wrote:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>
>>> Have you tried batching the IOs and controlling how much are you
>>> reaping at any one time? See
>>>
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent I'd say, that there is an MD bottleneck though.
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
>

Using that option plus bumping to QD=64 and batch submit 16, I get

plain NVMes:   iops=7415.9K
MD over NVMes: iops=4112.4K

These are staggering numbers for sure!

In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB IOPS.

Since we have 8 (physical) of these, the real-world measurement (7.4 
million) is even above the datasheet (6.8 million).

I'd say: very good job Intel =)

The price of course is the CPU load to reach these numbers .. we have 
the 2nd largest Intel Xeon available

Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz

and 4 of these .. and even that isn't enough to saturate these NVMe 
beasts while still having room to do useful work (PostgreSQL).

So we're gonna be CPU bound .. again - this is the 2nd iteration of such 
a box. The first one has 48 E7 v2 cores and 8 x P3700 2TB. Also CPU 
bound on PostgreSQL anyway .. with 3TB RAM.

Cheers,
/Tobias




randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon 
Jan 23 22:12:30 2017
   read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
   cpu          : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320

randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon 
Jan 23 22:12:30 2017
   read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
   cpu          : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784



[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30

[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128

[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found]                 ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
@ 2017-01-23 21:42                   ` Andrey Kuzmin
  2017-01-23 23:51                     ` Tobias Oberstein
  0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-23 21:42 UTC (permalink / raw)
  To: Tobias Oberstein; +Cc: Jens Axboe, fio


On Jan 24, 2017 00:22, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:

Am 23.01.2017 um 21:24 schrieb Sitsofe Wheeler:

> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@gmail.com> wrote:
>
>> Am 23.01.2017 um 20:13 schrieb Sitsofe Wheeler:
>>
>>>
>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@gmail.com> wrote:
>>>
>>>>
>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.
>>>>
>>>
>>> Have you tried batching the IOs and controlling how much are you
>>> reaping at any one time? See
>>>
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-a
>>> rg-iodepth_batch_submit
>>> for some of the options for controlling this...
>>>
>>
>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread)
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent I'd say, that there is an MD bottleneck
>> though.
>>
>
> If you're "just" trying for higher IOPS you can also try gtod_reduce
> (see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-a
> rg-gtod_reduce
> ). This subsumes things like disable_lat but you'll get fewer and less
> accurate measurement stats back. With libaio userspace reap
> (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-
> arg-userspace_reap
> ) can sometimes nudge numbers up but at the cost of CPU.
>
>
Using that option plus bumping to QD=64 and batch submit 16, I get

plain NVMes:   iops=7415.9K
MD over NVMes: iops=4112.4K

These are staggering numbers for sure!

In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB

Since we have 8 (physical) of these, the real-world measurement (7.4 million)
is even above the datasheet (6.8 million).

I'd say: very good job Intel =)

The price of course is the CPU load to reach these numbers .. we have the
2nd largest Intel Xeon available

Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz

and 4 of these .. and even that isn't enough to saturate these NVMe beasts
while still having room to do useful work (PostgreSQL).



The root cause behind the high cpu utilization is the IRQ load your eight
NVMe drives generate, although context switching your 2048 threads also adds
a lot.

To cope with the unsustainable interrupt rate, you might want to give a
shot to the pvsync2 engine with the RWF_HIPRI option set, which turns on
polling mode in the block layer (Jens has been very much behind it, so he's
the guy in the know of the details).

Polling avoids interrupts at the price of somewhat inflated latency,
but reduces the cpu load noticeably, so it may turn out to be a good option
for your box specifically. Notice you'll need preadv2/pwritev2 syscall
support in your kernel.
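
A minimal job sketch along those lines (device path, block size and job
count are just placeholders to adapt to your setup):

[global]
group_reporting
ioengine=pvsync2
hipri=1
direct=1
thread=1
bs=4k
rw=randread
time_based=1
runtime=30

[randread-md-polled]
filename=/dev/md1
numjobs=128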

Regards,
Andrey



So we're gonna be CPU bound .. again - this is the 2nd iteration of such a
box. The first one has 48 cores E7 v2 and 8 x P3700 2TB. Also CPU bound on
PostgreSQL anyway .. with 3TB RAM.

Cheers,
/Tobias




randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon
Jan 23 22:12:30 2017
  read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
  cpu          : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320

randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon Jan
23 22:12:30 2017
  read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
  cpu          : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784



[global]
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16

thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1

bs=4k
runtime=30

[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1
:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nv
me8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1
:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128

[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
       [not found]               ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
@ 2017-01-23 21:49                 ` Tobias Oberstein
  0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 21:49 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio

Hi Andrey,

> Thanks again for your tips .. the psync thingy in particular. I need to
> verify if that applies to PostgreSQL, because it brings huge gains compared
> to sync!
>
>
> That's easy to explain, it just does one syscall less per IO. It should
> indeed bring home a measurable gain as, with synchronous I/O, I believe
> you're cpu-limited.

Sadly, it seems PostgreSQL currently does lseek/read/write. (I'll 
double-check tomorrow by running perf against an active PostgreSQL instance.)

There was a patch discussed here using pread/pwrite when available:

https://www.postgresql.org/message-id/flat/CABUevEzZ%3DCGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q%40mail.gmail.com#CABUevEzZ=CGdmwSZwW9oNuf4pQZMExk33jcNO7rseqrAgKzj5Q@mail.gmail.com

which ends with a comment by Tom Lane (PostgreSQL core developer)

"Well, my point remains that I see little value in messing with
long-established code if you can't demonstrate a benefit that's clearly
above the noise level."

=(

I will post the findings from our discussion here to the PG hackers 
list. Maybe ...

Cheers,
/Tobias

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 21:42                   ` Andrey Kuzmin
@ 2017-01-23 23:51                     ` Tobias Oberstein
  2017-01-24  8:21                       ` Andrey Kuzmin
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-23 23:51 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: Jens Axboe, fio

> The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching your 2048 threads also adds
> a lot.

Indeed, the ctx switches and interrupts are in the millions/sec.

With engine=sync and numjobs=2048, I have

ctx_sw: 8828446
inter:  5780374

It's astonishing that this is even possible.
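
Those rates can be watched from a second shell while fio runs, e.g.:

vmstat 1                        # 'in' = interrupts/s, 'cs' = context switches/s
grep -i nvme /proc/interrupts   # raw per-queue NVMe interrupt counters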

> To cope with the unsustainable interrupt rate, you might want to give the
> pvsync2 engine with the hipri option (RWF_HIPRI) a shot, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out to be a good option for
> your box specifically. Note that you'll need preadv2/pwritev2 syscall support
> in your kernel.

I have run an exhaustive set of 30 tests across the different engines, 
including pvsync2 + hipri.

Please find everything here

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md

and in the containing folder there.
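
For reference, the pvsync2 + hipri runs boil down to a job of this shape 
(a sketch - not the exact file from the repo):

[global]
ioengine=pvsync2
hipri
direct=1
thread=1
bs=4k
rw=randread
norandommap=1
randrepeat=0
time_based=1
runtime=30

[randread-pvsync2-hipri]
# device list as in the earlier jobs (single device shown for brevity)
filename=/dev/nvme0n1
numjobs=1024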

Using pvsync2 + hipri indeed changes the picture .. but not for the better =(

The machine completely bogs down and the IOPS doesn't get higher.

Sidenote: it would be nice if fio logged the total CPU and interrupt rates ..

Here is a screenshot while running pvsync2+hipri

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png

--

My current preliminary conclusions on this box / workload:

- running psync is much better than sync
- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU

Cheers,
/Tobias

PS: I have a small time window left (days) until this box goes into 
further setup for production (which means, I cannot scratch the storage 
anymore) - if you have anything you want me to try, let me know. I do my 
best to get it tested. The hardware is probably not mainstream ..





* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 23:51                     ` Tobias Oberstein
@ 2017-01-24  8:21                       ` Andrey Kuzmin
  2017-01-24  9:28                         ` Tobias Oberstein
  0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24  8:21 UTC (permalink / raw)
  To: Tobias Oberstein; +Cc: fio, Jens Axboe


On Jan 24, 2017 02:51, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:

The root cause behind the high cpu utilization is the IRQ load your eight
> NVMe drives generate, although context switching your 2048 threads also adds
> a lot.
>

Indeed, the ctx switches and interrupts are in the millions/sec.

With engine=sync and numjobs=2048, I have

ctx_sw: 8828446
inter:  5780374

It's astonishing that this is even possible.


To cope with the unsustainable interrupt rate, you might want to give the
> pvsync2 engine with the hipri option (RWF_HIPRI) a shot, which turns on polling
> mode in the block layer (Jens has been very much behind it, so he's the guy
> in the know of the details).
>
> Polling avoids interrupts at the price of somewhat inflated latency,
> but reduces the cpu load noticeably, so it may turn out to be a good option for
> your box specifically. Note that you'll need preadv2/pwritev2 syscall support
> in your kernel.
>

I have run an exhaustive set of 30 tests across the different engines,
including pvsync2 + hipri.

Please find everything here

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines/README.md

and in the containing folder there.

Using pvsync2 + hipri indeed changes the picture .. but not for the better =(



Surprising that it didn't work for you, since polling is very well suited to
your specific scenario.


The machine completely bogs down and the IOPS doesn't get higher.

Sidenote: it would be nice if fio logged the total CPU and interrupt rates ..

Here is a screenshot while running pvsync2+hipri

http://picpaste.com/pics/Bildschirmfoto_vom_2017-01-23_23-52-10-55NJYHu2.1485215076.png

--

My current preliminary conclusions on this box / workload:

- running psync is much better than sync



So you likely have a convincing case for Postgres guys to switch over to
pread/pwrite.

Regards,
Andrey

- all engines "above" psync only bring minor perf. gains
- Linux MD (pure striping, RAID-0) comes with roughly 45% overhead
- saturating the storage subsystem consumes nearly all CPU

Cheers,
/Tobias

PS: I have a small time window left (days) until this box goes into further
setup for production (which means, I cannot scratch the storage anymore) -
if you have anything you want me to try, let me know. I do my best to get
it tested. The hardware is probably not mainstream ..



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-24  8:21                       ` Andrey Kuzmin
@ 2017-01-24  9:28                         ` Tobias Oberstein
  2017-01-24  9:40                           ` Andrey Kuzmin
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24  9:28 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio, Jens Axboe

> My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.

I will approach them, but I want to make sure I did all my homework first.

One question that bugs me:

the difference in performance between the sync and psync engines only 
surfaces with MD, _not_ when running over individual devices.

---

I ran Linux perf with these results:

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md

---

md-nvmes-sync shows the "issue":

Overhead  Command  Shared Object       Symbol
   73.48%  fio      [kernel.kallsyms]   [k] osq_lock
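
(osq_lock is the kernel's optimistic spin queue, i.e. CPU time burnt spinning
while trying to take a mutex/rw-semaphore. The profiles were captured with
plain perf, roughly along these lines:)

sudo perf record -a -g -- sleep 30   # system-wide, with call graphs, while fio runs
sudo perf report --sort symbol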


So while I think it would be good in general if PostgreSQL used 
pread/pwrite instead of lseek/read/write when available, I am afraid 
there might be a bottleneck in MD.

What do you think?

And if so, where should I raise this rgd MD? I have no clue where the 
hackers of MD hang out ..

Cheers,
/Tobias



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-24  9:28                         ` Tobias Oberstein
@ 2017-01-24  9:40                           ` Andrey Kuzmin
  2017-01-24 22:51                             ` Tobias Oberstein
  0 siblings, 1 reply; 27+ messages in thread
From: Andrey Kuzmin @ 2017-01-24  9:40 UTC (permalink / raw)
  To: Tobias Oberstein; +Cc: fio, Jens Axboe


On Jan 24, 2017 12:28, "Tobias Oberstein" <tobias.oberstein@gmail.com>
wrote:

My current preliminary conclusions on this box / workload:
>
> - running psync is much better than sync
>
> So you likely have a convincing case for Postgres guys to switch over to
> pread/pwrite.
>

I will approach them, but I want to make sure I did all my homework first.

One question that bugs me:

the difference in performance between the sync and psync engines only surfaces
with MD, _not_ when running over individual devices.



My guess is, with individual devices there's no cpu headroom for the
syscall savings to show up. Once the MD bottleneck kicks in, you're not bound
by cpu anymore and the difference between doing a single syscall vs. two shows up.


---

I ran Linux perf with these results:

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-sync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/individual-nvmes-psync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-sync.md

https://github.com/oberstet/scratchbox/blob/master/cruncher/sync-engines-perf/md-nvmes-psync.md

---

md-nvmes-sync shows the "issue":

Overhead  Command  Shared Object       Symbol
  73.48%  fio      [kernel.kallsyms]   [k] osq_lock


So while I think it would be good in general if PostgreSQL used
pread/pwrite instead of lseek/read/write when available, I am afraid there
might be a bottleneck in MD.

What do you think?

And if so, where should I raise this rgd MD? I have no clue where the
hackers of MD hang out ..


Yup, I believe it makes sense to post to the md mailing list.

Regards,
Andrey


Cheers,
/Tobias



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:06     ` Kudryavtsev, Andrey O
@ 2017-01-24  9:46       ` Tobias Oberstein
  2017-01-24  9:55       ` Tobias Oberstein
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24  9:46 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, fio

Hi Andrey,

On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in the generic release; you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the raid metadata.
>
> Your findings follow my expectations; for QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running CentOS 7, either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and it can really help. See the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>

I have gone through the whole manual, but I cannot find info about the 
meaning of different LBAFormats.

The Oracle article above uses

LBAFormat=3

which I presume means a 4k sector size.

The P3608 seems to support values up to 6:

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all 
-intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6

So is this the correct mapping for the value?

LBAFormat	Sector Size
0	512
1	1024
2	2048
3	4096
4	8192
5	16384
6	32768

In this case, I'd use

LBAFormat=4

to get 8k sectors, since my workload is purely 8k.
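
Rather than guessing the mapping, the supported formats can presumably also
be read straight off the namespace with nvme-cli, if that is installed (a
sketch, not verified on this box):

sudo nvme id-ns /dev/nvme0n1 -H | grep -i 'lba format'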

Cheers,
/Tobias



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:06     ` Kudryavtsev, Andrey O
  2017-01-24  9:46       ` Tobias Oberstein
@ 2017-01-24  9:55       ` Tobias Oberstein
  2017-01-24 10:03       ` Tobias Oberstein
  2017-01-24 15:19       ` Tobias Oberstein
  3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24  9:55 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, fio

On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in the generic release; you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the raid metadata.
>
> Your findings follow my expectations; for QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running CentOS 7, either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and it can really help. See the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>

It doesn't work =(


oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start 
-nvmeformat -intelssd 0 \
 >   LBAFormat=4 \
 >   SecureEraseSetting=0 \
 >   ProtectionInformation=0 \
 >   MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...

- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -

Status : NVMe command reported a problem.

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all 
-intelssd 0

- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -

AggregationThreshold : 0
AggregationTime : 0
ArbitrationBurst : 0
Bootloader : 8B1B0133
CoalescingDisable : 1
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
EndToEndDataProtCapabilities : 17
EnduranceAnalyzer : Media Workload Indicators have reset values. Run 60+ 
minute workload prior to running the endurance analyzer.
ErrorString :
Firmware : 8DV101F0
FirmwareUpdateAvailable : The selected Intel SSD contains current 
firmware as of this tool release.
HighPriorityWeightArbitration : 0
IOCompletionQueuesRequested : 30
IOSubmissionQueuesRequested : 30
Index : 0
Intel : True
IntelGen3SATA : False
IntelNVMe : True
InterruptVector : 0
LBAFormat : 0
LatencyTrackingEnabled : False
LowPriorityWeightArbitration : 0
MaximumLBA : 3907029167
MediumPriorityWeightArbitration : 0
MetadataSetting : 0
ModelNumber : INTEL SSDPECME040T4
NVMeControllerID : 0
NVMeMajorVersion : 1
NVMeMinorVersion : 0
NVMePowerState : 0
NVMeTertiaryVersion : 0
NamespaceId : 1
NativeMaxLBA : 3907029167
NumErrorLogPageEntries : 63
NumLBAFormats : 6
OEM : Generic
PCILinkGenSpeed : 3
PCILinkWidth : 4
PowerGovernorMode : 0 40W for 8 Lane Slot power
Product : Fultondale X8
ProductFamily : Intel SSD DC P3608 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : 106
SectorSize : 512
SerialNumber : CVF8551400324P0DGN-1
TCGSupported : False
TempThreshold : 85
TimeLimitedErrorRecovery : 0
TrimSupported : True
VolatileWriteCacheEnabled : False
WriteAtomicityDisableNormal : 0

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct --version
Syntax Error: Invalid command. Error at or around '--version'.
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.2
Description: Interact and configure Intel SSDs.


oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:06     ` Kudryavtsev, Andrey O
  2017-01-24  9:46       ` Tobias Oberstein
  2017-01-24  9:55       ` Tobias Oberstein
@ 2017-01-24 10:03       ` Tobias Oberstein
  2017-01-24 15:19       ` Tobias Oberstein
  3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 10:03 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, fio

On 23.01.2017 at 20:06, Kudryavtsev, Andrey O wrote:
> Hi Tobias,
> Yes, “imsm” is in the generic release; you don’t need to go to the latest or special build then if you want to stay compliant. It’s mainly a different layout of the raid metadata.
>
> Your findings follow my expectations; for QD1 the sync engine gives good results. Can you try libaio with QD4 and 2800/4 jobs?
> Most of the time I’m running CentOS 7, either with 3.10 or the latest kernel, depending on the scope of the testing.
>
> Changing the sector size to 4k is easy, and it can really help. See the DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/
>
>

It doesn't work with LBAFormat=3 either:

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct start 
-nvmeformat -intelssd 0 \
 >   LBAFormat=3 \
 >   SecureEraseSetting=0 \
 >   ProtectionInformation=0 \
 >   MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...

- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -

Status : Interrupted system call

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ sudo isdct show -all 
-intelssd 0 | grep LBA
LBAFormat : 0
MaximumLBA : 3907029167
NativeMaxLBA : 3907029167
NumLBAFormats : 6

-----

And using exactly the same parameters as the article above:

oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$ time sudo isdct start 
-nvmeformat -intelssd 0 \
 >   LBAFormat=3 \
 >   SecureEraseSetting=2 \
 >   ProtectionInformation=0 \
 >   MetaDataSettings=0
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...

- Intel SSD DC P3608 Series CVF8551400324P0DGN-1 -

Status : Interrupted system call


real	0m26.901s
user	0m0.048s
sys	0m0.032s
oberstet@svr-psql19:~/scm/parcit/RA/user/oberstet$


----

I see the following in the kernel log:

[417528.128501] nvme nvme0: I/O 0 QID 0 timeout, reset controller
[417786.440977] nvme nvme0: I/O 0 QID 0 timeout, reset controller


What should I do?

Thanks a lot,
/Tobias


* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-23 19:06     ` Kudryavtsev, Andrey O
                         ` (2 preceding siblings ...)
  2017-01-24 10:03       ` Tobias Oberstein
@ 2017-01-24 15:19       ` Tobias Oberstein
  3 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 15:19 UTC (permalink / raw)
  To: Kudryavtsev, Andrey O, fio

Hi Andrey,

> Changing sector to 4k is easy, this can really help. see DCT manual, it’s there.
> This can be relevant for you https://itpeernetwork.intel.com/how-to-configure-oracle-redo-on-the-intel-pcie-ssd-dc-p3700/

After overcoming my issues with isdct and reformatting the NVMes to a 4k 
sector size: success!

9.5 million IOPS =)

This is another 34% faster than before.

So: thanks a bunch for your tip!

Cheers,
/Tobias


Next steps:

- approach MD developers about bottlenecks there
- approach PostgreSQL about using pread/pwrite (instead of lseek/read/write)


randread-individual-nvmes: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, 
ioengine=libaio, iodepth=128
...
fio-2.1.11
Starting 128 threads
Jobs: 128 (f=2048): [r(128)] [100.0% done] [37244MB/0KB/0KB /s] 
[9534K/0/0 iops] [eta 00m:00s]
randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=25406: Tue 
Jan 24 15:57:19 2017
   read : io=1083.9GB, bw=36964MB/s, iops=9462.8K, runt= 30026msec
   cpu          : usr=9.00%, sys=77.01%, ctx=49252920, majf=0, minf=16512



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-24  9:40                           ` Andrey Kuzmin
@ 2017-01-24 22:51                             ` Tobias Oberstein
  2017-01-25 16:23                               ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-24 22:51 UTC (permalink / raw)
  To: Andrey Kuzmin; +Cc: fio, Jens Axboe

> My current preliminary conclusions on this box / workload:
>>
>> - running psync is much better than sync
>>
>> So you likely have a convincing case for Postgres guys to switch over to
>> pread/pwrite.

I did raise it on the PG hackers mailing list, but I couldn't convince 
them =(

Pity, since there even was a patch in the past (the change seems to be 
easy, but was rejected).

They say I would need to come up with a real-world PostgreSQL database 
workload that shows this effect is above the noise level.

And since PostgreSQL is such a CPU hog anyway, and since I don't have 
time for a full research project, I'll leave it at that.

---

But I did more fio-level benchmarking to compare the efficiency of the 
different IO methods. The numbers below quantify the differences.

ioengine      sync    psync   vsync   pvsync  pvsync2  pvsync2+hipri
iodepth       1       1       1       1       1        1
numjobs       1024    1024    1024    1024    1024     1024
concurrency   1024    1024    1024    1024    1024     1024
iops (k)      9171    9390    9196    9473    9527     9516
user (%)      7.7     9.3     8.6     9.0     9.3      2.6
system (%)    86.8    77.0    85.8    76.3    77.3     97.4
total (%)     94.5    86.3    94.4    85.3    86.6     100.0
iops/system   105.7   121.9   107.2   124.2   123.2    97.7


As can be seen, the kIOPS normalized to system CPU load (last line) for 
psync (pread/pwrite) is significantly higher than for sync 
(lseek/read/write).

Now here is AIO:

ioengine      libaio   libaio   libaio
iodepth       32       32       32
numjobs       128      64       32
concurrency   4096     2048     1024
iops (k)      9485.6   9479.4   8718.1
user (%)      6.7      3.4      2.4
system (%)    59.2     30.0     16.7
total (%)     65.9     33.4     19.1
iops/system   160.2    316.0    522.0

The highest kIOPS/system is reached at a concurrency of 1024.
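
For reference, that sweet spot corresponds to a job section roughly like 
this (a sketch; the filename is left out on purpose):

[randread-libaio-qd32]
ioengine=libaio
iodepth=32
numjobs=32
rw=randread
direct=1
bs=4k
# filename= pointing at the device(s) under test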

However, during my tests, I get this in kernel log:

[459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for 22s! 
[swapper/46:0]
[461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for 22s! 
[swapper/26:0]
[461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! 
[swapper/23:0]

A wild guess: these lockups are actually deadlocks. AIO seems to be 
tricky for the kernel too.

Cheers,
/Tobias



* RE: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-24 22:51                             ` Tobias Oberstein
@ 2017-01-25 16:23                               ` Elliott, Robert (Persistent Memory)
  2017-01-26 17:52                                 ` Tobias Oberstein
  0 siblings, 1 reply; 27+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-01-25 16:23 UTC (permalink / raw)
  To: Tobias Oberstein, Andrey Kuzmin; +Cc: fio, Jens Axboe



> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
> Behalf Of Tobias Oberstein
> Sent: Tuesday, January 24, 2017 4:52 PM
> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
> 
> However, during my tests, I get this in kernel log:
> 
> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for
> 22s!
> [swapper/46:0]
> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for
> 22s!
> [swapper/26:0]
> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for
> 22s!
> [swapper/23:0]
> 
> A wild guess: these lockups are actually deadlocks. AIO seems to be
> tricky for the kernel too.
> 

Probably not deadlocks.  One easy way to trigger those is to submit
IOs on one set of CPUs and expect a different set of CPUs to handle
the interrupts and completions.  The latter CPUs can easily become
overwhelmed.  The best remedy I've found is to require CPUs to handle
their own IOs, which self-throttles them from submitting more IOs
than they can handle.

The storage device driver needs to set up its hardware interrupts
that way.  Then, rq_affinity=2 ensures the block layer completions
are handled on the submitting CPU.
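
For the NVMe queues here that would be something along these lines (a
sketch, adjust the device glob as needed):

for q in /sys/block/nvme*n1/queue/rq_affinity; do echo 2 | sudo tee "$q"; done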

You can add this to the kernel command line (e.g., in 
/boot/grub/grub.conf) to squelch those checks:
	nosoftlockup
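
(e.g., on distros that use /etc/default/grub, the rough recipe is to add
nosoftlockup to the GRUB_CMDLINE_LINUX value and then regenerate the config -
the exact file and tooling differ per distro:)

sudo update-grub                                    # Debian/Ubuntu style
# or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL/CentOS style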

Those prints themselves can induce more soft lockups if you have a
live serial port, because printing to the serial port is slow
and blocking.



* Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
  2017-01-25 16:23                               ` Elliott, Robert (Persistent Memory)
@ 2017-01-26 17:52                                 ` Tobias Oberstein
  0 siblings, 0 replies; 27+ messages in thread
From: Tobias Oberstein @ 2017-01-26 17:52 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory); +Cc: fio, Jens.Wilke@parcIT.de

Hi Robert,

On 25.01.2017 at 17:23, Elliott, Robert (Persistent Memory) wrote:
>
>
>> -----Original Message-----
>> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
>> Behalf Of Tobias Oberstein
>> Sent: Tuesday, January 24, 2017 4:52 PM
>> To: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
>> Cc: fio@vger.kernel.org; Jens Axboe <axboe@kernel.dk>
>> Subject: Re: 4x lower IOPS: Linux MD vs indiv. devices - why?
>>
>> However, during my tests, I get this in kernel log:
>>
>> [459346.155564] NMI watchdog: BUG: soft lockup - CPU#46 stuck for
>> 22s!
>> [swapper/46:0]
>> [461040.530959] NMI watchdog: BUG: soft lockup - CPU#26 stuck for
>> 22s!
>> [swapper/26:0]
>> [461044.279081] NMI watchdog: BUG: soft lockup - CPU#23 stuck for
>> 22s!
>> [swapper/23:0]
>>
>> A wild guess: these lockups are actually deadlocks. AIO seems to be
>> tricky for the kernel too.
>>
>
> Probably not deadlocks.  One easy way to trigger those is to submit
> IOs on one set of CPUs and expect a different set of CPUs to handle
> the interrupts and completions.  The latter CPUs can easily become
> overwhelmed.  The best remedy I've found is to require CPUs to handle
> their own IOs, which self-throttles them from submitting more IOs
> than they can handle.
>
> The storage device driver needs to set up its hardware interrupts
> that way.  Then, rq_affinity=2 ensures the block layer completions
> are handled on the submitting CPU.
>
> You can add this to the kernel command line (e.g., in
> /boot/grub/grub.conf) to squelch those checks:
> 	nosoftlockup
>
> Those prints themselves can induce more soft lockups if you have a
> live serial port, because printing to the serial port is slow
> and blocking.
>

Thanks a lot for your tips!

Indeed, we currently have rq_affinity=1.

Are there any risks involved?

I mean, this is a complex box .. please see below.

Also: sadly, not every NUMA socket has exactly 2 NVMes (due to 
mainboard / slot limitations). So wouldn't enforcing IO affinity be a 
problem with this?

Cheers,
/Tobias

PS: The mainboard is

https://www.supermicro.nl/products/motherboard/Xeon/C600/X10QBI.cfm

Yeah, I know, no offense - this particular piece isn't HPE;)


The current settings / hardware:


oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/rq_affinity
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/scheduler
none
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/optimal_io_size
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/iostats
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
128
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/hw_sector_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/physical_block_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/nomerges
0
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/io_poll
1
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/minimum_io_size
4096
oberstet@svr-psql19:~$ cat /sys/block/nvme0n1/queue/write_cache
write through



oberstet@svr-psql19:~$ cat /proc/cpuinfo | grep "Intel(R) Xeon(R) CPU 
E7-8880 v4 @ 2.20GHz" | wc -l
176
oberstet@svr-psql19:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 88 
89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
node 0 size: 773944 MB
node 0 free: 770949 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 
42 43 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 
126 127 128 129 130 131
node 1 size: 774137 MB
node 1 free: 762335 MB
node 2 cpus: 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 
64 65 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 
148 149 150 151 152 153
node 2 size: 774126 MB
node 2 free: 763220 MB
node 3 cpus: 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 
86 87 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 
170 171 172 173 174 175
node 3 size: 774136 MB
node 3 free: 770518 MB
node distances:
node   0   1   2   3
   0:  10  21  21  21
   1:  21  10  21  21
   2:  21  21  10  21
   3:  21  21  21  10



oberstet@svr-psql19:~$ find /sys/devices | egrep 'nvme[0-9][0-9]?$'
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:02.0/0000:0a:00.0/nvme/nvme3
/sys/devices/pci0000:00/0000:00:03.0/0000:07:00.0/0000:08:01.0/0000:09:00.0/nvme/nvme2
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:01.0/0000:05:00.0/nvme/nvme0
/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0/0000:04:02.0/0000:06:00.0/nvme/nvme1
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:02.0/0000:86:00.0/nvme/nvme9
/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/0000:84:01.0/0000:85:00.0/nvme/nvme8
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:01.0/0000:48:00.0/nvme/nvme6
/sys/devices/pci0000:40/0000:40:03.2/0000:46:00.0/0000:47:02.0/0000:49:00.0/nvme/nvme7
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:02.0/0000:44:00.0/nvme/nvme5
/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0/0000:42:01.0/0000:43:00.0/nvme/nvme4
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:02.0/0000:c8:00.0/nvme/nvme13
/sys/devices/pci0000:c0/0000:c0:02.2/0000:c5:00.0/0000:c6:01.0/0000:c7:00.0/nvme/nvme12
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:01.0/0000:c3:00.0/nvme/nvme10
/sys/devices/pci0000:c0/0000:c0:02.0/0000:c1:00.0/0000:c2:02.0/0000:c4:00.0/nvme/nvme11
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:02.0/0000:cc:00.0/nvme/nvme15
/sys/devices/pci0000:c0/0000:c0:03.0/0000:c9:00.0/0000:ca:01.0/0000:cb:00.0/nvme/nvme14
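
The NUMA node behind each controller can be read straight from sysfs, e.g. 
(addresses taken from the list above):

cat /sys/bus/pci/devices/0000:05:00.0/numa_node    # nvme0
cat /sys/bus/pci/devices/0000:06:00.0/numa_node    # nvme1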


oberstet@svr-psql19:~$ egrep -H '.*' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0/address:0000:01:00
/sys/bus/pci/slots/10/address:0000:c5:00
/sys/bus/pci/slots/11/address:0000:c9:00
/sys/bus/pci/slots/1/address:0000:03:00
/sys/bus/pci/slots/2/address:0000:07:00
/sys/bus/pci/slots/3/address:0000:46:00
/sys/bus/pci/slots/4/address:0000:41:00
/sys/bus/pci/slots/5/address:0000:45:00
/sys/bus/pci/slots/6/address:0000:81:00
/sys/bus/pci/slots/7/address:0000:82:00
/sys/bus/pci/slots/8/address:0000:c1:00
/sys/bus/pci/slots/9/address:0000:83:00




end of thread

Thread overview: 27+ messages
2017-01-23 16:26 4x lower IOPS: Linux MD vs indiv. devices - why? Tobias Oberstein
     [not found] ` <CANvN+en2ihATNgrbgzwNXAK87wNh+6jXHinmg2-VmHon31AJzA@mail.gmail.com>
2017-01-23 17:52   ` Tobias Oberstein
     [not found]     ` <CANvN+em0cjWRnQWccdORKFEJk0OSeQOrZq+XE6kzPmqMPB--4g@mail.gmail.com>
2017-01-23 18:33       ` Tobias Oberstein
2017-01-23 19:10         ` Kudryavtsev, Andrey O
2017-01-23 19:26           ` Tobias Oberstein
2017-01-23 19:13         ` Sitsofe Wheeler
2017-01-23 19:40           ` Tobias Oberstein
2017-01-23 20:24             ` Sitsofe Wheeler
2017-01-23 21:22               ` Tobias Oberstein
     [not found]                 ` <CANvN+emLjb9idri9r42V3W9ia6v0EDGdJYFfhzq6rAuzGWec8Q@mail.gmail.com>
2017-01-23 21:42                   ` Andrey Kuzmin
2017-01-23 23:51                     ` Tobias Oberstein
2017-01-24  8:21                       ` Andrey Kuzmin
2017-01-24  9:28                         ` Tobias Oberstein
2017-01-24  9:40                           ` Andrey Kuzmin
2017-01-24 22:51                             ` Tobias Oberstein
2017-01-25 16:23                               ` Elliott, Robert (Persistent Memory)
2017-01-26 17:52                                 ` Tobias Oberstein
     [not found]         ` <CANvN+emM2xeKtEgVofOyKri6WBtjqc_o1LMT8Sfawb_RMRXT0g@mail.gmail.com>
2017-01-23 20:10           ` Tobias Oberstein
     [not found]             ` <CANvN+e=ityWtQj_TJ3yZgTM7mr17VB=3OeyQEEQvdb5tR5AGLA@mail.gmail.com>
     [not found]               ` <CANvN+emUGQ=voye=E6g4jFRxbp5eS8cGVJb3vTSn-bD5Db2Ycw@mail.gmail.com>
2017-01-23 20:20                 ` Tobias Oberstein
     [not found]             ` <CANvN+e=ASW14ShvY6dmVvUDY3PJVWwY9oQSbOT9EiOnQbSZHzA@mail.gmail.com>
     [not found]               ` <CANvN+ek0DgHF4gFAVep9ygdi=4pi9O9Fp5u3-VOd0iEVCSS0=Q@mail.gmail.com>
2017-01-23 21:49                 ` Tobias Oberstein
2017-01-23 18:18 ` Kudryavtsev, Andrey O
2017-01-23 18:53   ` Tobias Oberstein
2017-01-23 19:06     ` Kudryavtsev, Andrey O
2017-01-24  9:46       ` Tobias Oberstein
2017-01-24  9:55       ` Tobias Oberstein
2017-01-24 10:03       ` Tobias Oberstein
2017-01-24 15:19       ` Tobias Oberstein
