* Measuring IOPS
@ 2011-07-29 15:37 Martin Steigerwald
  2011-07-29 16:14 ` Martin Steigerwald
  2011-08-03 19:31 ` Measuring IOPS Martin Steigerwald
  0 siblings, 2 replies; 20+ messages in thread
From: Martin Steigerwald @ 2011-07-29 15:37 UTC (permalink / raw)
  To: fio, Jens Axboe

Hi!

I am currently writing an article about fio for a German print magazine 
after having packaged it for Debian and using it in performance analysis & 
tuning trainings.

After introducing the concepts of fio with some basic job files I'd 
like to show how to do meaningful IOPS measurements that also work with SSDs 
that compress.

For some first tests I came up with:

martin@merkaba:~[…]> cat iops.job 
[global]
size=2G
bsrange=2-16k
filename=iops1
numjobs=1
iodepth=1
# Random data for SSDs that compress
refill_buffers=1

[zufälligschreiben]
rw=randwrite
stonewall
[sequentiellschreiben]
rw=write
stonewall

[zufälliglesen]
rw=randread
stonewall
[sequentielllesen]
rw=read

(small German dictionary:
- zufällig => random
- lesen => read
- schreiben => write;)

This takes the following into account:
- It is recommended to just use one process. Why, actually? Why not just 
fill the device with as many requests as possible and see what it can 
handle?

- I do instruct fio to write random data, even refilling the buffer 
with different random data for each write - that's for compressing SSDs, 
those with newer SandForce chips.

- I let it do sync I/O because I want to measure the device, not the cache 
speed. I considered direct I/O, but at least with the sync I/O engine it does 
not work on Linux 3.0 with Ext4 on an LVM: invalid request. This may or 
may not be expected. I am wondering whether direct I/O is only for complete 
devices, not for filesystems.
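
If I ever get direct I/O to work, the global section might look like this 
untested sketch - I suspect the invalid request comes from the unaligned 
block sizes, so the range would have to start at 2k instead of 2 bytes, and 
I would probably switch to the libaio engine at the same time:

[global]
size=2G
# direct I/O needs block sizes aligned to the logical sector size
bsrange=2k-16k
filename=iops1
numjobs=1
iodepth=1
ioengine=libaio
direct=1
# random data for SSDs that compress
refill_buffers=1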


Things I didn't consider:
- I do not use the complete device. For obvious reasons here: I tested on 
an SSD that I use for production work as well ;).
  - Thus for a hard disk this might not be realistic enough, because a 
hard disk has different speeds at different cylinders. I think for 2-16 KB 
requests it shouldn't matter, though.
  - I am considering a read test on the complete device.

- The test does not go directly to the device, so there might be some Ext4 
/ LVM overhead. On the ThinkPad T520 with Intel i5 Sandybridge Dual Core 
CPU I think this is negligible.

- 2 GB might not be enough for reliable measurements
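
One way around that last point might be a bigger test file combined with a 
runtime cap per job - untested sketch, the 8G is an arbitrary number, and if 
I read the HOWTO correctly time_based keeps fio looping over the file until 
the runtime is up:

[global]
size=8G
# cap each phase at 60 seconds
runtime=60
# keep going over the file until the runtime expires
time_based=1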


Do you think the above job file could give realistic results? Any 
suggestions?


I got these results:

martin@merkaba:~[…]> ./fio iops.job 
zufälligschreiben: (g=0): rw=randwrite, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
sequentiellschreiben: (g=1): rw=write, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
zufälliglesen: (g=2): rw=randread, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
sequentielllesen: (g=2): rw=read, bs=2-16K/2-16K, ioengine=sync, iodepth=1
fio 1.57
Starting 4 processes
Jobs: 1 (f=1): [__r_] [100.0% done] [561.9M/0K /s] [339K/0  iops] [eta 
00m:00s]                  
zufälligschreiben: (groupid=0, jobs=1): err= 0: pid=23221
  write: io=2048.0MB, bw=16971KB/s, iops=5190 , runt=123573msec
    clat (usec): min=0 , max=275675 , avg=183.76, stdev=989.34
     lat (usec): min=0 , max=275675 , avg=184.02, stdev=989.36
    bw (KB/s) : min=  353, max=94417, per=99.87%, avg=16947.64, 
stdev=11562.05
  cpu          : usr=5.39%, sys=14.47%, ctx=344861, majf=0, minf=30
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=0/641383/0, short=0/0/0
     lat (usec): 2=4.48%, 4=23.03%, 10=27.46%, 20=5.22%, 50=2.17%
     lat (usec): 100=0.08%, 250=10.16%, 500=21.35%, 750=4.79%, 1000=0.06%
     lat (msec): 2=0.13%, 4=0.64%, 10=0.40%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%, 500=0.01%
sequentiellschreiben: (groupid=1, jobs=1): err= 0: pid=23227
  write: io=2048.0MB, bw=49431KB/s, iops=6172 , runt= 42426msec
    clat (usec): min=0 , max=83105 , avg=134.18, stdev=1286.14
     lat (usec): min=0 , max=83105 , avg=134.53, stdev=1286.14
    bw (KB/s) : min=    0, max=73767, per=109.57%, avg=54162.16, 
stdev=22989.92
  cpu          : usr=10.29%, sys=22.17%, ctx=232818, majf=0, minf=33
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=0/261869/0, short=0/0/0
     lat (usec): 2=0.10%, 4=1.16%, 10=9.97%, 20=1.14%, 50=0.09%
     lat (usec): 100=27.31%, 250=59.37%, 500=0.61%, 750=0.04%, 1000=0.02%
     lat (msec): 2=0.04%, 4=0.06%, 10=0.01%, 20=0.06%, 50=0.01%
     lat (msec): 100=0.03%
zufälliglesen: (groupid=2, jobs=1): err= 0: pid=23564
  read : io=2048.0MB, bw=198312KB/s, iops=60635 , runt= 10575msec
    clat (usec): min=0 , max=103758 , avg=14.46, stdev=1058.66
     lat (usec): min=0 , max=103758 , avg=14.50, stdev=1058.66
    bw (KB/s) : min=   98, max=1996998, per=54.76%, avg=217197.79, 
stdev=563543.94
  cpu          : usr=11.20%, sys=8.25%, ctx=513, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=641220/0/0, short=0/0/0
     lat (usec): 2=77.54%, 4=21.11%, 10=1.19%, 20=0.09%, 50=0.01%
     lat (usec): 100=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%
sequentielllesen: (groupid=2, jobs=1): err= 0: pid=23565
  read : io=2048.0MB, bw=235953KB/s, iops=29458 , runt=  8888msec
    clat (usec): min=0 , max=71904 , avg=30.61, stdev=278.25
     lat (usec): min=0 , max=71904 , avg=30.71, stdev=278.25
    bw (KB/s) : min=    2, max=266240, per=59.04%, avg=234162.53, 
stdev=63283.64
  cpu          : usr=3.42%, sys=16.70%, ctx=8326, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=261826/0/0, short=0/0/0
     lat (usec): 2=28.95%, 4=46.75%, 10=19.20%, 20=1.80%, 50=0.17%
     lat (usec): 100=0.11%, 250=0.05%, 500=0.05%, 750=0.24%, 1000=2.44%
     lat (msec): 2=0.15%, 4=0.08%, 10=0.01%, 20=0.01%, 100=0.01%

Run status group 0 (all jobs):
  WRITE: io=2048.0MB, aggrb=16970KB/s, minb=17378KB/s, maxb=17378KB/s, 
mint=123573msec, maxt=123573msec

Run status group 1 (all jobs):
  WRITE: io=2048.0MB, aggrb=49430KB/s, minb=50617KB/s, maxb=50617KB/s, 
mint=42426msec, maxt=42426msec

Run status group 2 (all jobs):
   READ: io=4096.0MB, aggrb=396624KB/s, minb=203071KB/s, maxb=241616KB/s, 
mint=8888msec, maxt=10575msec

Disk stats (read/write):
  dm-2: ios=577687/390944, merge=0/0, ticks=141180/6046100, 
in_queue=6187964, util=76.63%, aggrios=577469/390258, aggrmerge=216/761, 
aggrticks=140576/6004336, aggrin_queue=6144016, aggrutil=76.38%
    sda: ios=577469/390258, merge=216/761, ticks=140576/6004336, 
in_queue=6144016, util=76.38%

Which looks quite fine, I believe ;). I didn't run this test on a 
hard disk yet.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS
  2011-07-29 15:37 Measuring IOPS Martin Steigerwald
@ 2011-07-29 16:14 ` Martin Steigerwald
  2011-08-02 14:32   ` Measuring IOPS (solved, I think) Martin Steigerwald
  2011-08-03 19:31 ` Measuring IOPS Martin Steigerwald
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-07-29 16:14 UTC (permalink / raw)
  To: fio; +Cc: Jens Axboe

On Friday, 29 July 2011, Martin Steigerwald wrote:
> Hi!
> 
> I am currently writing an article about fio for a German print magazine
> after having packaged it for Debian and using it in performance
> analysis & tuning trainings.
> 
> After introducing the concepts of fio with some basic job files
> I'd like to show how to do meaningful IOPS measurements that also work with
> SSDs that compress.
> 
> For some first tests I came up with:
> 
> martin@merkaba:~[…]> cat iops.job
> [global]
> size=2G
> bsrange=2-16k
> filename=iops1
> numjobs=1
> iodepth=1
> # Random data for SSDs that compress
> refill_buffers=1
> 
> [zufälligschreiben]
> rw=randwrite
> stonewall
> [sequentiellschreiben]
> rw=write
> stonewall
> 
> [zufälliglesen]
> rw=randread
> stonewall
> [sequentielllesen]
> rw=read
> 
> (small German dictionary:
> - zufällig => random
> - lesen => read
> - schreiben => write;)
[...]
> Do you think the above job file could give realistic results? Any
> suggestions?
> 
> 
> I got these results:

With a simpler read job I have different results that puzzle me:

martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/fio> cat zweierlei-
lesen-2gb-variable-blockgrößen.job
[global]
rw=randread
size=2g
bsrange=2-16k

[zufälliglesen]
stonewall
[sequentielllesen]
rw=read

martin@merkaba:~[...]> ./fio zweierlei-lesen-2gb-variable-blockgrößen.job
zufälliglesen: (g=0): rw=randread, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
sequentielllesen: (g=0): rw=read, bs=2-16K/2-16K, ioengine=sync, iodepth=1
fio 1.57
Starting 2 processes
Jobs: 1 (f=1): [r_] [100.0% done] [96146K/0K /s] [88.3K/0  iops] [eta 
00m:00s]  
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=29273
  read : io=2048.0MB, bw=20915KB/s, iops=6389 , runt=100269msec
    clat (usec): min=0 , max=103772 , avg=150.09, stdev=1042.77
     lat (usec): min=0 , max=103772 , avg=150.34, stdev=1042.79
    bw (KB/s) : min=  131, max=112571, per=50.31%, avg=21045.54, 
stdev=13225.53
  cpu          : usr=4.66%, sys=11.24%, ctx=262203, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=640622/0/0, short=0/0/0
     lat (usec): 2=23.94%, 4=26.21%, 10=7.86%, 20=1.39%, 50=0.15%
     lat (usec): 100=0.01%, 250=14.76%, 500=21.53%, 750=3.77%, 1000=0.10%
     lat (msec): 2=0.16%, 4=0.09%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%
sequentielllesen: (groupid=0, jobs=1): err= 0: pid=29274
  read : io=2048.0MB, bw=254108KB/s, iops=31748 , runt=  8253msec
    clat (usec): min=0 , max=4773 , avg=30.44, stdev=173.41
     lat (usec): min=0 , max=4773 , avg=30.54, stdev=173.41
    bw (KB/s) : min=229329, max=265720, per=607.79%, avg=254236.81, 
stdev=8940.36
  cpu          : usr=4.02%, sys=16.97%, ctx=8407, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=262021/0/0, short=0/0/0
     lat (usec): 2=30.07%, 4=46.83%, 10=17.84%, 20=1.91%, 50=0.19%
     lat (usec): 100=0.12%, 250=0.02%, 500=0.02%, 750=0.21%, 1000=2.52%
     lat (msec): 2=0.16%, 4=0.10%, 10=0.01%

Run status group 0 (all jobs):
   READ: io=4096.0MB, aggrb=41830KB/s, minb=21417KB/s, maxb=260206KB/s, 
mint=8253msec, maxt=100269msec

Disk stats (read/write):
  dm-2: ios=267216/204, merge=0/0, ticks=95188/36, in_queue=95240, 
util=80.52%, aggrios=266989/191, aggrmerge=267/175, aggrticks=94712/44, 
aggrin_queue=94312, aggrutil=80.18%
    sda: ios=266989/191, merge=267/175, ticks=94712/44, in_queue=94312, 
util=80.18%


What's going on here? Where does the difference between 6389 IOPS for this 
simpler read job file versus 60635 IOPS for the IOPS job file come from? 
These results compared to the results from the IOPS job do not make sense 
to me. Is it just random versus zeros? Which values are more realistic? I 
wouldn't have thought that on an SSD random I/O versus sequential I/O 
should cause such a big difference.

Files are laid out as follows:

martin@merkaba:~[…]> sudo filefrag zufälliglesen.1.0 sequentielllesen.2.0 
iops1 
zufälliglesen.1.0: 17 extents found
sequentielllesen.2.0: 17 extents found
iops1: 258 extents found

Not that it should matter much on an SSD.

This is on a ThinkPad T520 with Intel i5 Sandybridge Dual Core, 8 GB of 
RAM and said Intel SSD 320. On Ext4 on LVM with the Linux 3.0.0 Debian 
package.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS (solved, I think)
  2011-07-29 16:14 ` Martin Steigerwald
@ 2011-08-02 14:32   ` Martin Steigerwald
  2011-08-02 19:48     ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-02 14:32 UTC (permalink / raw)
  To: fio; +Cc: Jens Axboe

[-- Attachment #1: Type: Text/Plain, Size: 12575 bytes --]

On Friday, 29 July 2011, Martin Steigerwald wrote:
> On Friday, 29 July 2011, Martin Steigerwald wrote:
> > Hi!
> > 
> > I am currently writing an article about fio for a German print
> > magazine after having packaged it for Debian and using it in
> > performance analysis & tuning trainings.
> > 
> > After introducing the concepts of fio with some basic job files
> > I'd like to show how to do meaningful IOPS measurements that also work with
> > SSDs that compress.
> > 
> > For some first tests I came up with:
> > 
> > martin@merkaba:~[…]> cat iops.job
> > [global]
> > size=2G
> > bsrange=2-16k
> > filename=iops1
> > numjobs=1
> > iodepth=1
> > # Random data for SSDs that compress
> > refill_buffers=1
> > 
> > [zufälligschreiben]
> > rw=randwrite
> > stonewall
> > [sequentiellschreiben]
> > rw=write
> > stonewall
> > 
> > [zufälliglesen]
> > rw=randread
> > stonewall
> > [sequentielllesen]
> > rw=read
> > 
> > (small German dictionary:
> > - zufällig => random
> > - lesen => read
> > - schreiben => write;)
> 
> [...]
> 
> > Do you think the above job file could give realistic results? Any
> > suggestions?
> 
> > I got these results:
> With a simpler read job I have different results that puzzle me:
> 
> martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/fio> cat
> zweierlei- lesen-2gb-variable-blockgrößen.job
> [global]
> rw=randread
> size=2g
> bsrange=2-16k
> 
> [zufälliglesen]
> stonewall
> [sequentielllesen]
> rw=read
[...]

> What's going on here? Where does the difference between 6389 IOPS for
> this simpler read job file versus 60635 IOPS for the IOPS job file
> come from? These results compared to the results from the IOPS job do
> not make sense to me. Is it just random versus zeros? Which values are
> more realistic? I wouldn't have thought that on an SSD random I/O versus
> sequential I/O should cause such a big difference.
> 
> Files are laid out as follows:
> 
> martin@merkaba:~[…]> sudo filefrag zufälliglesen.1.0
> sequentielllesen.2.0 iops1
> zufälliglesen.1.0: 17 extents found
> sequentielllesen.2.0: 17 extents found
> iops1: 258 extents found
> 
> Not that it should matter much on an SSD.
> 
> This is on a ThinkPad T520 with Intel i5 Sandybridge Dual Core, 8 GB
> of RAM and said Intel SSD 320. On Ext4 on LVM with the Linux 3.0.0 Debian
> package.

I think I found it.

It depends on whether I specify a filename explicitly *and* globally or 
not. When I specify it, the job runs fast - no matter whether there are 
zeros or random data in the file, as expected for this SSD. When I do not 
specify it, or specify it only in a section, the job runs slow.


Now follows the complete investigation and explanation of why this was 
so:

The only difference between the job files is:

martin@merkaba:~[...]> cat zweierlei-lesen-2gb-variable-blockgrößen-
jobfile-given.job
[global]
rw=randread
size=2g
bsrange=2-16k
filename=zufälliglesen.1.0

[zufälliglesen]
stonewall
[sequentielllesen]
rw=read

The only difference in the implicit-jobfile job is that I commented out the 
filename option. But since the filename matches what fio would choose by 
itself, fio should use *the same* file in both cases.


Steps to (hopefully) reproduce it:

1. do the following once to have the test files created: fio zweierlei-
lesen-2gb-variable-blockgrößen-jobfile-given.job

2. do 

su -c "echo 3 > /proc/sys/vm/drop_caches" ; fio zweierlei-lesen-2gb-
variable-blockgrößen-jobfile-implicit.job > zweierlei-lesen-2gb-variable-
blockgrößen-jobfile-implicit.results

fio runs slow.

3. do

su -c "echo 3 > /proc/sys/vm/drop_caches" ; fio zweierlei-lesen-2gb-
variable-blockgrößen-jobfile-given.job > zweierlei-lesen-2gb-variable-
blockgrößen-jobfile-given.results

fio runs fast.

4. Use kompare, vimdiff or another side-by-side diff tool to compare the results.


I do think that echo 3 > /proc/sys/vm/drop_caches is not needed; as far as 
I understand, fio clears the cache for the test files if not told otherwise.
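
If I read the HOWTO correctly that is the invalidate option, which defaults 
to on anyway; making it explicit would look like this:

[global]
# drop the page cache for the test file before the job starts (the default)
invalidate=1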


Aside from the speed difference I only found one difference that might 
explain it. The fast run shows

  cpu          : usr=11.54%, sys=9.26%, ctx=3392, majf=0, minf=28

versus this in the slow run:

  cpu          : usr=4.87%, sys=11.20%, ctx=261968, majf=0, minf=26


Any hints why giving or not giving the filename makes such a big difference?

I also tested whether this might be a UTF-8 issue and rewrote the job 
files to not use any umlauts. That didn't make any difference.

I narrowed it down even a bit more:

martin@merkaba:~[...]> diff -u zweierlei-lesen-2gb-variable-blockgroessen-
jobfile-given-in-section-no-utf8.job zweierlei-lesen-2gb-variable-
blockgroessen-jobfile-given-no-utf8.job 
--- zweierlei-lesen-2gb-variable-blockgroessen-jobfile-given-in-section-no-
utf8.job     2011-08-02 15:35:41.246226877 +0200
+++ zweierlei-lesen-2gb-variable-blockgroessen-jobfile-given-no-utf8.job        
2011-08-02 15:50:00.073095677 +0200
@@ -2,9 +2,9 @@
 rw=randread
 size=2g
 bsrange=2-16k
+filename=zufaelliglesen.1.0
 
 [zufaelliglesen]
-filename=zufaelliglesen.1.0
 stonewall
 [sequentielllesen]
 rw=read

makes the difference.


Okay, and now I understand it:

From the progress display I see that fio runs the second job first. 
When the filename is in the global section, both jobs use the same file. And 
with the missing stonewall option in the second job section the sequential 
read job even runs in parallel. I wondered why I had [rR] there.


Okay, then I know:

When I want to have multiple jobs run one after another I need to put a 
stonewall option in *each* job. *Also the last one*, because in the notion 
of fio there is no last job, as fio sets up each job before it starts job 
execution. And it seems that it runs everything that has no 
stonewall option right away, even if an earlier defined job has a 
stonewall option. Fio only thinks sequentially for the jobs that have a 
stonewall option, which might be a disadvantage if I want to run groups of 
jobs one after another:

[readjob]
blabla

[parallelwritejob]
blabla
stonewall

[randomreadjob]
blabla

[sequentialreadjob]
blabla
stonewall

As far as I understand it, fio would then run both jobs without a 
stonewall option straight away while it also starts the first job with a 
stonewall option. Then, whether the jobs without the stonewall option are 
still running or not, fio starts the second job without the stonewall option.

So I understand it now. Personally I would prefer it if fio ran the first two 
jobs in parallel, then did the stonewall, and then the second two jobs in 
parallel. Is this possible somehow?
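
Or would putting the stonewall only into the first job of each group that 
has to wait do exactly that? Something like this (untested, option bodies 
abbreviated as above):

[readjob]
blabla

[parallelwritejob]
blabla

[randomreadjob]
# waits for both jobs above to finish
stonewall
blabla

[sequentialreadjob]
# runs in parallel with randomreadjob
blabla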



With an additional "stonewall" for the last job in the iops.job file I made 
up, it also works when the filename is specified globally.

So, there we go:

martin@merkaba:~[...]> fio iops-done-right.job 
zufälligschreiben: (g=0): rw=randwrite, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
sequentiellschreiben: (g=1): rw=write, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
zufälliglesen: (g=2): rw=randread, bs=2-16K/2-16K, ioengine=sync, 
iodepth=1
sequentielllesen: (g=3): rw=read, bs=2-16K/2-16K, ioengine=sync, iodepth=1
fio 1.57
Starting 4 processes
Jobs: 1 (f=1): [___R] [100.0% done] [268.3M/0K /s] [33.6K/0  iops] [eta 
00m:00s]                
zufälligschreiben: (groupid=0, jobs=1): err= 0: pid=20474
  write: io=2048.0MB, bw=16686KB/s, iops=5096 , runt=125687msec
    clat (usec): min=0 , max=292792 , avg=188.16, stdev=933.01
     lat (usec): min=0 , max=292792 , avg=188.42, stdev=933.04
    bw (KB/s) : min=  320, max=63259, per=100.00%, avg=16684.86, 
stdev=9052.82
  cpu          : usr=4.60%, sys=13.58%, ctx=344489, majf=0, minf=31
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=0/640575/0, short=0/0/0
     lat (usec): 2=4.55%, 4=23.30%, 10=27.49%, 20=5.00%, 50=2.07%
     lat (usec): 100=0.09%, 250=12.02%, 500=20.41%, 750=3.46%, 1000=0.11%
     lat (msec): 2=0.31%, 4=0.78%, 10=0.40%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%, 500=0.01%
sequentiellschreiben: (groupid=1, jobs=1): err= 0: pid=20482
  write: io=2048.0MB, bw=49401KB/s, iops=6176 , runt= 42452msec
    clat (usec): min=0 , max=213632 , avg=132.32, stdev=1355.16
     lat (usec): min=1 , max=213632 , avg=132.65, stdev=1355.16
    bw (KB/s) : min=    2, max=79902, per=110.53%, avg=54600.95, 
stdev=24636.03
  cpu          : usr=10.83%, sys=21.02%, ctx=232933, majf=0, minf=34
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=0/262197/0, short=0/0/0
     lat (usec): 2=0.09%, 4=1.07%, 10=10.43%, 20=0.89%, 50=0.07%
     lat (usec): 100=35.03%, 250=51.57%, 500=0.57%, 750=0.03%, 1000=0.01%
     lat (msec): 2=0.04%, 4=0.08%, 10=0.02%, 20=0.06%, 50=0.01%
     lat (msec): 100=0.03%, 250=0.01%
zufälliglesen: (groupid=2, jobs=1): err= 0: pid=20484
  read : io=2048.0MB, bw=23151KB/s, iops=7050 , runt= 90584msec
    clat (usec): min=0 , max=70235 , avg=134.92, stdev=212.70
     lat (usec): min=0 , max=70236 , avg=135.16, stdev=212.79
    bw (KB/s) : min=    5, max=118959, per=100.36%, avg=23233.61, 
stdev=12885.69
  cpu          : usr=4.55%, sys=13.30%, ctx=259109, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=638666/0/0, short=0/0/0
     lat (usec): 2=21.65%, 4=30.12%, 10=7.42%, 20=0.69%, 50=0.06%
     lat (usec): 100=0.01%, 250=14.58%, 500=21.58%, 750=3.85%, 1000=0.01%
     lat (msec): 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%
sequentielllesen: (groupid=3, jobs=1): err= 0: pid=20820
  read : io=2048.0MB, bw=267392KB/s, iops=33432 , runt=  7843msec
    clat (usec): min=0 , max=4098 , avg=28.40, stdev=143.49
     lat (usec): min=0 , max=4098 , avg=28.51, stdev=143.49
    bw (KB/s) : min=176584, max=275993, per=100.04%, avg=267511.13, 
stdev=25285.86
  cpu          : usr=4.18%, sys=21.27%, ctx=8616, majf=0, minf=29
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued r/w/d: total=262210/0/0, short=0/0/0
     lat (usec): 2=26.04%, 4=47.22%, 10=21.85%, 20=1.61%, 50=0.13%
     lat (usec): 100=0.02%, 250=0.01%, 500=0.01%, 750=0.14%, 1000=2.96%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
  WRITE: io=2048.0MB, aggrb=16685KB/s, minb=17085KB/s, maxb=17085KB/s, 
mint=125687msec, maxt=125687msec

Run status group 1 (all jobs):
  WRITE: io=2048.0MB, aggrb=49400KB/s, minb=50586KB/s, maxb=50586KB/s, 
mint=42452msec, maxt=42452msec

Run status group 2 (all jobs):
   READ: io=2048.0MB, aggrb=23151KB/s, minb=23707KB/s, maxb=23707KB/s, 
mint=90584msec, maxt=90584msec

Run status group 3 (all jobs):
   READ: io=2048.0MB, aggrb=267391KB/s, minb=273808KB/s, maxb=273808KB/s, 
mint=7843msec, maxt=7843msec

Disk stats (read/write):
  dm-2: ios=832862/416089, merge=0/0, ticks=206456/6699696, 
in_queue=6907728, util=78.36%, aggrios=833046/418069, aggrmerge=95/663, 
aggrticks=206032/6668768, aggrin_queue=6873712, aggrutil=77.80%
    sda: ios=833046/418069, merge=95/663, ticks=206032/6668768, 
in_queue=6873712, util=77.80%

martin@merkaba:~[...]> diff -u iops.job iops-done-right.job 
--- iops.job    2011-07-29 16:40:41.776809061 +0200
+++ iops-done-right.job 2011-08-02 16:15:06.055626894 +0200
@@ -19,4 +19,5 @@
 stonewall
 [sequentielllesen]
 rw=read
+stonewall


So always question results that don't make sense.

May this serve as a pointer should anyone stumble upon something like this 
;)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: zweierlei-lesen-2gb-variable-blockgrößen-jobfile-given.job --]
[-- Type: text/plain, Size: 136 bytes --]

[global]
rw=randread
size=2g
bsrange=2-16k
filename=zufälliglesen.1.0

[zufälliglesen]
stonewall
[sequentielllesen]
rw=read

[-- Attachment #3: zweierlei-lesen-2gb-variable-blockgrößen-jobfile-given.results --]
[-- Type: text/plain, Size: 2428 bytes --]

zufälliglesen: (g=0): rw=randread, bs=2-16K/2-16K, ioengine=sync, iodepth=1
sequentielllesen: (g=0): rw=read, bs=2-16K/2-16K, ioengine=sync, iodepth=1
fio 1.57
Starting 2 processes

zufälliglesen: (groupid=0, jobs=1): err= 0: pid=14723
  read : io=2048.0MB, bw=186066KB/s, iops=56794 , runt= 11271msec
    clat (usec): min=0 , max=103559 , avg=15.57, stdev=1069.35
     lat (usec): min=0 , max=103559 , avg=15.62, stdev=1069.35
    bw (KB/s) : min=  243, max=843245, per=53.37%, avg=198607.29, stdev=327979.47
  cpu          : usr=11.54%, sys=9.26%, ctx=3392, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=640132/0/0, short=0/0/0
     lat (usec): 2=67.88%, 4=25.43%, 10=5.79%, 20=0.36%, 50=0.05%
     lat (usec): 100=0.01%, 250=0.15%, 500=0.26%, 750=0.04%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%
sequentielllesen: (groupid=0, jobs=1): err= 0: pid=14724
  read : io=2048.0MB, bw=255813KB/s, iops=31989 , runt=  8198msec
    clat (usec): min=0 , max=22658 , avg=30.46, stdev=171.40
     lat (usec): min=0 , max=22658 , avg=30.56, stdev=171.39
    bw (KB/s) : min=240308, max=264524, per=68.85%, avg=256220.81, stdev=6556.93
  cpu          : usr=4.39%, sys=17.18%, ctx=8630, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=262251/0/0, short=0/0/0
     lat (usec): 2=26.88%, 4=48.21%, 10=19.94%, 20=1.59%, 50=0.16%
     lat (usec): 100=0.09%, 250=0.04%, 500=0.05%, 750=0.16%, 1000=2.60%
     lat (msec): 2=0.21%, 4=0.06%, 50=0.01%

Run status group 0 (all jobs):
   READ: io=4096.0MB, aggrb=372132KB/s, minb=190531KB/s, maxb=261952KB/s, mint=8198msec, maxt=11271msec

Disk stats (read/write):
  dm-2: ios=11567/15, merge=0/0, ticks=23592/508, in_queue=24100, util=79.23%, aggrios=11473/353, aggrmerge=156/14, aggrticks=24556/2332, aggrin_queue=26864, aggrutil=77.77%
    sda: ios=11473/353, merge=156/14, ticks=24556/2332, in_queue=26864, util=77.77%

[-- Attachment #4: zweierlei-lesen-2gb-variable-blockgrößen-jobfile-implicit.job --]
[-- Type: text/plain, Size: 137 bytes --]

[global]
rw=randread
size=2g
bsrange=2-16k
#filename=zufälliglesen.1.0

[zufälliglesen]
stonewall
[sequentielllesen]
rw=read

[-- Attachment #5: zweierlei-lesen-2gb-variable-blockgrößen-jobfile-implicit.results --]
[-- Type: text/plain, Size: 2410 bytes --]

zufälliglesen: (g=0): rw=randread, bs=2-16K/2-16K, ioengine=sync, iodepth=1
sequentielllesen: (g=0): rw=read, bs=2-16K/2-16K, ioengine=sync, iodepth=1
fio 1.57
Starting 2 processes

zufälliglesen: (groupid=0, jobs=1): err= 0: pid=14745
  read : io=2048.0MB, bw=21516KB/s, iops=6567 , runt= 97469msec
    clat (usec): min=0 , max=103566 , avg=146.45, stdev=1099.54
     lat (usec): min=0 , max=103566 , avg=146.68, stdev=1099.56
    bw (KB/s) : min=  132, max=118508, per=49.82%, avg=21437.43, stdev=13479.42
  cpu          : usr=4.87%, sys=11.20%, ctx=261968, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=640153/0/0, short=0/0/0
     lat (usec): 2=21.55%, 4=30.22%, 10=7.16%, 20=0.58%, 50=0.06%
     lat (usec): 100=0.01%, 250=15.55%, 500=21.14%, 750=3.68%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.03%, 20=0.01%, 50=0.01%, 100=0.01%
     lat (msec): 250=0.01%
sequentielllesen: (groupid=0, jobs=1): err= 0: pid=14746
  read : io=2048.0MB, bw=256909KB/s, iops=32115 , runt=  8163msec
    clat (usec): min=0 , max=3979 , avg=30.33, stdev=166.59
     lat (usec): min=0 , max=3979 , avg=30.44, stdev=166.58
    bw (KB/s) : min=245618, max=267008, per=597.43%, avg=257087.19, stdev=6569.89
  cpu          : usr=3.97%, sys=17.30%, ctx=8475, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=262162/0/0, short=0/0/0
     lat (usec): 2=28.30%, 4=47.36%, 10=19.41%, 20=1.64%, 50=0.15%
     lat (usec): 100=0.06%, 250=0.01%, 500=0.01%, 750=0.11%, 1000=2.69%
     lat (msec): 2=0.20%, 4=0.06%

Run status group 0 (all jobs):
   READ: io=4096.0MB, aggrb=43032KB/s, minb=22032KB/s, maxb=263075KB/s, mint=8163msec, maxt=97469msec

Disk stats (read/write):
  dm-2: ios=267158/84, merge=0/0, ticks=94852/676, in_queue=95528, util=82.25%, aggrios=267040/962, aggrmerge=164/37, aggrticks=94356/5816, aggrin_queue=99820, aggrutil=81.77%
    sda: ios=267040/962, merge=164/37, ticks=94356/5816, in_queue=99820, util=81.77%


* Re: Measuring IOPS (solved, I think)
  2011-08-02 14:32   ` Measuring IOPS (solved, I think) Martin Steigerwald
@ 2011-08-02 19:48     ` Jens Axboe
  2011-08-02 21:28       ` Martin Steigerwald
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2011-08-02 19:48 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: fio


That's a long email! The stonewall should be put in the job section that
has to wait for previous jobs. So, ala:

[job1]
something

[job2]
stonewall       # will wait for job1 to finish
something

[job3]
something       # will run in parallel with job2

[job4]
stonewall       # will run when job2+3 are finished
something

If that's not the case, something is broken. A quick test here seems to
show that it works.

-- 
Jens Axboe



* Re: Measuring IOPS (solved, I think)
  2011-08-02 19:48     ` Jens Axboe
@ 2011-08-02 21:28       ` Martin Steigerwald
  2011-08-03  7:17         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-02 21:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

On Tuesday, 2 August 2011, you wrote:
> That's a long email! The stonewall should be put in the job section
> that has to wait for previous jobs. So, ala:
> 
> [job1]
> something
> 
> [job2]
> stonewall       # will wait for job1 to finish
> something
> 
> [job3]
> something       # will run in parallel with job2
> 
> [job4]
> stonewall       # will run when job2+3 are finished
> something
> 
> If that's not the case, something is broken. A quick test here seems to
> show that it works.

It's documented. From the manpage that I have read several times by now:

Wait for preceding jobs in the job file to exit before starting this one.  
stonewall implies new_group.


Somehow, despite my reading of the manpage, README and HOWTO, I came to the 
thought that it tells fio to wait for the current job to finish, thus I had 
the stonewall options misordered.

I expect that it works exactly as you said and will try it this way. Instead 
of omitting the last stonewall option in my iops job file I could omit the 
first one, for the first job, because the first job does not need to wait for 
a previous job.
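
If I got it right now, the job sections of iops.job would then simply read 
like this (global section unchanged):

[zufälligschreiben]
rw=randwrite

[sequentiellschreiben]
stonewall
rw=write

[zufälliglesen]
stonewall
rw=randread

[sequentielllesen]
stonewall
rw=read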

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS (solved, I think)
  2011-08-02 21:28       ` Martin Steigerwald
@ 2011-08-03  7:17         ` Jens Axboe
  2011-08-03  9:03           ` Martin Steigerwald
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2011-08-03  7:17 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: fio

On 2011-08-02 23:28, Martin Steigerwald wrote:
> On Tuesday, 2 August 2011, you wrote:
>> That's a long email! The stonewall should be put in the job section
>> that has to wait for previous jobs. So, ala:
>>
>> [job1]
>> something
>>
>> [job2]
>> stonewall       # will wait for job1 to finish
>> something
>>
>> [job3]
>> something       # will run in parallel with job2
>>
>> [job4]
>> stonewall       # will run when job2+3 are finished
>> something
>>
>> If that's not the case, something is broken. A quick test here seems to
>> show that it works.
> 
> It's documented. From the manpage that I read several times by now:
> 
> Wait for preceding jobs in the job file to exit before starting this one.  
> stonewall implies new_group.
> 
> 
> Somehow despite my reading of manpage, README, HOWTO I came to the thought 
> that it tells fio to wait for the current job to finish, thus I had the 
> stonewall options misordered.
> 
> I expect that it works exactly as you said and try it this way. Instead of 
> omitting the last stonewall option in my iops job file I could omit the 
> first for the first job, because the first job does not need to wait for a 
> previous job.

Good, that makes me feel a little better :-)

Perhaps the name isn't that great? I'll gladly put in an alias for that
option, "wait_for_previous" or "barrier" or something like that. Fence?

-- 
Jens Axboe



* Re: Measuring IOPS (solved, I think)
  2011-08-03  7:17         ` Jens Axboe
@ 2011-08-03  9:03           ` Martin Steigerwald
  2011-08-03 10:34             ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-03  9:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

On Wednesday, 3 August 2011, you wrote:
> On 2011-08-02 23:28, Martin Steigerwald wrote:
> > On Tuesday, 2 August 2011, you wrote:
> >> That's a long email! The stonewall should be put in the job section
> >> that has to wait for previous jobs. So, ala:
> >> 
> >> [job1]
> >> something
> >> 
> >> [job2]
> >> stonewall       # will wait for job1 to finish
> >> something
> >> 
> >> [job3]
> >> something       # will run in parallel with job2
> >> 
> >> [job4]
> >> stonewall       # will run when job2+3 are finished
> >> something
> >> 
> >> If that's not the case, something is broken. A quick test here seems
> >> to show that it works.
> > 
> > It's documented. From the manpage that I read several times by now:
> > 
> > Wait for preceding jobs in the job file to exit before starting this
> > one. stonewall implies new_group.
> > 
> > 
> > Somehow despite my reading of manpage, README, HOWTO I came to the
> > thought that it tells fio to wait for the current job to finish,
> > thus I had the stonewall options misordered.
> > 
> > I expect that it works exactly as you said and try it this way.
> > Instead of omitting the last stonewall option in my iops job file I
> > could omit the first for the first job, because the first job does not
> > need to wait for a previous job.
> 
> Good, that makes me feel a little better :-)

What did you feel bad about? I didn't intend to trigger bad feelings.

There was nothing wrong with fio. The behavior was documented.

> Perhaps the name isn't that great? I'll gladly put in an alias for that
> option, "wait_for_previous" or "barrier" or something like that. Fence?

wait_before? But then "wait_for_previous" might be the clearest 
description. "wait_before" would make sense with an "wait_after" that 
waits after the job for its completion. But two options for basically the 
same thing might complicate matters even more.

So "wait_for_previous" or maybe "finish_previous_first" or just 
"finish_previous" would be fine with me.

But then this doesn't imply that fio does a cache flush. But that could be 
documented in the manpage with an additional hint on this option. I will 
think about it and possibly provide a patch.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS (solved, I think)
  2011-08-03  9:03           ` Martin Steigerwald
@ 2011-08-03 10:34             ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2011-08-03 10:34 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: fio

On 2011-08-03 11:03, Martin Steigerwald wrote:
>> Perhaps the name isn't that great? I'll gladly put in an alias for that
>> option, "wait_for_previous" or "barrier" or something like that. Fence?
> 
> wait_before? But then "wait_for_previous" might be the clearest 
> description. "wait_before" would make sense with an "wait_after" that 
> waits after the job for its completion. But two options for basically the 
> same thing might complicate matters even more.

Yes, I'm not going to add another option where only the placement of it
would make a difference. I'll add wait_for_previous.

> So "wait_for_previous" or maybe "finish_previous_first" or just 
> "finish_previous" would be fine with me.
> 
> But then this doesn't imply that fio does a cache flush. But that could be 
> documented in the manpage with an additional hint on this option. I will 
> think about it and possibly provide a patch.

Not really impacted by that; those are controlled on a job-by-job basis
anyway.
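
Once that alias is in, something like the following should be equivalent to 
the stonewall spelling:

[job2]
# same meaning as stonewall: wait for all preceding jobs to exit
wait_for_previous
something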

-- 
Jens Axboe



* Re: Measuring IOPS
  2011-07-29 15:37 Measuring IOPS Martin Steigerwald
  2011-07-29 16:14 ` Martin Steigerwald
@ 2011-08-03 19:31 ` Martin Steigerwald
  2011-08-03 20:22   ` Jeff Moyer
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-03 19:31 UTC (permalink / raw)
  To: fio

[-- Attachment #1: Type: Text/Plain, Size: 1661 bytes --]

On Friday, 29 July 2011, you wrote:
> Hi!
> 
> I am currently writing an article about fio for a German print magazine
> after having packaged it for Debian and using it in performance
> analysis & tuning trainings.
> 
> After introducing the concepts of fio with some basic job files
> I'd like to show how to do meaningful IOPS measurements that also work with
> SSDs that compress.
> 
> For some first tests I came up with:
> 
> martin@merkaba:~[…]> cat iops.job
> [global]
> size=2G
> bsrange=2-16k
> filename=iops1
> numjobs=1
> iodepth=1
> # Random data for SSDs that compress
> refill_buffers=1
> 
> [zufälligschreiben]
> rw=randwrite
> stonewall
> [sequentiellschreiben]
> rw=write
> stonewall
> 
> [zufälliglesen]
> rw=randread
> stonewall
> [sequentielllesen]
> rw=read

Even with the additional stonewall this still isn't accurate. I found this 
out by getting completely bogus values with a SoftRAID 1 on two SAS disks.

It needs the following additional changes:

- ioengine=libaio
- direct=1
- and then, due to the direct I/O alignment requirement: bsrange=2k-16k

So I now also fully understand that ioengine=sync just refers to the 
synchronous nature of the system calls used, not to whether the I/Os are 
issued synchronously via sync=1 or bypass the page cache via 
direct=1.
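
To write it down for myself, the three knobs would roughly look like this 
(sketch, not a complete job file):

[buffered]
# synchronous read()/write() style system calls, still through the page cache
ioengine=sync

[osync]
ioengine=sync
# open the file with O_SYNC: writes complete only once the data is on stable storage
sync=1

[odirect]
ioengine=libaio
# O_DIRECT: bypass the page cache; block sizes have to be aligned
direct=1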

Attached are results that bring down the read IOPS drastically! I first let 
sequentiell.job write out the complete 2 GB with random data and then ran 
the iops.job.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: iops.job --]
[-- Type: text/plain, Size: 523 bytes --]

[global]
ioengine=libaio
direct=1
# For random data, run the sequentiell job
# beforehand
# Important for SSDs that compress
filename=testdatei
size=2G
bsrange=2k-16k

# What gets written here should of course
# also be random again
refill_buffers=1

[zufälliglesen]
stonewall
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

[zufälligschreiben]
stonewall
rw=randwrite
runtime=60

[sequentiellschreiben]
stonewall
rw=write
runtime=60


[-- Attachment #3: iops.log --]
[-- Type: text/x-log, Size: 4940 bytes --]

zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
fio 1.57
Starting 4 processes

zufälliglesen: (groupid=0, jobs=1): err= 0: pid=6954
  read : io=1322.9MB, bw=22563KB/s, iops=3194 , runt= 60001msec
    slat (usec): min=6 , max=1763 , avg=29.52, stdev=12.62
    clat (usec): min=2 , max=7206 , avg=274.52, stdev=114.08
     lat (usec): min=128 , max=7246 , avg=304.68, stdev=116.81
    bw (KB/s) : min=18844, max=25304, per=100.15%, avg=22596.20, stdev=1740.26
  cpu          : usr=4.15%, sys=10.50%, ctx=193490, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=191664/0/0, short=0/0/0
     lat (usec): 4=0.01%, 10=0.01%, 50=0.01%, 100=0.01%, 250=49.32%
     lat (usec): 500=48.57%, 750=1.98%, 1000=0.05%
     lat (msec): 2=0.05%, 4=0.02%, 10=0.01%
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=6956
  read : io=2048.0MB, bw=72598KB/s, iops=8066 , runt= 28887msec
    slat (usec): min=5 , max=1909 , avg=26.76, stdev= 8.98
    clat (usec): min=1 , max=4631 , avg=91.18, stdev=36.03
     lat (usec): min=40 , max=4644 , avg=118.51, stdev=37.86
    bw (KB/s) : min=70224, max=77412, per=100.09%, avg=72663.79, stdev=1589.19
  cpu          : usr=6.47%, sys=24.83%, ctx=234568, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=233021/0/0, short=0/0/0
     lat (usec): 2=0.01%, 4=0.01%, 10=0.01%, 50=0.71%, 100=65.76%
     lat (usec): 250=33.16%, 500=0.29%, 750=0.05%, 1000=0.02%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=6958
  write: io=2048.0MB, bw=36083KB/s, iops=6594 , runt= 58121msec
    slat (usec): min=6 , max=1952 , avg=31.79, stdev= 9.51
    clat (usec): min=0 , max=19882 , avg=113.47, stdev=216.71
     lat (usec): min=44 , max=19949 , avg=145.84, stdev=217.32
    bw (KB/s) : min=14000, max=58580, per=100.12%, avg=36125.51, stdev=10544.88
  cpu          : usr=5.66%, sys=23.66%, ctx=386270, majf=0, minf=17
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/383305/0, short=0/0/0
     lat (usec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=1.92%
     lat (usec): 100=61.91%, 250=30.73%, 500=5.22%, 750=0.12%, 1000=0.04%
     lat (msec): 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=6959
  write: io=2048.0MB, bw=63465KB/s, iops=7050 , runt= 33044msec
    slat (usec): min=6 , max=2854 , avg=30.54, stdev=11.23
    clat (usec): min=1 , max=19371 , avg=104.68, stdev=190.45
     lat (usec): min=43 , max=19417 , avg=135.81, stdev=191.17
    bw (KB/s) : min=22984, max=68224, per=100.07%, avg=63511.21, stdev=5443.51
  cpu          : usr=6.16%, sys=24.62%, ctx=234687, majf=0, minf=19
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/232969/0, short=0/0/0
     lat (usec): 2=0.01%, 4=0.01%, 10=0.01%, 50=0.68%, 100=51.92%
     lat (usec): 250=46.97%, 500=0.22%, 750=0.07%, 1000=0.07%
     lat (msec): 2=0.04%, 4=0.01%, 10=0.01%, 20=0.01%

Run status group 0 (all jobs):
   READ: io=1322.9MB, aggrb=22563KB/s, minb=23104KB/s, maxb=23104KB/s, mint=60001msec, maxt=60001msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=72598KB/s, minb=74340KB/s, maxb=74340KB/s, mint=28887msec, maxt=28887msec

Run status group 2 (all jobs):
  WRITE: io=2048.0MB, aggrb=36082KB/s, minb=36948KB/s, maxb=36948KB/s, mint=58121msec, maxt=58121msec

Run status group 3 (all jobs):
  WRITE: io=2048.0MB, aggrb=63465KB/s, minb=64988KB/s, maxb=64988KB/s, mint=33044msec, maxt=33044msec

Disk stats (read/write):
  dm-2: ios=424704/615629, merge=0/0, ticks=70028/59768, in_queue=129796, util=71.90%, aggrios=424704/616498, aggrmerge=0/60, aggrticks=69568/60584, aggrin_queue=128920, aggrutil=71.33%
    sda: ios=424704/616498, merge=0/60, ticks=69568/60584, in_queue=128920, util=71.33%

[-- Attachment #4: sequentiell.job --]
[-- Type: text/plain, Size: 221 bytes --]

[global]
ioengine=libaio
direct=1
filename=testdatei
size=2g
bs=4m

# Completely random data for SSDs that compress
refill_buffers=1

[schreiben]
stonewall
rw=write

[lesen]
stonewall
rw=read

[-- Attachment #5: sequentiell.log --]
[-- Type: text/x-log, Size: 2432 bytes --]

[global]
ioengine=libaio
direct=1
filename=testdatei
size=2g
bs=4m

[schreiben]
stonewall
rw=write

[lesen]
stonewall
rw=read
schreiben: (g=0): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=1
lesen: (g=1): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=1
fio 1.57
Starting 2 processes
schreiben: Laying out IO file(s) (1 file(s) / 2048MB)

schreiben: (groupid=0, jobs=1): err= 0: pid=5855
  write: io=2048.0MB, bw=220150KB/s, iops=53 , runt=  9526msec
    slat (usec): min=239 , max=1328 , avg=452.88, stdev=182.05
    clat (msec): min=17 , max=22 , avg=18.14, stdev= 1.10
     lat (msec): min=17 , max=23 , avg=18.59, stdev= 1.12
    bw (KB/s) : min=216422, max=223128, per=100.08%, avg=220331.94, stdev=2205.18
  cpu          : usr=0.17%, sys=2.44%, ctx=557, majf=0, minf=19
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/512/0, short=0/0/0

     lat (msec): 20=93.55%, 50=6.45%
lesen: (groupid=1, jobs=1): err= 0: pid=5856
  read : io=2048.0MB, bw=267460KB/s, iops=65 , runt=  7841msec
    slat (usec): min=251 , max=4071 , avg=581.06, stdev=300.62
    clat (usec): min=14517 , max=17700 , avg=14724.38, stdev=340.74
     lat (usec): min=14906 , max=20094 , avg=15306.37, stdev=451.23
    bw (KB/s) : min=264000, max=270336, per=100.07%, avg=267634.87, stdev=1787.07
  cpu          : usr=0.10%, sys=3.78%, ctx=569, majf=0, minf=1045
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=512/0/0, short=0/0/0

     lat (msec): 20=100.00%

Run status group 0 (all jobs):
  WRITE: io=2048.0MB, aggrb=220150KB/s, minb=225433KB/s, maxb=225433KB/s, mint=9526msec, maxt=9526msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=267459KB/s, minb=273878KB/s, maxb=273878KB/s, mint=7841msec, maxt=7841msec

Disk stats (read/write):
  dm-2: ios=3991/4196, merge=0/0, ticks=33880/43220, in_queue=77124, util=96.78%, aggrios=4112/4143, aggrmerge=0/56, aggrticks=34944/42968, aggrin_queue=77904, aggrutil=96.79%
    sda: ios=4112/4143, merge=0/56, ticks=34944/42968, in_queue=77904, util=96.79%


* Re: Measuring IOPS
  2011-08-03 19:31 ` Measuring IOPS Martin Steigerwald
@ 2011-08-03 20:22   ` Jeff Moyer
  2011-08-03 20:33     ` Martin Steigerwald
  2011-08-03 20:42     ` Martin Steigerwald
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Moyer @ 2011-08-03 20:22 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: fio

Martin Steigerwald <Martin@lichtvoll.de> writes:

> - ioengine=libaio
> - direct=1
> - and then due to direct I/O alignment requirement: bsrange=2k-16k
>
> So I now also fully understand that ioengine=sync just refers to the 
> synchronous nature of the system calls used, not on whether the I/Os are 
> issued synchronously via sync=1 or by circumventing the page cache via 
> direct=1
>
> Attached are results that bring down IOPS on read drastically! I first let 
> sequentiell.job write out the complete 2 gb with random data and then ran 
> the iops.job.

If you want to measure the maximum iops, then you should consider
driving iodepths > 1.  Assuming you are testing a sata ssd, try using a
depth of 64 (twice the NCQ depth).
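
E.g. something along these lines for the random read phase (untested, just 
to illustrate the suggestion):

[zufälliglesen]
ioengine=libaio
direct=1
# roughly twice the NCQ depth of 32
iodepth=64
rw=randread
runtime=60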

Cheers,
Jeff


* Re: Measuring IOPS
  2011-08-03 20:22   ` Jeff Moyer
@ 2011-08-03 20:33     ` Martin Steigerwald
  2011-08-04  7:50       ` Jens Axboe
  2011-08-03 20:42     ` Martin Steigerwald
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-03 20:33 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: fio

On Wednesday, 3 August 2011, Jeff Moyer wrote:
> Martin Steigerwald <Martin@lichtvoll.de> writes:
> > - ioengine=libaio
> > - direct=1
> > - and then due to direct I/O alignment requirement: bsrange=2k-16k
> > 
> > So I now also fully understand that ioengine=sync just refers to the
> > synchronous nature of the system calls used, not on whether the I/Os
> > are issued synchronously via sync=1 or by circumventing the page
> > cache via direct=1
> > 
> > Attached are results that bring down IOPS on read drastically! I
> > first let sequentiell.job write out the complete 2 gb with random
> > data and then ran the iops.job.
> 
> If you want to measure the maximum iops, then you should consider
> driving iodepths > 1.  Assuming you are testing a sata ssd, try using a
> depth of 64 (twice the NCQ depth).

Yes, I thought about that too, but then also read about the 
"recommendation" to use an iodepth of one in a post here:

http://www.spinics.net/lists/fio/msg00502.html

What will be used in regular workloads - say a Linux desktop on an SSD here? 
I would bet that Linux uses what it can get. What about server workloads 
like mail processing on SAS disks or a fileserver on SATA disks and such 
like?


Twice of

merkaba:~> hdparm -I /dev/sda | grep -i queue
        Queue depth: 32
           *    Native Command Queueing (NCQ)

?

Why twice?

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Measuring IOPS
  2011-08-03 20:22   ` Jeff Moyer
  2011-08-03 20:33     ` Martin Steigerwald
@ 2011-08-03 20:42     ` Martin Steigerwald
  2011-08-03 20:50       ` Martin Steigerwald
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-03 20:42 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: fio

On Wednesday, 3 August 2011, you wrote:
> Martin Steigerwald <Martin@lichtvoll.de> writes:
> > - ioengine=libaio
> > - direct=1
> > - and then due to direct I/O alignment requirement: bsrange=2k-16k
> > 
> > So I now also fully understand that ioengine=sync just refers to the
> > synchronous nature of the system calls used, not on whether the I/Os
> > are issued synchronously via sync=1 or by circumventing the page
> > cache via direct=1
> > 
> > Attached are results that bring down IOPS on read drastically! I
> > first let sequentiell.job write out the complete 2 gb with random
> > data and then ran the iops.job.
> 
> If you want to measure the maximum iops, then you should consider
> driving iodepths > 1.  Assuming you are testing a sata ssd, try using a
> depth of 64 (twice the NCQ depth).

And additionally?

Does using iodepth > 1 need ioengine=libaio? Let's see the manpage:

       iodepth=int
              Number  of I/O units to keep in flight against the
              file. Note that increasing iodepth beyond  1  will
              not affect synchronous ioengines (except for small
              degress when verify_async is in use).  Even  async
              engines  my  impose  OS  restrictions  causing the
              desired depth not to be achieved.  This may happen
              on   Linux  when  using  libaio  and  not  setting
              direct=1, since buffered IO is not async  on  that
              OS.  Keep  an  eye on the IO depth distribution in
              the fio output to verify that the  achieved  depth
              is as expected. Default: 1.

Okay, yes, it does. I am starting to get the hang of it. It's a bit puzzling 
to have two concepts of synchronous I/O around:

1) synchronous system call interfaces aka fio I/O engine

2) synchronous I/O requests aka O_SYNC

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Measuring IOPS
  2011-08-03 20:42     ` Martin Steigerwald
@ 2011-08-03 20:50       ` Martin Steigerwald
  2011-08-04  8:51         ` Martin Steigerwald
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-03 20:50 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: fio

On Wednesday, 3 August 2011, Martin Steigerwald wrote:
> On Wednesday, 3 August 2011, you wrote:
> > Martin Steigerwald <Martin@lichtvoll.de> writes:
[...]
> Does using iodepth > 1 need ioengine=libaio? Let's see the manpage:
> 
>        iodepth=int
>               Number  of I/O units to keep in flight against the
>               file. Note that increasing iodepth beyond  1  will
>               not affect synchronous ioengines (except for small
>               degress when verify_async is in use).  Even  async
>               engines  my  impose  OS  restrictions  causing the
>               desired depth not to be achieved.  This may happen
>               on   Linux  when  using  libaio  and  not  setting
>               direct=1, since buffered IO is not async  on  that
>               OS.  Keep  an  eye on the IO depth distribution in
>               the fio output to verify that the  achieved  depth
>               is as expected. Default: 1.
> 
> Okay, yes, it does. I am starting to get the hang of it. It's a bit puzzling to
> have two concepts of synchronous I/O around:
> 
> 1) synchronous system call interfaces aka fio I/O engine
> 
> 2) synchronous I/O requests aka O_SYNC

But isn't this a case for iodepth=1, if buffered I/O on Linux is 
synchronous? I bet most regular applications except some databases use 
buffered I/O.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Measuring IOPS
  2011-08-03 20:33     ` Martin Steigerwald
@ 2011-08-04  7:50       ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2011-08-04  7:50 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Jeff Moyer, fio

On 2011-08-03 22:33, Martin Steigerwald wrote:
> Am Mittwoch, 3. August 2011 schrieb Jeff Moyer:
>> Martin Steigerwald <Martin@lichtvoll.de> writes:
>>> - ioengine=libaio
>>> - direct=1
>>> - and then due to direct I/O alignment requirement: bsrange=2k-16k
>>>
>>> So I now also fully understand that ioengine=sync just refers to the
>>> synchronous nature of the system calls used, not on whether the I/Os
>>> are issued synchronously via sync=1 or by circumventing the page
>>> cache via direct=1
>>>
>>> Attached are results that bring down IOPS on read drastically! I
>>> first let sequentiell.job write out the complete 2 gb with random
>>> data and then ran the iops.job.
>>
>> If you want to measure the maximum iops, then you should consider
>> driving iodepths > 1.  Assuming you are testing a sata ssd, try using a
>> depth of 64 (twice the NCQ depth).
> 
> Yes, I thought about that too, but then also read about the 
> "recommendation" to use an iodepth of one in a post here:
> 
> http://www.spinics.net/lists/fio/msg00502.html
> 
> What will be used in regular workloads - say Linux desktop on an SSD here? 
> I would bet that Linux uses what it can get? What about server workloads 
> like mail processing on SAS disks or fileserver on SATA disks and such 
> like?
> 
> 
> Twice of
> 
> merkaba:~> hdparm -I /dev/sda | grep -i queue
>         Queue depth: 32
>            *    Native Command Queueing (NCQ)
> 
> ?
> 
> Why twice?

Twice is a good rule of thumb, since it both gives the drive some
freedom for scheduling to reduce rotational latencies and allows the OS
to work on a larger range of requests. This is beneficial mostly for
merging of sequential requests, but also for scheduling purposes.

So at least depth + a_few, 2*depth is a good default.
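
As an illustration (a sketch only; block size, file name, size and runtime
are placeholder values), a job along those lines could look like:

[maxiops-randread]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=64
size=2g
filename=testfile
runtime=60

The "IO depths" distribution in the output then shows whether the depth
of 64 was actually reached.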

-- 
Jens Axboe



* Re: Measuring IOPS
  2011-08-03 20:50       ` Martin Steigerwald
@ 2011-08-04  8:51         ` Martin Steigerwald
  2011-08-04  8:58           ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-04  8:51 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: fio, Jens Axboe

Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
> > Am Mittwoch, 3. August 2011 schrieben Sie:
> > > Martin Steigerwald <Martin@lichtvoll.de> writes:
> [...]
> 
> > Does using iodepth > 1 need ioengine=libaio? Let´s see the manpage:
> >        iodepth=int
> >        
> >               Number  of I/O units to keep in flight against the
> >               file. Note that increasing iodepth beyond  1  will
> >               not affect synchronous ioengines (except for small
> >               degress when verify_async is in use).  Even  async
> >               engines  my  impose  OS  restrictions  causing the
> >               desired depth not to be achieved.  This may happen
> >               on   Linux  when  using  libaio  and  not  setting
> >               direct=1, since buffered IO is not async  on  that
> >               OS.  Keep  an  eye on the IO depth distribution in
> >               the fio output to verify that the  achieved  depth
> >               is as expected. Default: 1.
> > 
> > Okay, yes, it does. I start getting a hang on it. Its a bit puzzling
> > to have two concepts of synchronous I/O around:
> > 
> > 1) synchronous system call interfaces aka fio I/O engine
> > 
> > 2) synchronous I/O requests aka O_SYNC
> 
> But isn´t this a case for iodepth=1 if buffered I/O on Linux is
> synchronous? I bet most regular applications except some databases use
> buffered I/O.

Thanks a lot for your answers, Jens, Jeff, DongJin.

Now what about the above one?

In what cases is iodepth > 1 relevant when Linux buffered I/O is 
synchronous? For multiple threads or processes?

One process / thread can only submit one I/O at a time with synchronous 
system call I/O, but the function returns when the stuff is in the page 
cache. So first why can´t Linux use iodepth > 1 when there is lots of stuff 
in the page cache to be written out? That should help the single process 
case.

In the multiple process/thread case Linux gets several I/O requests from 
multiple processes/threads, and thus iodepth > 1 does make sense?

Maybe it helps to get clear where in the stack iodepth is located. Is 
it

process / thread
systemcall
pagecache
blocklayer
iodepth
device driver
device

? If so, why can´t Linux make use of iodepth > 1 with synchronous 
system call I/O? Or is it further up, on the system call level? But then 
what sense would it make there, when using system calls that are 
asynchronous already?
(Is that ordering above correct at all?)

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS
  2011-08-04  8:51         ` Martin Steigerwald
@ 2011-08-04  8:58           ` Jens Axboe
  2011-08-04  9:34             ` Martin Steigerwald
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2011-08-04  8:58 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Jeff Moyer, fio

On 2011-08-04 10:51, Martin Steigerwald wrote:
> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
>> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
>>> Am Mittwoch, 3. August 2011 schrieben Sie:
>>>> Martin Steigerwald <Martin@lichtvoll.de> writes:
>> [...]
>>
>>> Does using iodepth > 1 need ioengine=libaio? Let´s see the manpage:
>>>        iodepth=int
>>>        
>>>               Number  of I/O units to keep in flight against the
>>>               file. Note that increasing iodepth beyond  1  will
>>>               not affect synchronous ioengines (except for small
>>>               degress when verify_async is in use).  Even  async
>>>               engines  my  impose  OS  restrictions  causing the
>>>               desired depth not to be achieved.  This may happen
>>>               on   Linux  when  using  libaio  and  not  setting
>>>               direct=1, since buffered IO is not async  on  that
>>>               OS.  Keep  an  eye on the IO depth distribution in
>>>               the fio output to verify that the  achieved  depth
>>>               is as expected. Default: 1.
>>>
>>> Okay, yes, it does. I start getting a hang on it. Its a bit puzzling
>>> to have two concepts of synchronous I/O around:
>>>
>>> 1) synchronous system call interfaces aka fio I/O engine
>>>
>>> 2) synchronous I/O requests aka O_SYNC
>>
>> But isn´t this a case for iodepth=1 if buffered I/O on Linux is
>> synchronous? I bet most regular applications except some databases use
>> buffered I/O.
> 
> Thanks a lot for your answers, Jens, Jeff, DongJin.
> 
> Now what about the above one?
> 
> In what cases is iodepth > 1 relevant, when Linux buffered I/O is 
> synchronous? For mutiple threads or processes?

iodepth controls what depth fio operates at, not the OS. You are right
in that with iodepth=1, for buffered writes you could be seeing a much
higher depth on the device side.

So think of iodepth as how many IO units fio can have in flight, nothing
else.

> One process / thread can only submit one I/O at a time with synchronous 
> system call I/O, but the function returns when the stuff is in the page 
> cache. So first why can´t Linux use iodepth > 1 when there is lots of stuff 
> in the page cache to be written out? That should help the single process 
> case.

Since the IO unit is done when the system call returns, you can never
have more than the one in flight for a sync engine. So iodepth > 1 makes
no sense for a sync engine.

> On the mutiple process/threadsa case Linux gets several I/O requests from 
> mutiple processes/threads and thus iodepth > 1 does make sense?

No.

> Maybe it helps getting clear where in the stack iodepth is located at, is 
> it
> 
> process / thread
> systemcall
> pagecache
> blocklayer
> iodepth
> device driver
> device
> 
> ? If so, why can´t Linux  not make use of iodepth > 1 with synchronous 
> system call I/O? Or is it further up on the system call level? But then 

Because it is sync. The very nature of the sync system calls is that
submission and completion are one event. For libaio, you could submit a
bunch of requests before retrieving or waiting for completion of any one
of them.

The only example where a sync engine could drive a higher queue depth on
the device side is buffered writes. For any other case (reads, direct
writes), you need async submission to build up a higher queue depth.

> what sense would it make there, when using system calls that are 
> asynchronous already?
> (Is that ordering above correct at all?)

Your ordering looks OK. Now consider where and how you end up waiting
for issued IO, that should tell you where queue depth could build up or
not.
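
One way to watch this in practice (just a sketch; the device name sda and
the exact iostat column name are assumptions that may differ per system)
is to compare fio's "IO depths" distribution on the application side with
the device queue while a job runs:

  # application side: achieved depth, printed by fio per job
  #   IO depths    : 1=..., 2=..., 4=..., 8=..., 16=..., 32=..., >=64=...
  # device side: average queue size of the block device
  iostat -x sda 1    # look at the avgqu-sz column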

-- 
Jens Axboe



* Re: Measuring IOPS
  2011-08-04  8:58           ` Jens Axboe
@ 2011-08-04  9:34             ` Martin Steigerwald
  2011-08-04 10:02               ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-04  9:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, fio

Am Donnerstag, 4. August 2011 schrieb Jens Axboe:
> On 2011-08-04 10:51, Martin Steigerwald wrote:
> > Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
> >> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
> >>> Am Mittwoch, 3. August 2011 schrieben Sie:
> >>>> Martin Steigerwald <Martin@lichtvoll.de> writes:
> >> [...]
> >> 
> >>> Does using iodepth > 1 need ioengine=libaio? Let´s see the manpage:
> >>>        iodepth=int
> >>>        
> >>>               Number  of I/O units to keep in flight against the
> >>>               file. Note that increasing iodepth beyond  1  will
> >>>               not affect synchronous ioengines (except for small
> >>>               degress when verify_async is in use).  Even  async
> >>>               engines  my  impose  OS  restrictions  causing the
> >>>               desired depth not to be achieved.  This may happen
> >>>               on   Linux  when  using  libaio  and  not  setting
> >>>               direct=1, since buffered IO is not async  on  that
> >>>               OS.  Keep  an  eye on the IO depth distribution in
> >>>               the fio output to verify that the  achieved  depth
> >>>               is as expected. Default: 1.
> >>> 
> >>> Okay, yes, it does. I start getting a hang on it. Its a bit
> >>> puzzling to have two concepts of synchronous I/O around:
> >>> 
> >>> 1) synchronous system call interfaces aka fio I/O engine
> >>> 
> >>> 2) synchronous I/O requests aka O_SYNC
> >> 
> >> But isn´t this a case for iodepth=1 if buffered I/O on Linux is
> >> synchronous? I bet most regular applications except some databases
> >> use buffered I/O.
> > 
> > Thanks a lot for your answers, Jens, Jeff, DongJin.
> > 
> > Now what about the above one?
> > 
> > In what cases is iodepth > 1 relevant, when Linux buffered I/O is
> > synchronous? For mutiple threads or processes?
> 
> iodepth controls what depth fio operates at, not the OS. You are right
> in that with iodepth=1, for buffered writes you could be seeing a much
> higher depth on the device side.
> 
> So think of iodepth as how many IO units fio can have in flight,
> nothing else.

Ah okay. So when using iodepth=64 and ioengine=libaio, fio issues 64 I/O 
requests at once before it bothers waiting for any of them to complete. 
And as the block layer completes I/O requests, fio fills the queue back 
up to 64. Right?

Now when I do have two jobs running at once and iodepth=64, will each 
process submit 64 I/O requests before waiting thus having at most 128 I/O 
requests in flight? Or will each process use 32 I/O requests? My bet is 
that iodepth is per job, per process.

> > One process / thread can only submit one I/O at a time with
> > synchronous system call I/O, but the function returns when the stuff
> > is in the page cache. So first why can´t Linux use iodepth > 1 when
> > there is lots of stuff in the page cache to be written out? That
> > should help the single process case.
> 
> Since the IO unit is done when the system call returns, you can never
> have more than the one in flight for a sync engine. So iodepth > 1
> makes no sense for a sync engine.

Makes perfect sense now that I understand that the iodepth option relates 
to what the fio processes do.

> > On the mutiple process/threadsa case Linux gets several I/O requests
> > from mutiple processes/threads and thus iodepth > 1 does make sense?
> 
> No.

Since each fio job doing synchronous system call I/O still submits one I/O 
at a time...

> > Maybe it helps getting clear where in the stack iodepth is located
> > at, is it
> > 
> > process / thread
> > systemcall
> > pagecache
> > blocklayer
> > iodepth
> > device driver
> > device
> > 
> > ? If so, why can´t Linux  not make use of iodepth > 1 with
> > synchronous system call I/O? Or is it further up on the system call
> > level? But then
> 
> Because it is sync. The very nature of the sync system calls is that
> submission and completion are one event. For libaio, you could submit a
> bunch of requests before retrieving or waiting for completion of any
> one of them.
> 
> The only example where a sync engine could drive a higher queue depth
> on the device side is buffered writes. For any other case (reads,
> direct writes), you need async submission to build up a higher queue
> depth.

Great! I think that makes it pretty clear.

Thus when I want to read subsequent blocks 1, 2, 3, 4, 5, 6, 7, 8, 9 and 
10 from a file at once and only then wait, I need async I/O. Blocks might 
be of arbitrary size.

What if I use 10 processes, each reading one of these blocks at once? 
Couldn´t this fill up the queue at the device level? But then different 
processes usually read different files...

... my question hints at how I/O depths might accumulate at the device 
level, when several processes are issuing read and/or write requests at 
once.

> > what sense would it make there, when using system calls that are
> > asynchronous already?
> > (Is that ordering above correct at all?)
> 
> Your ordering looks OK. Now consider where and how you end up waiting
> for issued IO, that should tell you where queue depth could build up or
> not.

So we have several levels of queue depth.

- queue depth at the system call level 
- queue depth at device level

=== sync I/O engines ===
queue depth at the system call level = 1

== reads ==
queue depth at the device level = 1
since read() returns when the data is in RAM and thus is synchronous I/O 
on the lower level by nature

page cache will be used unless direct=1, so one might be measuring RAM / 
read ahead performance, especially when several read jobs are running 
concurrently. 

writes might not hit the device unless direct=1, and thus one should use a 
file size larger than RAM.

== writes ==
queue depth at the device level = depending on the workload, up to what 
the device supports

unless direct=1, because then write() is doing synchronous I/O on the 
lower level and only returns when the data is at least in the drive cache


=== libaio ===
queue depth at the system call level = iodepth option of fio

as long as direct=1, since libaio falls back to synchronous system calls 
with buffered writes

queue depth at the device level = same

fio submits as many I/Os as specified by iodepth and only then waits. As 
the block layer completes I/Os, fio fills up the queue.


conclusion:

Thus when I want to measure higher I/O depths for reads I need libaio and 
direct=1. But then I am measuring something that does not have any 
practical effect on processes that use synchronous system call I/O.

So for regular applications ioengine=sync + iodepth=64 gives more 
realistic results - even when it's then just I/O depth 1 for reads - and 
for databases that use direct I/O, ioengine=libaio makes sense and will 
cause higher I/O depths on the device side if the device supports it.

Anything without direct=1 (or the slower sync=1) is potentially measuring 
RAM performance. direct=1 bypasses the page cache. sync=1 basically 
disables caching on the device / controller side as well.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: Measuring IOPS
  2011-08-04  9:34             ` Martin Steigerwald
@ 2011-08-04 10:02               ` Jens Axboe
  2011-08-04 10:23                 ` Martin Steigerwald
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2011-08-04 10:02 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Jeff Moyer, fio

On 2011-08-04 11:34, Martin Steigerwald wrote:
> Am Donnerstag, 4. August 2011 schrieb Jens Axboe:
>> On 2011-08-04 10:51, Martin Steigerwald wrote:
>>> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
>>>> Am Mittwoch, 3. August 2011 schrieb Martin Steigerwald:
>>>>> Am Mittwoch, 3. August 2011 schrieben Sie:
>>>>>> Martin Steigerwald <Martin@lichtvoll.de> writes:
>>>> [...]
>>>>
>>>>> Does using iodepth > 1 need ioengine=libaio? Let´s see the manpage:
>>>>>        iodepth=int
>>>>>        
>>>>>               Number  of I/O units to keep in flight against the
>>>>>               file. Note that increasing iodepth beyond  1  will
>>>>>               not affect synchronous ioengines (except for small
>>>>>               degress when verify_async is in use).  Even  async
>>>>>               engines  my  impose  OS  restrictions  causing the
>>>>>               desired depth not to be achieved.  This may happen
>>>>>               on   Linux  when  using  libaio  and  not  setting
>>>>>               direct=1, since buffered IO is not async  on  that
>>>>>               OS.  Keep  an  eye on the IO depth distribution in
>>>>>               the fio output to verify that the  achieved  depth
>>>>>               is as expected. Default: 1.
>>>>>
>>>>> Okay, yes, it does. I start getting a hang on it. Its a bit
>>>>> puzzling to have two concepts of synchronous I/O around:
>>>>>
>>>>> 1) synchronous system call interfaces aka fio I/O engine
>>>>>
>>>>> 2) synchronous I/O requests aka O_SYNC
>>>>
>>>> But isn´t this a case for iodepth=1 if buffered I/O on Linux is
>>>> synchronous? I bet most regular applications except some databases
>>>> use buffered I/O.
>>>
>>> Thanks a lot for your answers, Jens, Jeff, DongJin.
>>>
>>> Now what about the above one?
>>>
>>> In what cases is iodepth > 1 relevant, when Linux buffered I/O is
>>> synchronous? For mutiple threads or processes?
>>
>> iodepth controls what depth fio operates at, not the OS. You are right
>> in that with iodepth=1, for buffered writes you could be seeing a much
>> higher depth on the device side.
>>
>> So think of iodepth as how many IO units fio can have in flight,
>> nothing else.
> 
> Ah okay. So when using iodepth=64 and ioengine=libaio with fio then fio 
> issues 64 I/O requests at once before it bothers waiting for I/O requests 
> to complete. And as the block layer completes I/O requests fio fills up the 
> 64 I/O requests queue. Right?

Not quite right, the iodepth=64 will mean that fio can have 64
_pending_, not that it necessarily submits or retrieves that many at a
time. The latter two are controlled by the iodepth_batch (and
iodepth_batch_*) settings.
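
A job sketch using those knobs (values are arbitrary examples;
iodepth_batch_complete is assumed here to be one of the iodepth_batch_*
options mentioned above):

[batched-aio]
ioengine=libaio
direct=1
rw=randread
bs=4k
size=2g
filename=testfile
# up to 64 I/O units pending in fio
iodepth=64
# submit at most 16 of them in one go
iodepth_batch=16
# retrieve completions in batches rather than one by one
iodepth_batch_complete=8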

> Now when I do have two jobs running at once and iodepth=64, will each 
> process submit 64 I/O requests before waiting thus having at most 128 I/O 
> requests in flight? Or will each process use 32 I/O requests? My bet is 
> that iodepth is per job, per process.

iodepth is per job/process/thread. So each will have 64 requests.

>>> One process / thread can only submit one I/O at a time with
>>> synchronous system call I/O, but the function returns when the stuff
>>> is in the page cache. So first why can´t Linux use iodepth > 1 when
>>> there is lots of stuff in the page cache to be written out? That
>>> should help the single process case.
>>
>> Since the IO unit is done when the system call returns, you can never
>> have more than the one in flight for a sync engine. So iodepth > 1
>> makes no sense for a sync engine.
> 
> Makes perfect sense then I understand that iodepth option related to what 
> the fio processes do.
> 
>>> On the mutiple process/threadsa case Linux gets several I/O requests
>>> from mutiple processes/threads and thus iodepth > 1 does make sense?
>>
>> No.
> 
> Since each synchronous system call I/O fio job still submits one I/O at a 
> time...

Because each sync system call returns with the IO completed already, not
just queued for completion.

>>> Maybe it helps getting clear where in the stack iodepth is located
>>> at, is it
>>>
>>> process / thread
>>> systemcall
>>> pagecache
>>> blocklayer
>>> iodepth
>>> device driver
>>> device
>>>
>>> ? If so, why can´t Linux  not make use of iodepth > 1 with
>>> synchronous system call I/O? Or is it further up on the system call
>>> level? But then
>>
>> Because it is sync. The very nature of the sync system calls is that
>> submission and completion are one event. For libaio, you could submit a
>> bunch of requests before retrieving or waiting for completion of any
>> one of them.
>>
>> The only example where a sync engine could drive a higher queue depth
>> on the device side is buffered writes. For any other case (reads,
>> direct writes), you need async submission to build up a higher queue
>> depth.
> 
> Great! I think that makes it pretty clear.
> 
> Thus when I want to read subsequent blocks 1, 2, 3, 4, 5, 6, 7, 8, 9 and 
> 10 from a file at once and then wait I need async I/O.  Block might be of 
> arbitrary size.
> 
> What when I use 10 processes, each reading one of these blocks as once? 
> Couldn´t this fill up the queue at the device level? But then different 
> processes usually read different files...

Yes, you could get the same IO on the device side with just more
processes instead of using async IO. It would not be as efficient,
though.
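
An illustrative pair of invocations (a sketch; file name, size and runtime
are placeholders) driving a similar nominal depth either way:

# eight synchronous readers
fio --name=syncreaders --ioengine=sync --direct=1 --rw=randread --bs=4k \
    --size=1g --filename=testfile --numjobs=8 --group_reporting --runtime=60

# one asynchronous reader with the same nominal depth
fio --name=aioreader --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --size=1g --filename=testfile --iodepth=8 --runtime=60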

> ... my question hints at how I/O depths might accumulate at the device 
> level, when several processes are issuing read and/or write requests at 
> once.

Various things can impact that, ultimately the IO scheduler decides when
to dispatch more requests to the driver.

>>> what sense would it make there, when using system calls that are
>>> asynchronous already?
>>> (Is that ordering above correct at all?)
>>
>> Your ordering looks OK. Now consider where and how you end up waiting
>> for issued IO, that should tell you where queue depth could build up or
>> not.
> 
> So we have several levels of queue depth.
> 
> - queue depth at the system call level 
> - queue depth at device level

Not sure I like the 'system call level' title, but yes. Let's call it
application and device level.

> === sync I/O engines ===
> queue depth at the system call level = 1
> 
> == reads ==
> queue depth at the device level = 1
> since read() returns when the data is in RAM and thus is synchronous I/O 
> on the lower level by nature
> 
> page cache will be used unless direct=1, so one might be measuring RAM / 
> read ahead performance, especially when several read jobs are running 
> concurrently. 
> 
> writes might not hit the device unless direct=1 and thus one should use 
> larger than RAM file size.
> 
> == writes ==
> queue depth at the device level = depending on the workload upto what the 
> device supports
> 
> unless direct=1, cause then write() is doing synchronous I/O on the lower 
> level and only returns when data is at least in drive cache

Correct, or unless O_SYNC is used.

> === libaio ===
> queue depth at the system call level = iodepth option of fio
> 
> as long as direct=1, since libaio falls back to synchronous system calls 
> with buffered writes
> 
> queue depth at the device level = same

Not necessarily the same, up to the same.

> fio submits as much I/Os as specified by iodepth and only then waits. As the 
> block layer completes I/Os fio fills up the queue.

That's not true, see earlier comment on what controls how many IOs are
submitted in one go and completed in one go.

> conclusion:
> 
> thus when I want to measure higher I/O depths at read I need libaio and 
> direct=1. but then I am measuring something that does not have any 
> practical effect on processes that use synchronous system call I/O.
> 
> so for regular applications ioengine=sync + iodepth=64 gives more 
> realistic results - even when its then just I/O depth 1 for reads - and 
> for databases that use direct I/O ioengine=libaio makes sense and will 
> cause higher I/O depths on the device side if it supports it.

iodepth > 1 makes no sense for sync engines...

> anything without direct=1 (or the slower sync=1) is potentially measuring 
> RAM performance. direct=1 omits the page cache. sync=1 basically disables 
> caching on the device / controller side as well.

Not quite measuring RAM (or copy) performance, at some point fio will be
blocked by the OS and prevented from dirtying more memory. At that point
it'll either just wait, or participate in flushing out dirty data. For
any buffered write workload, it'll quickly degenerate into that.

-- 
Jens Axboe



* Re: Measuring IOPS
  2011-08-04 10:02               ` Jens Axboe
@ 2011-08-04 10:23                 ` Martin Steigerwald
  2011-08-05  7:28                   ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Steigerwald @ 2011-08-04 10:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, fio

[-- Attachment #1: Type: Text/Plain, Size: 3798 bytes --]

Am Donnerstag, 4. August 2011 schrieb Jens Axboe:
> On 2011-08-04 11:34, Martin Steigerwald wrote:
> > Am Donnerstag, 4. August 2011 schrieb Jens Axboe:
> >> On 2011-08-04 10:51, Martin Steigerwald wrote:
[...]
> >> The only example where a sync engine could drive a higher queue
> >> depth on the device side is buffered writes. For any other case
> >> (reads, direct writes), you need async submission to build up a
> >> higher queue depth.
> > 
> > Great! I think that makes it pretty clear.
> > 
> > Thus when I want to read subsequent blocks 1, 2, 3, 4, 5, 6, 7, 8, 9
> > and 10 from a file at once and then wait I need async I/O.  Block
> > might be of arbitrary size.
> > 
> > What when I use 10 processes, each reading one of these blocks as
> > once? Couldn´t this fill up the queue at the device level? But then
> > different processes usually read different files...
> 
> Yes, you could get the same IO on the device side with just more
> processes instead of using async IO. It would not be as efficient,
> though.

Just tried this. See attached file.

> > ... my question hints at how I/O depths might accumulate at the
> > device level, when several processes are issuing read and/or write
> > requests at once.
> 
> Various things can impact that, ultimately the IO scheduler decides
> when to dispatch more requests to the driver.

Okay. I think "not as efficient as asynchronous I/O on the application 
level" will do it for now ;).

> > conclusion:
> > 
> > thus when I want to measure higher I/O depths at read I need libaio
> > and direct=1. but then I am measuring something that does not have
> > any practical effect on processes that use synchronous system call
> > I/O.
> > 
> > so for regular applications ioengine=sync + iodepth=64 gives more
> > realistic results - even when its then just I/O depth 1 for reads -
> > and for databases that use direct I/O ioengine=libaio makes sense
> > and will cause higher I/O depths on the device side if it supports
> > it.
> 
> iodepth > 1 makes no sense for sync engines...

I mixed it up again, sorry.

Yes, so a regular application using sync, which implies iodepth=1, might 
still give me a higher iodepth at the device level for buffered writes. 
Beyond that, a higher iodepth on the device side is only possible by 
having more than one process with the sync engine running at the same 
time, but this would not be as efficient as asynchronous I/O.

> > anything without direct=1 (or the slower sync=1) is potentially
> > measuring RAM performance. direct=1 omits the page cache. sync=1
> > basically disables caching on the device / controller side as well.
> 
> Not quite measuring RAM (or copy) performance, at some point fio will
> be blocked by the OS and prevented from dirtying more memory. At that
> point it'll either just wait, or participate in flushing out dirty
> data. For any buffered write workload, it'll quickly de-generate into
> that.

Which depends on the size of the job, because I bet for 1 GB/s with 250000 
IOPS I would need some PCI Express based SSD solution - a SATA-300 SSD like 
the Intel SSD 320 in use here can´t reach this (see attached file). It 
seems with 8 GB of RAM I need to write more than one GB in order to get 
meaningful results (related to raw SSD performance). With Ext4 delayed 
allocation a subsequent rm might even cause the file to not be written at 
all.

For the application side of things it can make perfect sense to measure 
buffered writes. But one should go with a large enough data set in order to 
get meaningful results. At least when the application uses a large dataset 
too ;).

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: parallel-lesen.txt --]
[-- Type: text/plain, Size: 5110 bytes --]

martin@merkaba:~/Zeit> fio --name massiveparallelreads --ioengine=sync --direct=1 --rw=randread --size=1g --filename=testfile --group_reporting --numjobs=1 --runtime=60
massiveparallelreads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [17316K/0K /s] [4227 /0  iops] [eta 00m:00s]
massiveparallelreads: (groupid=0, jobs=1): err= 0: pid=19234
  read : io=987.22MB, bw=16848KB/s, iops=4212 , runt= 60001msec
    clat (usec): min=170 , max=8073 , avg=225.79, stdev=66.71
     lat (usec): min=170 , max=8073 , avg=226.21, stdev=66.73
    bw (KB/s) : min=15992, max=17248, per=100.02%, avg=16852.15, stdev=184.38
  cpu          : usr=5.29%, sys=11.79%, ctx=254656, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=252725/0/0, short=0/0/0
     lat (usec): 250=88.02%, 500=11.85%, 750=0.04%, 1000=0.02%
     lat (msec): 2=0.04%, 4=0.03%, 10=0.01%

Run status group 0 (all jobs):
   READ: io=987.22MB, aggrb=16848KB/s, minb=17252KB/s, maxb=17252KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  dm-2: ios=252641/34, merge=0/0, ticks=47496/0, in_queue=47492, util=79.16%, aggrios=252725/26, aggrmerge=0/15, aggrticks=47152/8, aggrin_queue=46888, aggrutil=78.09%
    sda: ios=252725/26, merge=0/15, ticks=47152/8, in_queue=46888, util=78.09%
martin@merkaba:~/Zeit> fio --name massiveparallelreads --ioengine=sync --direct=1 --rw=randread --size=1g --filename=testfile --group_reporting --numjobs=8 --runtime=60 
massiveparallelreads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
...
massiveparallelreads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 8 processes
Jobs: 8 (f=8): [rrrrrrrr] [100.0% done] [101.9M/0K /s] [25.5K/0  iops] [eta 00m:00s]
massiveparallelreads: (groupid=0, jobs=8): err= 0: pid=19237
  read : io=5855.1MB, bw=99925KB/s, iops=24981 , runt= 60001msec
    clat (usec): min=171 , max=83857 , avg=313.75, stdev=73.37
     lat (usec): min=171 , max=83858 , avg=314.15, stdev=73.36
    bw (KB/s) : min=10008, max=13504, per=12.49%, avg=12485.41, stdev=94.90
  cpu          : usr=1.81%, sys=10.25%, ctx=1514639, majf=0, minf=214
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=1498905/0/0, short=0/0/0
     lat (usec): 250=16.99%, 500=81.51%, 750=1.28%, 1000=0.09%
     lat (msec): 2=0.09%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%

Run status group 0 (all jobs):
   READ: io=5855.1MB, aggrb=99925KB/s, minb=102323KB/s, maxb=102323KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  dm-2: ios=1497855/17, merge=0/0, ticks=405216/448, in_queue=405812, util=99.81%, aggrios=1498905/30, aggrmerge=0/11, aggrticks=405948/468, aggrin_queue=405140, aggrutil=99.77%
    sda: ios=1498905/30, merge=0/11, ticks=405948/468, in_queue=405140, util=99.77%
martin@merkaba:~/Zeit> fio --name massiveparallelreads --ioengine=sync --direct=1 --rw=randread --size=1g --filename=testfile --group_reporting --numjobs=32 --runtime=60 
massiveparallelreads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
...
massiveparallelreads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 32 processes
Jobs: 32 (f=32): [rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr] [100.0% done] [157.1M/0K /s] [39.5K/0  iops] [eta 00m:00s]
massiveparallelreads: (groupid=0, jobs=32): err= 0: pid=19583
  read : io=9199.6MB, bw=157003KB/s, iops=39250 , runt= 60001msec
    clat (usec): min=173 , max=1016.5K, avg=818.37, stdev=326.10
     lat (usec): min=174 , max=1016.5K, avg=818.60, stdev=326.10
    bw (KB/s) : min=    3, max= 9464, per=3.11%, avg=4885.51, stdev=82.96
  cpu          : usr=0.51%, sys=2.77%, ctx=2399581, majf=0, minf=878
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=2355087/0/0, short=0/0/0
     lat (usec): 250=0.03%, 500=0.97%, 750=33.84%, 1000=59.32%
     lat (msec): 2=5.79%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
     lat (msec): 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%, 2000=0.01%

Run status group 0 (all jobs):
   READ: io=9199.6MB, aggrb=157003KB/s, minb=160771KB/s, maxb=160771KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  dm-2: ios=2352589/34, merge=0/0, ticks=1820480/101988, in_queue=1924564, util=99.91%, aggrios=2355082/52, aggrmerge=0/24, aggrticks=1822620/98852, aggrin_queue=1920740, aggrutil=99.87%
    sda: ios=2355082/52, merge=0/24, ticks=1822620/98852, in_queue=1920740, util=99.87%
martin@merkaba:~/Zeit> 

[-- Attachment #3: cachedwrite.txt --]
[-- Type: text/plain, Size: 7381 bytes --]

martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=100   
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process
cachedwrite: Laying out IO file(s) (1 file(s) / 0MB)


Run status group 0 (all jobs):

Disk stats (read/write):
  dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=100m
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process
cachedwrite: Laying out IO file(s) (1 file(s) / 100MB)

cachedwrite: (groupid=0, jobs=1): err= 0: pid=20543
  write: io=102400KB, bw=1000.0MB/s, iops=256000 , runt=   100msec
    clat (usec): min=2 , max=75 , avg= 3.33, stdev= 1.44
     lat (usec): min=2 , max=76 , avg= 3.42, stdev= 1.53
  cpu          : usr=20.20%, sys=72.73%, ctx=9, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/25600/0, short=0/0/0
     lat (usec): 4=82.50%, 10=15.28%, 20=2.21%, 50=0.01%, 100=0.01%


Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=1000.0MB/s, minb=1024.0MB/s, maxb=1024.0MB/s, mint=100msec, maxt=100msec

Disk stats (read/write):
  dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=100m
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process

cachedwrite: (groupid=0, jobs=1): err= 0: pid=20552
  write: io=102400KB, bw=966038KB/s, iops=241509 , runt=   106msec
    clat (usec): min=2 , max=36 , avg= 2.88, stdev= 0.74
     lat (usec): min=2 , max=36 , avg= 2.94, stdev= 0.72
  cpu          : usr=15.24%, sys=80.00%, ctx=11, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/25600/0, short=0/0/0
     lat (usec): 4=92.35%, 10=7.57%, 20=0.05%, 50=0.02%


Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=966037KB/s, minb=989222KB/s, maxb=989222KB/s, mint=106msec, maxt=106msec

Disk stats (read/write):
  dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=100m
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process

cachedwrite: (groupid=0, jobs=1): err= 0: pid=20555
  write: io=102400KB, bw=223581KB/s, iops=55895 , runt=   458msec
    clat (usec): min=2 , max=74675 , avg=10.57, stdev=474.70
     lat (usec): min=2 , max=74675 , avg=10.67, stdev=474.70
  cpu          : usr=3.50%, sys=30.63%, ctx=77, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/25600/0, short=0/0/0
     lat (usec): 4=55.96%, 10=43.44%, 20=0.18%, 50=0.05%, 100=0.12%
     lat (usec): 250=0.02%
     lat (msec): 2=0.21%, 4=0.01%, 10=0.01%, 100=0.01%

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=223580KB/s, minb=228946KB/s, maxb=228946KB/s, mint=458msec, maxt=458msec

Disk stats (read/write):
  dm-2: ios=0/200, merge=0/0, ticks=0/31632, in_queue=39816, util=78.23%, aggrios=0/200, aggrmerge=0/0, aggrticks=0/40580, aggrin_queue=40580, aggrutil=78.05%
    sda: ios=0/200, merge=0/0, ticks=0/40580, in_queue=40580, util=78.05%
martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=1000m
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process
cachedwrite: Laying out IO file(s) (1 file(s) / 1000MB)
Jobs: 1 (f=1)
cachedwrite: (groupid=0, jobs=1): err= 0: pid=20558
  write: io=1000.0MB, bw=581488KB/s, iops=145371 , runt=  1761msec
    clat (usec): min=2 , max=28078 , avg= 6.29, stdev=236.66
     lat (usec): min=2 , max=28078 , avg= 6.37, stdev=236.66
    bw (KB/s) : min=214976, max=1004912, per=110.42%, avg=642049.67, stdev=398863.44
  cpu          : usr=12.05%, sys=50.23%, ctx=206, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/256000/0, short=0/0/0
     lat (usec): 4=62.17%, 10=36.27%, 20=1.44%, 50=0.10%, 100=0.01%
     lat (usec): 250=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%

Run status group 0 (all jobs):
  WRITE: io=1000.0MB, aggrb=581487KB/s, minb=595443KB/s, maxb=595443KB/s, mint=1761msec, maxt=1761msec

Disk stats (read/write):
  dm-2: ios=0/714, merge=0/0, ticks=0/152156, in_queue=173448, util=71.91%, aggrios=1/631, aggrmerge=0/1, aggrticks=76/174312, aggrin_queue=196172, aggrutil=73.97%
    sda: ios=1/631, merge=0/1, ticks=76/174312, in_queue=196172, util=73.97%
martin@merkaba:~/Zeit> fio -name cachedwrite --ioengine=sync --buffered=1 --rw write --size=1000m
cachedwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.57
Starting 1 process
Jobs: 1 (f=1): [W] [-.-% done] [0K/659.1M /s] [0 /165K iops] [eta 00m:00s]
cachedwrite: (groupid=0, jobs=1): err= 0: pid=20561
  write: io=1000.0MB, bw=316440KB/s, iops=79110 , runt=  3236msec
    clat (usec): min=1 , max=120823 , avg= 6.54, stdev=342.44
     lat (usec): min=1 , max=120823 , avg= 6.61, stdev=342.44
    bw (KB/s) : min=    2, max=1111080, per=154.15%, avg=487795.00, stdev=488926.65
  cpu          : usr=6.18%, sys=30.79%, ctx=236, majf=0, minf=24
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/256000/0, short=0/0/0
     lat (usec): 2=0.17%, 4=77.44%, 10=21.17%, 20=1.00%, 50=0.09%
     lat (usec): 100=0.10%, 250=0.01%, 500=0.01%, 750=0.01%
     lat (msec): 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 250=0.01%

Run status group 0 (all jobs):
  WRITE: io=1000.0MB, aggrb=316440KB/s, minb=324034KB/s, maxb=324034KB/s, mint=3236msec, maxt=3236msec

Disk stats (read/write):
  dm-2: ios=0/1463, merge=0/0, ticks=0/378048, in_queue=411168, util=93.56%, aggrios=0/1348, aggrmerge=0/0, aggrticks=0/392324, aggrin_queue=423772, aggrutil=93.79%
    sda: ios=0/1348, merge=0/0, ticks=0/392324, in_queue=423772, util=93.79%
martin@merkaba:~/Zeit> 


* Re: Measuring IOPS
  2011-08-04 10:23                 ` Martin Steigerwald
@ 2011-08-05  7:28                   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2011-08-05  7:28 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Jeff Moyer, fio

On 2011-08-04 12:23, Martin Steigerwald wrote:
>> Not quite measuring RAM (or copy) performance, at some point fio will
>> be blocked by the OS and prevented from dirtying more memory. At that
>> point it'll either just wait, or participate in flushing out dirty
>> data. For any buffered write workload, it'll quickly de-generate into
>> that.
> 
> Which depends on the size of the job, cause I for bet 1 GB/s with 250000 
> IOPS I need some PCI express based SSD solution - a SATA-300 SSD like the 
> Intel SSD 320 in use here can´t reach this (see attached file). It seems 

Right, you'll need something state-of-the-art to reach those numbers,
and nothing on a SATA/SAS bus will be able to do that. Latencies and
transport overhead are just too large.

> with 8 GB of RAM I need more than one GB to write in order to get 
> meaningful results (related to raw SSD performance). With Ext4 delayed 
> allocation a subsequent rm might even cause the file to not be written at 
> all.

Depending on the kernel, some percentage of total memory dirty will kick
off background writing. Some higher percentage will kick off direct
reclaim. So yes, the usual rule of thumb for buffered write performance
is that the job size should be at least twice that of RAM to yield
usable results.
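
The thresholds involved can be inspected like this (a sketch; exact
defaults and sensible job sizes depend on the kernel and RAM, here
assuming roughly 8 GB as mentioned earlier in the thread):

# percentage of memory that may be dirty before background writeback starts
cat /proc/sys/vm/dirty_background_ratio
# percentage at which a writing process is throttled and helps write back
cat /proc/sys/vm/dirty_ratio

# rule of thumb: buffered write job at least twice the RAM size
fio --name=bufwrite --ioengine=sync --rw=write --bs=4k --size=16g --filename=bigfile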

> For the application side of thing it can make perfect sense to measure 
> buffered writes. But one should go with a large enough data set in order to 
> get meaningful results. At least when the application uses a large dataset 
> too ;).

Indeed.

-- 
Jens Axboe



Thread overview: 20+ messages
2011-07-29 15:37 Measuring IOPS Martin Steigerwald
2011-07-29 16:14 ` Martin Steigerwald
2011-08-02 14:32   ` Measuring IOPS (solved, I think) Martin Steigerwald
2011-08-02 19:48     ` Jens Axboe
2011-08-02 21:28       ` Martin Steigerwald
2011-08-03  7:17         ` Jens Axboe
2011-08-03  9:03           ` Martin Steigerwald
2011-08-03 10:34             ` Jens Axboe
2011-08-03 19:31 ` Measuring IOPS Martin Steigerwald
2011-08-03 20:22   ` Jeff Moyer
2011-08-03 20:33     ` Martin Steigerwald
2011-08-04  7:50       ` Jens Axboe
2011-08-03 20:42     ` Martin Steigerwald
2011-08-03 20:50       ` Martin Steigerwald
2011-08-04  8:51         ` Martin Steigerwald
2011-08-04  8:58           ` Jens Axboe
2011-08-04  9:34             ` Martin Steigerwald
2011-08-04 10:02               ` Jens Axboe
2011-08-04 10:23                 ` Martin Steigerwald
2011-08-05  7:28                   ` Jens Axboe
