fio.vger.kernel.org archive mirror
* Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!?
@ 2022-02-14 14:29 Durval Menezes
  2022-02-14 19:50 ` Sitsofe Wheeler
  0 siblings, 1 reply; 4+ messages in thread
From: Durval Menezes @ 2022-02-14 14:29 UTC (permalink / raw)
  To: fio; +Cc: jmmml

Hello everyone,

I've arrived at a very surprising number while measuring write IOPS
performance on my SSDs' "bare metal" (i.e., straight on /dev/$DISK, no
filesystem involved):

	export COMMON_OPTIONS='--ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting'

	ls -l /dev/disk/by-id | grep 'ata-.*sda'
		lrwxrwxrwx 1 root root  9 Feb 13 17:19 ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX -> ../../sda

	TANGO=/dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX
	sudo fio --filename=${TANGO} --name=device_iops_write --rw=randwrite --bs=4k  --iodepth=256 --numjobs=4 ${COMMON_OPTIONS}
		[...]
		write: *IOPS=83.1k*, BW=325MiB/s (341MB/s)(38.1GiB/120007msec)
		[...]

(please find the complete output at the end of this message, in case I should
have looked at some other lines and/or you are curious)
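
(If it helps readability: as far as I understand fio's job-file syntax, the
command line above should correspond to roughly the job file below -- my own
translation, so treat it as a sketch rather than exactly what I ran:)

	[global]
	ioengine=libaio
	direct=1
	runtime=120
	time_based
	group_reporting

	[device_iops_write]
	filename=/dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX
	rw=randwrite
	bs=4k
	iodepth=256
	numjobs=4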

As per the official manufacturer specs (both in a whitepaper at their
website[1] and in a datasheet I found elsewhere[2]), it's supposed to be
only *18K IOPS*.

All the other base performance numbers I've measured (read IOPS, read and
write MB/s, read and write latencies) are at or very near the manufacturer
specs.

What's going on?

At first I thought that, despite `--direct=1` being explicitly indicated,
my machine's 64GB RAM (via the Linux buffer cache) could be caching the
writes (though in that case the number should have been much higher)...
so, I tested it again with a longer `--runtime=600` to saturate the buffer
cache in case it really was the 'culprit'... lo and behold, the result was:

	[...]
	write: IOPS=83.1k, BW=325MiB/s (341MB/s)(190GiB/600019msec)
	[...]


So, the surprising over-4.6x-the-spec Write IOPS is maintained, even
for 190GiB of total data.

And with 190GiB of data written (about 10% of the total device capacity),
I do not believe it's any kind of cache (RAM, MLC or whatever) inside the
SSD either.

I even considered that I could have got some kind of 'unicorn' device, so I
repeated all the tests on my other SSD (same model and firmware, but a little
older -- date of manufacture on the paper label about 3 months earlier), and
got almost exactly the same results (less than 1% variation). I do not
believe I got *two* over-4.6x-faster-than-spec 'unicorns' out of a used
eBay SSD sale...

So, what gives? Not being a `fio` expert, the obvious answer is that I made
some kind of mistake in the command line above, but if so, for the life of
me I can't see it.

Thanks in advance for all tips, hints and cluebats from you `fio` connoisseurs...
and please have no mercy: in case I messed up, both the face and the palm here are ready to meet each other... ;-)

PS: in case it matters, this was running Ubuntu 18.04.6 with kernel 4.15.0-167-generic and fio-3.1 installed via
`apt-get install` from the official distro repo.

[1] https://www.samsung.com/semiconductor/global.semi.static/PM863_White_Paper-0.pdf,
p.4: "4 KB Random R/*W* (IOPs) Up to 99,000 / *18,000 IOPS*";

[2] https://www.compuram.de/documents/datasheet/PM863_SAMSUNG.pdf,
p.6: "Random Write IOPS (4 KB) [then, in the column for the 1,920GB model] 18K IOPS"

Cheers,
-- 
   Durval.

$ export COMMON_OPTIONS='--ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting'
$ ls -l /dev/disk/by-id | grep 'ata-.*sda'
lrwxrwxrwx 1 root root  9 Feb 13 17:19 ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX -> ../../sda
$ TANGO=/dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX
$ sudo fio --filename=${TANGO} --name=device_iops_write --rw=randwrite            --bs=4k  --iodepth=256 --numjobs=4 ${COMMON_OPTIONS}
device_iops_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.1
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][r=0KiB/s,w=326MiB/s][r=0,w=83.4k IOPS][eta 00m:00s]
device_iops_write: (groupid=0, jobs=4): err= 0: pid=27042: Sun Feb 13 16:19:10 2022
  write: IOPS=83.1k, BW=325MiB/s (341MB/s)(38.1GiB/120007msec)
    slat (nsec): min=1504, max=13545k, avg=46653.86, stdev=230086.16
    clat (usec): min=769, max=39675, avg=12267.87, stdev=3172.09
     lat (usec): min=772, max=39681, avg=12314.59, stdev=3187.43
    clat percentiles (usec):
     |  1.00th=[ 6521],  5.00th=[ 7963], 10.00th=[ 8717], 20.00th=[ 9765],
     | 30.00th=[10421], 40.00th=[11207], 50.00th=[11863], 60.00th=[12518],
     | 70.00th=[13435], 80.00th=[14484], 90.00th=[16319], 95.00th=[17957],
     | 99.00th=[22414], 99.50th=[23987], 99.90th=[26870], 99.95th=[28181],
     | 99.99th=[30802]
   bw (  KiB/s): min=58880, max=101216, per=25.00%, avg=83130.06, stdev=8366.08, samples=959
   iops        : min=14720, max=25304, avg=20782.51, stdev=2091.52, samples=959
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=23.35%, 20=74.16%, 50=2.47%
  cpu          : usr=3.21%, sys=9.27%, ctx=425656, majf=0, minf=28
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=0,9977029,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: bw=325MiB/s (341MB/s), 325MiB/s-325MiB/s (341MB/s-341MB/s), io=38.1GiB (40.9GB), run=120007-120007msec
$ 


* Re: Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!?
  2022-02-14 14:29 Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!? Durval Menezes
@ 2022-02-14 19:50 ` Sitsofe Wheeler
  0 siblings, 0 replies; 4+ messages in thread
From: Sitsofe Wheeler @ 2022-02-14 19:50 UTC (permalink / raw)
  To: Durval Menezes; +Cc: fio

Hi,

On Mon, 14 Feb 2022 at 18:44, Durval Menezes <jmmml@durval.com> wrote:
>
> Hello everyone,
>
> I've arrived at a very surprising number measuring IOPS write performance
> on my SSDs' "bare metal" (ie, straight on the /dev/$DISK, no filesystem
> involved):
>
>         export COMMON_OPTIONS='--ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting'
>
>         ls -l /dev/disk/by-id | grep 'ata-.*sda'
>                 lrwxrwxrwx 1 root root  9 Feb 13 17:19 ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX -> ../../sda
>
>         TANGO=/dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX
>         sudo fio --filename=${TANGO} --name=device_iops_write --rw=randwrite --bs=4k  --iodepth=256 --numjobs=4 ${COMMON_OPTIONS}
>                 [...]
>                 write: *IOPS=83.1k*, BW=325MiB/s (341MB/s)(38.1GiB/120007msec)
>                 [...]
>
> (please find the complete output at the end of this message, in case I should
> have looked at some other lines and/or you are curious)
>
> As per the official manufacturer specs (both in a whitepaper at their
> website[1] and in a datasheet I found elsewhere[2]), it's supposed to be
> only *18K IOPS*.
>
> All the other base performance numbers I've measured (read IOPS, read and
> write MB/s, read and write latencies) are at or very near the manufacturer
> specs.
>
> What's going on?
>
> At first I thought that, despite `--direct=1` being explicitly indicated,
> my machine's 64GB RAM (via the Linux buffer cache) could be caching the
> writes (though in that case the number should have been much higher)...
> so, I tested it again with a longer `--runtime=600` to saturate the buffer
> cache in case it really was the 'culprit'... lo and behold, the result was:
>
>         [...]
>         write: IOPS=83.1k, BW=325MiB/s (341MB/s)(190GiB/600019msec)
>         [...]
>
>
> So, the surprising over-4.6x-the-spec Write IOPS is maintained, even
> for 190GiB of total data.
>
> And with 190GiB of data written (about 10% of the total device capacity),
> I do not believe it's any kind of cache (RAM, MLC or whatever) inside the
> SSD either.

You're running your workload for a comparatively short time and,
additionally, we don't know how "fresh" your SSD is. The 18K IOPS value
might be for when the drive has been fully written and there are no
pre-erased blocks left (a state reached via so-called preconditioning)...
I'll also note the whitepaper [1] mentions this:

> SSD Precondition: Sustained state (or steady state)

[...]

> It's important to note that all performance items mentioned in this white paper have been measured at the sustained state, except the sequential read/write performance

I notice that your SSD appears to be SATA (sda), so I'd be surprised if a
total queue depth greater than 32 made a difference (your total queue depth
is 1024). Do you get a similar result with just one job at iodepth=32?
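
E.g. something along these lines (untested -- just your original invocation
with the job count and queue depth reduced):

sudo fio --filename=${TANGO} --name=device_iops_write --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 ${COMMON_OPTIONS}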

It's unlikely, but if the jobs were submitting I/O to the same areas as each
other at the same time then some of the I/O could be elided; given what
you've posted, though, this should not be the case.

-- 
Sitsofe


* Re: Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!?
@ 2022-02-15 16:48 Durval Menezes
  0 siblings, 0 replies; 4+ messages in thread
From: Durval Menezes @ 2022-02-15 16:48 UTC (permalink / raw)
  To: fio; +Cc: Sitsofe Wheeler


Hi Sitsofe,

On Tue, Feb 15, 2022 at 12:32 PM Durval Menezes (MML) <jmmml@durval.com> wrote:
> On Mon, Feb 14, 2022 at 4:51 PM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> > [...]
> > The 18K IOPS value
> > might be when the drive has been fully written and there are no
> > pre-erased blocks available (via so-called preconditioning)... I'll
> > also note the whitepaper [1] mentions this:
> >
> > 	SSD Precondition: Sustained state (or steady state)
> > 	[...]
> > 	It's important to note that all performance items mentioned in this
> > 	white paper have been measured at the sustained state, except the
> > 	sequential read/write performance
>
> Thanks for going through the whitepaper and picking this up. It passed
> right by me...
>
> Anyway.... hummmrmrmrmr... I did a full "Secure erase" on the drive before
> starting these tests... perhaps that was it?
>
> Anyway, I went through the whitepaper again, and found this:
>
> 	The sustained state in this document refers to the status that a
> 	128 KB sequential write has been completed equal to the drive capacity and
> 	then 4 KB random write has completed twice as much as the drive capacity
>
> OK, so at least there's a "recipe" for this preconditioning. I will try it
> and come back later to report.

That nailed it! Here's what I did to implement the "recipe":

a) Wrote random data in 128KB-sized blocks sequentially to the drive, until reaching the end of the device:

	date; openssl enc -rc4-40 -pass "pass:`dd bs=128 count=1 </dev/urandom 2>/dev/null`" </dev/zero | dd bs=128K of=/dev/sda oflag=direct iflag=fullblock; date
		Mon Feb 14 17:42:03 -03 2022
		dd: error writing '/dev/sda': No space left on device
		14651363+0 records in
		14651362+0 records out
		error writing output file
		1920383410176 bytes (1.9 TB, 1.7 TiB) copied, 6460.56 s, 297 MB/s
		Mon Feb 14 19:29:44 -03 2022
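
(As an aside, I suppose the same 128 KiB sequential fill could also have been
done with fio itself instead of openssl+dd -- something along the lines below,
though I haven't verified it produces an identical data pattern:)

	sudo fio --filename=/dev/sda --name=precondition_seq_fill --rw=write --bs=128k --iodepth=32 --numjobs=1 --ioengine=libaio --direct=1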

b) And then ran a random-write fio test with 4KB blocks and a total I/O size equal to twice the SSD capacity:

	fdisk -l /dev/sda | grep ^Disk
		Disk /dev/sda: 1.8 TiB, 1920383410176 bytes, 3750748848 sectors
	export SIZE=1920383410176
	date; fio --filename=/dev/sda --name=device_iops_write --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --size=${SIZE} --io_size=`expr ${SIZE} \* 2` --ioengine=libaio --direct=1 --group_reporting
		Mon Feb 14 19:29:44 -03 2022
		device_iops_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
		fio-3.1
		Starting 1 process
		[...]
		Jobs: 1 (f=1): [w(1)][11.1%][r=0KiB/s,w=71.6MiB/s][r=0,w=18.3k IOPS][eta
		07h:51m:36s]
		[...]
		^C
		fio: terminating on signal 2

So, it didn't even take writing 2x the SSD capacity with random data to
bring write IOPS down to spec: just 11.1% of it was enough (which led me to
interrupt the test -- no use eating up more write cycles on its NAND now
that the 'issue' is explained).
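
(For scale: 11.1% of the 2x-capacity target is roughly 0.43 TB of 4 KiB
random writes, on top of the full ~1.9 TB sequential fill from step (a).)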

So, case closed. Thank you very much for helping me nail this down; I really
like it when things make sense.

Cheers,
-- 
   Durval.


* Re: Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!?
@ 2022-02-15 16:40 Durval Menezes
  0 siblings, 0 replies; 4+ messages in thread
From: Durval Menezes @ 2022-02-15 16:40 UTC (permalink / raw)
  To: fio; +Cc: Sitsofe Wheeler


Hi Sitsofe,

First of all, thanks for your detailed, thoughtful response. More, below:

On Mon, Feb 14, 2022 at 4:51 PM Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> On Mon, 14 Feb 2022 at 18:44, Durval Menezes <jmmml@durval.com> wrote:
> >
> > Hello everyone,
> >
> > I've arrived at a very surprising number measuring IOPS write performance
> > on my SSDs' "bare metal" (ie, straight on the /dev/$DISK, no filesystem
> > involved):
> >
> >         export COMMON_OPTIONS='--ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting'
> >
> >         ls -l /dev/disk/by-id | grep 'ata-.*sda'
> >                 lrwxrwxrwx 1 root root  9 Feb 13 17:19 ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX -> ../../sda
> >
> >         TANGO=/dev/disk/by-id/ata-SAMSUNG_MZ7LM1T9HCJM-00003_XXXXXXXXXXXXXX
> >         sudo fio --filename=${TANGO} --name=device_iops_write --rw=randwrite --bs=4k  --iodepth=256 --numjobs=4 ${COMMON_OPTIONS}
> >                 [...]
> >                 write: *IOPS=83.1k*, BW=325MiB/s (341MB/s)(38.1GiB/120007msec)
> >                 [...]
> >
> > (please find the complete output at the end of this message, in case I should
> > have looked at some other lines and/or you are curious)
> >
> > As per the official manufacturer specs (both in a whitepaper at their
> > website[1] and in a datasheet I found elsewhere[2]), it's supposed to be
> > only *18K IOPS*.
> >
> > All the other base performance numbers I've measured (read IOPS, read and
> > write MB/s, read and write latencies) are at or very near the manufacturer
> > specs.
> >
> > What's going on?
> >
> > At first I thought that, despite `--direct=1` being explicitly indicated,
> > my machine's 64GB RAM (via the Linux buffer cache) could be caching the
> > writes (though in that case the number should have been much higher)...
> > so, I tested it again with a longer `--runtime=600` to saturate the buffer
> > cache in case it really was the 'culprit'... lo and behold, the result was:
> >
> >         [...]
> >         write: IOPS=83.1k, BW=325MiB/s (341MB/s)(190GiB/600019msec)
> >         [...]
> >
> >
> > So, the surprising over-4.6x-the-spec Write IOPS is maintained, even
> > for 190GiB of total data.
> >
> > And with 190GiB of data written (about 10% of the total device capacity),
> > I do not believe it's any kind of cache (RAM, MLC or whatever) inside the
> > SSD either.
> >
> You're running your workload for a comparatively short time

OK, I found the manufacturer stating in the whitepaper (see below) that the
random writes should be run for twice the capacity of the disk. That also
implies a much longer test time...
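
(Back-of-envelope, assuming the drive really does drop to the rated ~18K IOPS
at steady state: twice the 1.92 TB capacity in 4 KiB random writes is roughly
940 million I/Os, which at 18,000 IOPS is on the order of 14-15 hours.)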

> and additionally we don't know how "fresh" your SSD is.
 
Good point; here's its "freshness"-relevant data straight from `smartctl -a`:
	 
	   9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       43694
	 177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       394
	 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       535797643689
	 242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       1848967801660
	 251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       1642499721864

So, I think it's a pretty 'mature' disk already (but hopefully with a lot
of 'life' still in it).

In other words, I don't think it's "fresh" enough to explain a more-than-4x
write IOPS increase.

Perhaps "freshness" in this case refers to it having been recently
secure-erased (which I did before starting these tests)?
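
(For reference, by "secure erase" I mean an ATA Secure Erase; the usual
hdparm sequence is roughly the one below -- shown only as a sketch, the exact
invocation I used may have differed slightly:)

	sudo hdparm --user-master u --security-set-pass Eins /dev/sda
	sudo hdparm --user-master u --security-erase Eins /dev/sda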
 
> The 18K IOPS value
> might be when the drive has been fully written and there are no
> pre-erased blocks available (via so-called preconditioning)... I'll
> also note the whitepaper [1] mentions this:
>
> 	SSD Precondition: Sustained state (or steady state)
> 	[...]
> 	It's important to note that all performance items mentioned in this
> 	white paper have been measured at the sustained state, except the
> 	sequential read/write performance
>

Thanks for going through the whitepaper and picking this up. It passed
right by me...

I went through the whitepaper again, and found this:

	The sustained state in this document refers to the status that a
	128 KB sequential write has been completed equal to the drive capacity and
	then 4 KB random write has completed twice as much as the drive capacity
	
OK, so at least there's a "recipe" for this preconditioning. I will try it
and come back later to report.

> I notice that your SSD appears to be SATA (sda) so I'd be surprised
> that a total queue depth greater than 32 makes a difference (your
> total queue depth is 1024). Do you get a similar result with just the
> one job with an iodepth=32?

I tested with iodepth=32 (instead of 256) and got the same result, so I
guess you are not surprised ;-)

Just did it again, this time with `--numjobs=1` (instead of 4), and here's
the result:

        write: IOPS=83.0k, BW=324MiB/s (340MB/s)(38.0GiB/120001msec)

So that's not it either.

> It's unlikely but if the jobs were submitting I/O to the same areas as
> other jobs at the same time then some of the I/O could be elided but
> given what you've posted this should not be the case.

Agreed.

Cheers,
-- 
   Durval.

> --
> Sitsofe


end of thread, other threads:[~2022-02-15 16:45 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-14 14:29 Samsung PM863 SSD: surprisingly high Write IOPS measured using `fio`, over 4.6 times more than spec!? Durval Menezes
2022-02-14 19:50 ` Sitsofe Wheeler
2022-02-15 16:40 Durval Menezes
2022-02-15 16:48 Durval Menezes
