* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
       [not found] <ZB1JgJ2DxyTMVUHB@hornet>
@ 2023-03-24  8:43 ` Damien Le Moal
  2023-03-24 21:19   ` Alexander Shumakovitch
  2023-03-25  0:33   ` Alexander Shumakovitch
  2023-03-24 19:34 ` Keith Busch
  1 sibling, 2 replies; 8+ messages in thread
From: Damien Le Moal @ 2023-03-24  8:43 UTC (permalink / raw)
  To: Alexander Shumakovitch, linux-nvme

On 3/24/23 15:56, Alexander Shumakovitch wrote:
> [ please copy me on your replies since I'm not subscribed to this list ]
> 
> Hello all,
> 
> I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> speed from this SSD differs by a factor of 10 (ten!), depending on which
> physical CPU hdparm or dd is run on:
>       
>     # hdparm -t /dev/nvme0n1 

It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
benchmark an nvme device. At the very least, if you really want to measure the
drive performance, you should add the --direct option (see man hdparm).
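
Something like the following is enough to bypass the page cache (same device
node as in your report):

    # hdparm -t --direct /dev/nvme0n1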

But a better way to test would be to use fio with the io_uring or libaio IO
engine, doing multi-job, high-QD --direct=1 IOs. That will give you the maximum
performance of your device. Then remove the --direct=1 option to do buffered
IOs, which will expose potential issues with your system's memory bandwidth.
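
As a rough example (the job parameters here are only illustrative, adjust them
to your setup):

    # fio --name=nvme-read --filename=/dev/nvme0n1 --rw=randread --bs=4k \
          --ioengine=io_uring --iodepth=32 --numjobs=8 --direct=1 \
          --runtime=30 --time_based --group_reporting

Run it once with and once without --direct=1 and compare the two.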

>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 510 MB in  3.01 seconds = 169.28 MB/sec
>     
>     # taskset -c 0-7 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 5252 MB in  3.00 seconds = 1750.28 MB/sec
>     
>     # taskset -c 8-15 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 496 MB in  3.01 seconds = 164.83 MB/sec
>     
>     # taskset -c 24-31 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 520 MB in  3.01 seconds = 172.65 MB/sec
> 
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached). Please note that
> I'm not worrying about the fine tuning of the performance at this point,
> and in particular I'm perfectly fine with 1/2 of the theoretical reading
> speed. I just want to understand where 90% of the bandwidth gets lost.
> No error of any kind appears in the syslog.
> 
> I don't think this is NUMA related since the QPI interconnect runs as
> specced at 4GB/sec, when measured by Intel's Memory Latency Checker, more
> than enough for NVMe to run at full speed. Also, the CUDA benchmark test
> runs at expected speeds across the QPI.
> 
> Just in case, I'm attaching the output of lstopo to this message. Please
> note that this computer has a BIOS bug that doesn't let the kernel populate
> the values of numa_node in /sys/devices/pci0000:* automatically, so I have
> to do this myself after each boot.
> 
> I've tried removing all other PCI add-on cards, moving the SSD to another
> slot, changing the number of polling queues for the nvme driver, and even
> setting dm-multipath up. But none of these makes any material difference
> in reading speed.
> 
> System info: Debian 11.6 (stable) running Linux 5.19.11 (config file attached)
> Output of "nvme list":
> 
>     Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
>     ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
>     /dev/nvme0n1     S58SNS0R705048H      Samsung SSD 970 EVO Plus 500GB           1           0.00   B / 500.11  GB    512   B +  0 B   2B2QEXM7
> 
> Output of "nvme list-subsys"":
> 
>     nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:144d144dS58SNS0R705048H     Samsung SSD 970 EVO Plus 500GB          
>     \
>      +- nvme0 pcie 0000:03:00.0 live 
> 
> I would be grateful if you could point me in the right direction. I'm
> attaching outputs of the following commands to this message: dmesg,
> "cat /proc/cpuinfo", "ls -vvv", lstopo, and dd (both for reading from
> and writing to this SSD). Please let me know if you need any other info
> from me.
> 
> Thank you,
> 
>    Alex Shumakovitch

-- 
Damien Le Moal
Western Digital Research




* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
       [not found] <ZB1JgJ2DxyTMVUHB@hornet>
  2023-03-24  8:43 ` Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine Damien Le Moal
@ 2023-03-24 19:34 ` Keith Busch
  2023-03-24 21:38   ` Alexander Shumakovitch
  1 sibling, 1 reply; 8+ messages in thread
From: Keith Busch @ 2023-03-24 19:34 UTC (permalink / raw)
  To: Alexander Shumakovitch; +Cc: linux-nvme

On Fri, Mar 24, 2023 at 06:56:03AM +0000, Alexander Shumakovitch wrote:
> physical CPU hdparm or dd is run on:
>       
>     # hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 510 MB in  3.01 seconds = 169.28 MB/sec
>     
>     # taskset -c 0-7 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 5252 MB in  3.00 seconds = 1750.28 MB/sec
>     
>     # taskset -c 8-15 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 496 MB in  3.01 seconds = 164.83 MB/sec
>     
>     # taskset -c 24-31 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 520 MB in  3.01 seconds = 172.65 MB/sec
> 
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached).

When writing host->dev, there is no cache coherency to consider so it'll always
be faster in NUMA situations. Reading dev->host does, and can have considerable
overhead, though 10x seems a bit high.

Retrying with Damien's O_DIRECT suggestion is a good idea.

Also, 'taskset' only pins the CPUs the process schedules on, but not the memory
node it allocates from. Try 'numactl' instead for local node allocations.
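
For example (node numbers are just illustrative, pick the node the slot is
wired to):

    # numactl --cpunodebind=0 --membind=0 hdparm -t --direct /dev/nvme0n1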



* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-24  8:43 ` Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine Damien Le Moal
@ 2023-03-24 21:19   ` Alexander Shumakovitch
  2023-03-25  1:52     ` Damien Le Moal
  2023-03-25  0:33   ` Alexander Shumakovitch
  1 sibling, 1 reply; 8+ messages in thread
From: Alexander Shumakovitch @ 2023-03-24 21:19 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-nvme

Hi Damien,

Thanks a lot for your thoughtful reply. The main reason I used hdparm and dd
to benchmark the performance is that they are included with every live distro.
I didn't want to install an OS before confirming that the hardware works as
expected.

Back to the main topic, it didn't occur to me that the --direct option can
have such a profound impact on reading speeds, but it does. With this
option enabled, most of the discrepancies in reading speeds from different
nodes disappear. The same happens when using dd with "iflag=direct". This
should imply that the issue is with the access time to the kernel's read
cache, correct? On the other hand, MLC shows completely reasonable latency
and bandwidth numbers between the nodes, see below.
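
For reference, the direct-mode dd runs were of this kind (block size and count
are only illustrative):

    # taskset -c 8-15 dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct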

So what could be the culprit, and in which direction should I continue
digging? If hdparm and dd have trouble accessing the read cache, then so will
every other read-intensive program. Could this happen because of the lack of
(correct) NUMA affinity for certain IRQs? I understand that this question
might not be NVMe-specific anymore, but I would be grateful for any pointers.

Thank you,

  --- Alex.

# ./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --bandwidth_matrix

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0        25328.8  4131.8  4013.0  4541.0
       1         4180.3 24696.3  4501.2  3996.3
       2         4017.7  4535.5 25746.4  4105.7
       3         4488.1  4024.0  4157.0 25467.7

# ./mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --latency_matrix

Using buffer size of 200.000MiB
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1       2       3
       0          71.7   245.9   257.5   239.5
       1         156.4    71.8   238.3   256.3
       2         250.6   237.9    71.8   245.1
       3         238.4   252.5   237.9    71.9


On Fri, Mar 24, 2023 at 05:43:42PM +0900, Damien Le Moal wrote:
> 
> On 3/24/23 15:56, Alexander Shumakovitch wrote:
> > [ please copy me on your replies since I'm not subscribed to this list ]
> >
> > Hello all,
> >
> > I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> > 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> > Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> > CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> > speed from this SSD differs by a factor of 10 (ten!), depending on which
> > physical CPU hdparm or dd is run on:
> >
> >     # hdparm -t /dev/nvme0n1
> 
> It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
> benchmark an nvme device. At the very least, if you really want to measure the
> drive performance, you should add the --direct option (see man hdparm).
> 
> But a better way to test would be to use fio with the io_uring or libaio IO
> engine, doing multi-job, high-QD --direct=1 IOs. That will give you the
> maximum performance of your device. Then remove the --direct=1 option to do
> buffered IOs, which will expose potential issues with your system's memory
> bandwidth.
> 


* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-24 19:34 ` Keith Busch
@ 2023-03-24 21:38   ` Alexander Shumakovitch
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Shumakovitch @ 2023-03-24 21:38 UTC (permalink / raw)
  To: linux-nvme

Thank you, Keith. As I have just written to Damien, I've started testing
this hardware from a live USB stick distro, which included 'taskset', but
not 'numactl'. But given the large amount of RAM on the server in question,
the kernel should have taken care of the memory pinning anyway.

In any case, it looks like the main issue is indeed with access to the read
cache, so now I have to figure out what to do about it.

Thanks,

  --- Alex.

On Fri, Mar 24, 2023 at 01:34:51PM -0600, Keith Busch wrote:
> When writing host->dev, there is no cache coherency to consider so it'll
> always be faster in NUMA situations. Reading dev->host does, and can have
> considerable overhead, though 10x seems a bit high.
> 
> Retrying with Damien's O_DIRECT suggestion is a good idea.
> 
> Also, 'taskset' only pins the CPUs the process schedules on, but not the
> memory node it allocates from. Try 'numactl' instead for local node
> allocations.


* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-24  8:43 ` Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine Damien Le Moal
  2023-03-24 21:19   ` Alexander Shumakovitch
@ 2023-03-25  0:33   ` Alexander Shumakovitch
  2023-03-25  1:56     ` Damien Le Moal
  1 sibling, 1 reply; 8+ messages in thread
From: Alexander Shumakovitch @ 2023-03-25  0:33 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-nvme

Hi Damien,

Just to add to my previous message, I've run the same set of tests on a
small SATA SSD boot drive (Kingston A400) attached to the same system, and
it turned out to be more or less node- and I/O-mode-agnostic, producing
consistent read speeds of about 450MB/sec in direct I/O mode and about
480MB/sec in cached I/O mode. In particular, cached reads on a "wrong" NUMA
node were significantly faster for this SATA SSD than for the NVMe one at
about 170MB/sec (both drives are connected to CPU #0).

So my question becomes: why is the NVMe driver susceptible to (very) slow
cached reads, while the AHCI one is not? Are there some fundamental
differences in how AHCI and NVMe block devices handle the page cache?

Thank you,

  --- Alex.

On Fri, Mar 24, 2023 at 05:43:42PM +0900, Damien Le Moal wrote:
> It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
> benchmark an nvme device. At the very least, if you really want to measure the
> drive performance, you should add the --direct option (see man hdparm).
> 
> But a better way to test would be to use fio with the io_uring or libaio IO
> engine, doing multi-job, high-QD --direct=1 IOs. That will give you the
> maximum performance of your device. Then remove the --direct=1 option to do
> buffered IOs, which will expose potential issues with your system's memory
> bandwidth.
> 


* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-24 21:19   ` Alexander Shumakovitch
@ 2023-03-25  1:52     ` Damien Le Moal
  2023-03-31  7:53       ` Alexander Shumakovitch
  0 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2023-03-25  1:52 UTC (permalink / raw)
  To: Alexander Shumakovitch; +Cc: linux-nvme

On 3/25/23 06:19, Alexander Shumakovitch wrote:
> Hi Damien,
> 
> Thanks a lot for your thoughtful reply. The main reason why I used hdparm
> and dd to benchmark the performance is because they are included with every
> live distro. I didn't want to install an OS before confirming that hardware
> works as expected.

You could install the OS on a USB stick to add fio.

> 
> Back to the main topic, it didn't occur to me that the --direct option can
> have such a profound impact on reading speeds, but it does. With this
> option enabled, most of the discrepancies in reading speeds from different
> nodes disappear. The same happens when using dd with "iflag=direct". This
> should imply that the issue is with the access time to the kernel's read
> cache, correct? On the other hand, MLC shows completely reasonable latency
> and bandwidth numbers between the nodes, see below.
> 
> So what could be the culprit and in which direction should I continue
> digging? If hdparm and dd have issues with accessing the read cache, then
> so will every other read-intensive program. Could this happen because of
> the lack of the (correct) NUMA affinity for certain IRQs? I understand that
> this question might not be NVMe-specific anymore, but would be grateful for
> any pointer.

For fast block devices, the overhead of the page management and memory copies
done when using the page cache is very visible. Nothing can be done about that.
Any application, fio included, will most of the time show slower performance
because of that overhead. This is not always true (e.g. sequential reads with
read-ahead should be just fine), but at the very least you will see a higher
CPU load.

dd and hdparm will also exercise the drive at QD=1, which is far from ideal
when trying to measure the maximum throughput of a device, unless one uses very
large IO sizes.
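
For instance (sizes illustrative), a large-block QD=1 direct read would be
something like:

    # dd if=/dev/nvme0n1 of=/dev/null bs=16M count=256 iflag=direct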

> # ./mlc --bandwidth_matrix
> Intel(R) Memory Latency Checker - v3.10
> Command line parameters: --bandwidth_matrix
> 
> Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
> Measuring Memory Bandwidths between nodes within system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
>                 Numa node
> Numa node            0       1       2       3
>        0        25328.8  4131.8  4013.0  4541.0
>        1         4180.3 24696.3  4501.2  3996.3
>        2         4017.7  4535.5 25746.4  4105.7
>        3         4488.1  4024.0  4157.0 25467.7

Here you can see that local memory bandwidth is very high, but about 6x lower
when crossing NUMA nodes. So unless the application explicitly uses libnuma to
do direct IOs using same-node memory, this difference will be apparent with the
page cache due to the balancing of page allocations between nodes. And there is
the copy back to user space itself, which doubles the memory bandwidth needed.

Use fio and see its options for pinning jobs to CPUs and using libnuma for IO
buffers. You can then run different benchmarks to see the effect of having to
cross NUMA nodes for IOs.
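
For example (CPU and node numbers illustrative; --numa_mem_policy requires fio
built with libnuma support):

    # fio --name=numa-test --filename=/dev/nvme0n1 --rw=randread --bs=4k \
          --ioengine=io_uring --iodepth=32 --numjobs=8 --direct=1 \
          --cpus_allowed=8-15 --numa_mem_policy=bind:1 \
          --runtime=30 --time_based --group_reporting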

There are plenty of papers and information about this subject (NUMA memory
management and its effect on performance) all over the place...

-- 
Damien Le Moal
Western Digital Research




* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-25  0:33   ` Alexander Shumakovitch
@ 2023-03-25  1:56     ` Damien Le Moal
  0 siblings, 0 replies; 8+ messages in thread
From: Damien Le Moal @ 2023-03-25  1:56 UTC (permalink / raw)
  To: Alexander Shumakovitch; +Cc: linux-nvme

On 3/25/23 09:33, Alexander Shumakovitch wrote:
> Hi Damien,
> 
> Just to add to my previous message, I've run the same set of tests on a
> small SATA SSD boot drive (Kingston A400) attached to the same system, and
> it turned out to be more or less node and I/O mode agnostic, producing
> consistent reading speeds of about 450MB/sec in the direct I/O mode and
> about 480MB/sec in the cached I/O mode. In particular, the cashed mode on
> a "wrong" NUMA node was significantly faster for this SATA SSD drive than
> for a NVMe one at about 170MB/sec (both drives are connected to CPU #0).

That is because the device itself is slower, so the page cache and NUMA overhead
is not really impacting the results. Try an HDD and you will see that it is
almost impossible to measure any difference.

> So my question becomes: why is the NVMe driver susceptible to (very) slow
> cached reads, while the AHCI one is not? Are there some fundamental
> differences in how AHCI and NVMe block devices handle page cache?

Because the device latency is much lower, so relatively, the overhead of the
page cache and NUMA is much larger. That overhead is the same in absolute terms
for any device, but compared to the device latency, it is a small % for slow
devices and a high % of the overall IO latency for fast devices.


-- 
Damien Le Moal
Western Digital Research




* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
  2023-03-25  1:52     ` Damien Le Moal
@ 2023-03-31  7:53       ` Alexander Shumakovitch
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Shumakovitch @ 2023-03-31  7:53 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-nvme

[-- Attachment #1: Type: text/plain, Size: 4222 bytes --]

Thanks a lot, Damien. This was very helpful indeed. As you suggested, I've run
a few fio tests with the libaio and io_uring engines at QD=32 and different
numbers of jobs. The results were mostly consistent between the two engines,
except for random reads in cached mode. In the case of libaio, there was
virtually no difference between the nodes, and the bandwidth steadily increased
with the number of jobs, which made sense to me after your explanations.

But for io_uring, node #0 was getting progressively faster as the number of
jobs increased, while the other three were getting slower; see the summary
tables below. Does this make sense to you? I understand that the libaio engine
might ignore the iodepth setting in cached mode, but a smaller QD should make
things slower, not faster, shouldn't it? For your information, I also attach
complete fio outputs for a few boundary cases.

The main thing I'm still concerned about is that not all Linux subsystems may
be fully NUMA-aware on this machine. As I wrote, it has a buggy BIOS that
doesn't tell the OS its NUMA configuration. I populate the values of numa_node
in /sys/devices/pci0000:* myself after each boot, but this might not be enough.
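
Concretely, that amounts to something like the following for each affected
device (shown here for the NVMe adapter; the node number depends on the slot):

    # echo 0 > /sys/bus/pci/devices/0000:03:00.0/numa_node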

Thank you,

  --- Alex.

Benchmarks for random reads: bs = 4k, iodepth = 32 (in MB/s):

         ||   libaio engine, cached mode  ||  io_uring engine, cached mode |
    jobs || CPU#0 | CPU#1 | CPU#2 | CPU#3 || CPU#0 | CPU#1 | CPU#2 | CPU#3 |
   -------------------------------------------------------------------------
      1  ||  47.5 |  46.2 |  46.0 |  46.5 ||   330 |   285 |   281 |   252 |
      2  ||  94.2 |  91.8 |  90.9 |  91.8 ||   571 |   189 |   186 |   203 |
      4  ||   180 |   176 |   175 |   176 ||  1108 |   184 |   191 |   219 |
      8  ||   331 |   322 |   319 |   322 ||  1142 |   170 |   174 |   177 |
     16  ||   585 |   554 |   545 |   552 ||  1353 |   175 |   173 |   180 |
   -------------------------------------------------------------------------
   
         ||   libaio engine, direct mode  ||  io_uring engine, direct mode |
    jobs || CPU#0 | CPU#1 | CPU#2 | CPU#3 || CPU#0 | CPU#1 | CPU#2 | CPU#3 |
   -------------------------------------------------------------------------
      1  ||   544 |   520 |   477 |  519  ||   506 |   558 |   532 |   476 |
      2  ||  1034 |   928 |   943 |  996  ||  1028 |   938 |  1023 |  1004 |
      4  ||  1139 |  1138 |  1138 | 1139  ||  1138 |  1138 |  1138 |  1138 |
      8  ||  1140 |  1141 |  1141 | 1141  ||  1142 |  1142 |  1141 |  1141 |
     16  ||  1141 |  1135 |  1112 | 1136  ||  1141 |  1130 |  1133 |  1135 |
   -------------------------------------------------------------------------
   

Benchmarks for sequential reads: bs = 256k, iodepth = 32, numjobs = 1.

   |   libaio engine, cached mode  ||  io_uring engine, cached mode |
   | CPU#0 | CPU#1 | CPU#2 | CPU#3 || CPU#0 | CPU#1 | CPU#2 | CPU#3 |
   ------------------------------------------------------------------
   |  1411 |   160 |   159 |   163 ||  1355 |   160 |   159 |   163 |
   ------------------------------------------------------------------
   
   |   libaio engine, direct mode  ||  io_uring engine, direct mode |
   | CPU#0 | CPU#1 | CPU#2 | CPU#3 || CPU#0 | CPU#1 | CPU#2 | CPU#3 |
   ------------------------------------------------------------------
   |  3627 |  2160 |  1637 |  2184 ||  3627 |  2076 |  1756 |  2167 |
   ------------------------------------------------------------------
   

On Sat, Mar 25, 2023 at 10:52:02AM +0900, Damien Le Moal wrote:
> For fast block devices, the overhead of the page management and memory
> copies done when using the page cache is very visible. Nothing can be done
> about that. Any application, fio included, will most of the time show slower
> performance because of that overhead. This is not always true (e.g.
> sequential reads with read-ahead should be just fine), but at the very least
> you will see a higher CPU load.
> 
> dd and hdparm will also exercise the drive at QD=1, which is far from ideal
> when trying to measure the maximum throughput of a device, unless one uses
> very large IO sizes.
> 

[-- Attachment #2: fio-io_uring-iodepth_32-numjobs_16_cached-node0.txt --]
[-- Type: text/plain, Size: 1895 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 16 processes

nvme0: (groupid=0, jobs=16): err= 0: pid=15807: Sat Mar 25 21:16:24 2023
  read: IOPS=330k, BW=1290MiB/s (1353MB/s)(37.8GiB/30003msec)
    slat (nsec): min=1938, max=225749, avg=14434.23, stdev=6505.65
    clat (usec): min=4, max=5997, avg=1533.09, stdev=566.96
     lat (usec): min=7, max=6022, avg=1547.96, stdev=568.55
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  269], 10.00th=[  783], 20.00th=[ 1139],
     | 30.00th=[ 1352], 40.00th=[ 1483], 50.00th=[ 1582], 60.00th=[ 1680],
     | 70.00th=[ 1795], 80.00th=[ 1942], 90.00th=[ 2180], 95.00th=[ 2409],
     | 99.00th=[ 2900], 99.50th=[ 3130], 99.90th=[ 3589], 99.95th=[ 3785],
     | 99.99th=[ 4228]
   bw (  MiB/s): min= 1077, max= 2530, per=100.00%, avg=1291.45, stdev=38.12, samples=960
   iops        : min=275929, max=647897, avg=330606.98, stdev=9759.31, samples=960
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=4.43%
  lat (usec)   : 500=2.36%, 750=2.75%, 1000=5.40%
  lat (msec)   : 2=68.41%, 4=16.63%, 10=0.02%
  cpu          : usr=10.78%, sys=38.14%, ctx=4668658, majf=0, minf=932
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=9907939,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1290MiB/s (1353MB/s), 1290MiB/s-1290MiB/s (1353MB/s-1353MB/s), io=37.8GiB (40.6GB), run=30003-30003msec

Disk stats (read/write):
  nvme0n1: ios=9758021/0, merge=0/0, ticks=16205206/0, in_queue=16205206, util=99.88%

[-- Attachment #3: fio-io_uring-iodepth_32-numjobs_16_cached-node1.txt --]
[-- Type: text/plain, Size: 1848 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
fio-3.25
Starting 16 processes

nvme0: (groupid=0, jobs=16): err= 0: pid=15898: Sat Mar 25 21:17:02 2023
  read: IOPS=42.7k, BW=167MiB/s (175MB/s)(5008MiB/30014msec)
    slat (usec): min=7, max=395, avg=30.62, stdev=13.25
    clat (usec): min=170, max=27446, avg=11952.06, stdev=2938.27
     lat (usec): min=200, max=27462, avg=11983.34, stdev=2937.03
    clat percentiles (usec):
     |  1.00th=[ 2311],  5.00th=[ 6194], 10.00th=[ 7635], 20.00th=[10028],
     | 30.00th=[11469], 40.00th=[12387], 50.00th=[12911], 60.00th=[13304],
     | 70.00th=[13304], 80.00th=[13304], 90.00th=[14222], 95.00th=[15926],
     | 99.00th=[18744], 99.50th=[19530], 99.90th=[21890], 99.95th=[22938],
     | 99.99th=[25297]
   bw (  KiB/s): min=149937, max=201662, per=100.00%, avg=170916.13, stdev=605.68, samples=960
   iops        : min=37484, max=50414, avg=42728.37, stdev=151.41, samples=960
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.05%
  lat (msec)   : 2=0.73%, 4=1.25%, 10=17.50%, 20=80.14%, 50=0.35%
  cpu          : usr=2.90%, sys=11.16%, ctx=632800, majf=0, minf=931
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1281560,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=167MiB/s (175MB/s), 167MiB/s-167MiB/s (175MB/s-175MB/s), io=5008MiB (5251MB), run=30014-30014msec

Disk stats (read/write):
  nvme0n1: ios=1495933/0, merge=0/0, ticks=17792295/0, in_queue=17792295, util=99.91%

[-- Attachment #4: fio-io_uring-iodepth_32-numjobs_1_cached-node0.txt --]
[-- Type: text/plain, Size: 1782 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
fio-3.25
Starting 1 process

nvme0: (groupid=0, jobs=1): err= 0: pid=13491: Sat Mar 25 20:56:30 2023
  read: IOPS=80.5k, BW=314MiB/s (330MB/s)(9431MiB/30001msec)
    slat (usec): min=5, max=184, avg=10.63, stdev= 5.08
    clat (usec): min=114, max=1024, avg=385.48, stdev=31.45
     lat (usec): min=125, max=1031, avg=396.33, stdev=31.59
    clat percentiles (usec):
     |  1.00th=[  330],  5.00th=[  343], 10.00th=[  351], 20.00th=[  363],
     | 30.00th=[  371], 40.00th=[  375], 50.00th=[  383], 60.00th=[  388],
     | 70.00th=[  396], 80.00th=[  408], 90.00th=[  424], 95.00th=[  441],
     | 99.00th=[  490], 99.50th=[  510], 99.90th=[  570], 99.95th=[  611],
     | 99.99th=[  701]
   bw (  KiB/s): min=317616, max=325771, per=100.00%, avg=322393.75, stdev=1897.32, samples=60
   iops        : min=79404, max=81440, avg=80598.30, stdev=474.27, samples=60
  lat (usec)   : 250=0.01%, 500=99.32%, 750=0.67%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=16.53%, sys=83.37%, ctx=2371, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2414344,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=314MiB/s (330MB/s), 314MiB/s-314MiB/s (330MB/s-330MB/s), io=9431MiB (9889MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=2783897/0, merge=0/0, ticks=227688/0, in_queue=227688, util=99.87%

[-- Attachment #5: fio-io_uring-iodepth_32-numjobs_1_cached-node1.txt --]
[-- Type: text/plain, Size: 1787 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
fio-3.25
Starting 1 process

nvme0: (groupid=0, jobs=1): err= 0: pid=13559: Sat Mar 25 20:57:08 2023
  read: IOPS=69.7k, BW=272MiB/s (285MB/s)(8166MiB/30001msec)
    slat (usec): min=5, max=184, avg=11.64, stdev= 5.99
    clat (usec): min=122, max=1266, avg=445.95, stdev=80.88
     lat (usec): min=145, max=1278, avg=457.85, stdev=80.70
    clat percentiles (usec):
     |  1.00th=[  359],  5.00th=[  379], 10.00th=[  388], 20.00th=[  400],
     | 30.00th=[  408], 40.00th=[  416], 50.00th=[  420], 60.00th=[  433],
     | 70.00th=[  445], 80.00th=[  465], 90.00th=[  523], 95.00th=[  685],
     | 99.00th=[  717], 99.50th=[  750], 99.90th=[  857], 99.95th=[  906],
     | 99.99th=[ 1012]
   bw (  KiB/s): min=244000, max=289154, per=100.00%, avg=279096.80, stdev=10540.54, samples=59
   iops        : min=61000, max=72286, avg=69773.97, stdev=2635.18, samples=59
  lat (usec)   : 250=0.02%, 500=86.92%, 750=12.60%, 1000=0.45%
  lat (msec)   : 2=0.01%
  cpu          : usr=14.47%, sys=83.01%, ctx=110446, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2090351,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s (285MB/s-285MB/s), io=8166MiB (8562MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=2417164/0, merge=0/0, ticks=324253/0, in_queue=324253, util=99.86%

[-- Attachment #6: fio-libaio-iodepth_32-numjobs_16_cached-node0.txt --]
[-- Type: text/plain, Size: 1876 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 16 processes

nvme0: (groupid=0, jobs=16): err= 0: pid=7354: Sat Mar 25 02:39:08 2023
  read: IOPS=143k, BW=558MiB/s (585MB/s)(16.3GiB/30002msec)
    slat (usec): min=2, max=494, avg=107.63, stdev=38.76
    clat (usec): min=3, max=5129, avg=3474.62, stdev=718.16
     lat (usec): min=89, max=5239, avg=3582.75, stdev=740.04
    clat percentiles (usec):
     |  1.00th=[  139],  5.00th=[ 3195], 10.00th=[ 3326], 20.00th=[ 3425],
     | 30.00th=[ 3490], 40.00th=[ 3556], 50.00th=[ 3589], 60.00th=[ 3654],
     | 70.00th=[ 3720], 80.00th=[ 3785], 90.00th=[ 3884], 95.00th=[ 3949],
     | 99.00th=[ 4146], 99.50th=[ 4228], 99.90th=[ 4359], 99.95th=[ 4424],
     | 99.99th=[ 4555]
   bw (  KiB/s): min=537617, max=1295837, per=100.00%, avg=572538.79, stdev=30952.47, samples=945
   iops        : min=134400, max=323956, avg=143130.77, stdev=7738.13, samples=945
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=4.15%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=92.29%, 10=3.57%
  cpu          : usr=5.04%, sys=14.48%, ctx=4107578, majf=0, minf=932
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=4284741,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=558MiB/s (585MB/s), 558MiB/s-558MiB/s (585MB/s-585MB/s), io=16.3GiB (17.6GB), run=30002-30002msec

Disk stats (read/write):
  nvme0n1: ios=4661495/0, merge=0/0, ticks=444782/0, in_queue=444782, util=99.84%

[-- Attachment #7: fio-libaio-iodepth_32-numjobs_16_cached-node1.txt --]
[-- Type: text/plain, Size: 1870 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 16 processes

nvme0: (groupid=0, jobs=16): err= 0: pid=7271: Sat Mar 25 02:36:41 2023
  read: IOPS=135k, BW=528MiB/s (554MB/s)(15.5GiB/30001msec)
    slat (usec): min=2, max=668, avg=113.97, stdev=48.67
    clat (usec): min=4, max=10041, avg=3670.72, stdev=868.33
     lat (usec): min=90, max=10163, avg=3785.19, stdev=892.89
    clat percentiles (usec):
     |  1.00th=[  165],  5.00th=[ 3294], 10.00th=[ 3425], 20.00th=[ 3523],
     | 30.00th=[ 3621], 40.00th=[ 3654], 50.00th=[ 3720], 60.00th=[ 3785],
     | 70.00th=[ 3851], 80.00th=[ 3982], 90.00th=[ 4178], 95.00th=[ 4555],
     | 99.00th=[ 5669], 99.50th=[ 6325], 99.90th=[ 7767], 99.95th=[ 8455],
     | 99.99th=[ 9372]
   bw (  KiB/s): min=490230, max=1224562, per=100.00%, avg=541736.83, stdev=28522.06, samples=944
   iops        : min=122554, max=306131, avg=135429.10, stdev=7130.52, samples=944
  lat (usec)   : 10=0.01%, 100=0.01%, 250=4.39%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=77.74%, 10=17.87%, 20=0.01%
  cpu          : usr=4.63%, sys=15.84%, ctx=3878084, majf=0, minf=935
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=4055689,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=528MiB/s (554MB/s), 528MiB/s-528MiB/s (554MB/s-554MB/s), io=15.5GiB (16.6GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=4409225/0, merge=0/0, ticks=443138/0, in_queue=443138, util=99.84%

[-- Attachment #8: fio-libaio-iodepth_32-numjobs_1_cached-node0.txt --]
[-- Type: text/plain, Size: 1828 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.25
Starting 1 process

nvme0: (groupid=0, jobs=1): err= 0: pid=12162: Sat Mar 25 20:01:31 2023
  read: IOPS=11.6k, BW=45.3MiB/s (47.5MB/s)(1360MiB/30001msec)
    slat (usec): min=62, max=233, avg=83.01, stdev=10.20
    clat (usec): min=4, max=3633, avg=2672.83, stdev=83.73
     lat (usec): min=96, max=3717, avg=2756.19, stdev=84.99
    clat percentiles (usec):
     |  1.00th=[ 2540],  5.00th=[ 2573], 10.00th=[ 2606], 20.00th=[ 2606],
     | 30.00th=[ 2638], 40.00th=[ 2638], 50.00th=[ 2671], 60.00th=[ 2671],
     | 70.00th=[ 2704], 80.00th=[ 2704], 90.00th=[ 2737], 95.00th=[ 2802],
     | 99.00th=[ 2999], 99.50th=[ 3097], 99.90th=[ 3425], 99.95th=[ 3458],
     | 99.99th=[ 3589]
   bw (  KiB/s): min=44961, max=46781, per=100.00%, avg=46463.59, stdev=377.86, samples=59
   iops        : min=11240, max=11695, avg=11615.75, stdev=94.47, samples=59
  lat (usec)   : 10=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=100.00%
  cpu          : usr=4.90%, sys=14.71%, ctx=348133, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=348120,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=45.3MiB/s (47.5MB/s), 45.3MiB/s-45.3MiB/s (47.5MB/s-47.5MB/s), io=1360MiB (1426MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=405265/0, merge=0/0, ticks=28620/0, in_queue=28620, util=99.83%

[-- Attachment #9: fio-libaio-iodepth_32-numjobs_1_cached-node1.txt --]
[-- Type: text/plain, Size: 1801 bytes --]

nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.25
Starting 1 process

nvme0: (groupid=0, jobs=1): err= 0: pid=12230: Sat Mar 25 20:02:07 2023
  read: IOPS=11.3k, BW=44.1MiB/s (46.2MB/s)(1322MiB/30001msec)
    slat (usec): min=64, max=478, avg=85.49, stdev=10.13
    clat (usec): min=5, max=3794, avg=2750.26, stdev=81.49
     lat (usec): min=109, max=3871, avg=2836.09, stdev=82.66
    clat percentiles (usec):
     |  1.00th=[ 2638],  5.00th=[ 2671], 10.00th=[ 2671], 20.00th=[ 2704],
     | 30.00th=[ 2704], 40.00th=[ 2737], 50.00th=[ 2737], 60.00th=[ 2737],
     | 70.00th=[ 2769], 80.00th=[ 2802], 90.00th=[ 2835], 95.00th=[ 2900],
     | 99.00th=[ 3064], 99.50th=[ 3130], 99.90th=[ 3359], 99.95th=[ 3458],
     | 99.99th=[ 3621]
   bw (  KiB/s): min=44793, max=45592, per=100.00%, avg=45152.73, stdev=156.54, samples=59
   iops        : min=11198, max=11398, avg=11287.93, stdev=39.13, samples=59
  lat (usec)   : 10=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=100.00%
  cpu          : usr=4.75%, sys=15.33%, ctx=338322, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=338317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=44.1MiB/s (46.2MB/s), 44.1MiB/s-44.1MiB/s (46.2MB/s-46.2MB/s), io=1322MiB (1386MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=394292/0, merge=0/0, ticks=28596/0, in_queue=28596, util=99.83%

