Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.

* Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
       [not found] <ZB1JgJ2DxyTMVUHB@hornet>
@ 2023-03-24  8:43 ` Damien Le Moal
  2023-03-24 21:19   ` Alexander Shumakovitch
  2023-03-25  0:33   ` Alexander Shumakovitch
  2023-03-24 19:34 ` Keith Busch
  1 sibling, 2 replies; 8+ messages in thread
From: Damien Le Moal @ 2023-03-24  8:43 UTC (permalink / raw)
  To: Alexander Shumakovitch, linux-nvme

On 3/24/23 15:56, Alexander Shumakovitch wrote:
> [ please copy me on your replies since I'm not subscribed to this list ]
> 
> Hello all,
> 
> I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> speed from this SSD differs by a factor of 10 (ten!), depending on which
> physical CPU hdparm or dd is run on:
>       
>     # hdparm -t /dev/nvme0n1 

It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
benchmark an nvme device. At the very least, if you really want to measure the
drive performance, you should add the --direct option (see man hdparm).

But a better way to test would be to use fio with io_uring or libaio IO engine
doing multi-job & high QD --direct=1 IOs. That will give you the maximum
performance of your device. Then remove the --direct=1 option to do buffered
IOs, which will expose potential issues with your system memory bandwidth.

>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 510 MB in  3.01 seconds = 169.28 MB/sec
>     
>     # taskset -c 0-7 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 5252 MB in  3.00 seconds = 1750.28 MB/sec
>     
>     # taskset -c 8-15 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 496 MB in  3.01 seconds = 164.83 MB/sec
>     
>     # taskset -c 24-31 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 520 MB in  3.01 seconds = 172.65 MB/sec
> 
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached). Please note that
> I'm not worrying about the fine tuning of the performance at this point,
> and in particular I'm perfectly fine with 1/2 of the theoretical reading
> speed. I just want to understand where 90% of the bandwidth gets lost.
> No error of any kind appears in the syslog.
> 
> I don't think this is NUMA related since the QPI interconnect runs as
> specced at 4GB/sec, when measured by Intel's Memory Latency Checker, more
> than enough for NVMe to run at full speed. Also, the CUDA benchmark test
> runs at expected speeds across the QPI.
> 
> Just in case, I'm attaching the output of lstopo to this message. Please
> note that this computer has a BIOS bug that doesn't let kernel populate
> the values of numa_node in /sys/devices/pci0000:* automatically, so I have
> to do this myself after each boot.
> 
> I've tried removing all other PCI add-on cards, moving the SSD to another
> slot, changing the number of polling queues for the nvme driver, and even
> setting dm-multipath up. But none of these makes any material difference
> in reading speed.
> 
> System info: Debian 11.6 (stable) running Linux 5.19.11 (config file attached)
> Output of "nvme list":
> 
>     Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
>     ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
>     /dev/nvme0n1     S58SNS0R705048H      Samsung SSD 970 EVO Plus 500GB           1           0.00   B / 500.11  GB    512   B +  0 B   2B2QEXM7
> 
> Output of "nvme list-subsys"":
> 
>     nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:144d144dS58SNS0R705048H     Samsung SSD 970 EVO Plus 500GB          
>     \
>      +- nvme0 pcie 0000:03:00.0 live 
> 
> I would be grateful if you could point me in the right direction. I'm
> attaching outputs of the following commands to this message: dmesg,
> "cat /proc/cpuinfo", "ls -vvv", lstopo, and dd (both for reading from
> and writing to this SSD). Please let me know if you need any other info
> from me.
> 
> Thank you,
> 
>    Alex Shumakovitch

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread