* Question: t/io_uring performance
@ 2021-08-25 15:57 Hans-Peter Lehmann
  2021-08-26  7:27 ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-08-25 15:57 UTC (permalink / raw)
  To: fio


Hello,

I am currently trying to run the t/io_uring benchmark but I am unable to achieve the IOPS that I would expect. In 2019, Axboe achieved 1.6M IOPS [3] or 1.7M IOPS [1] using a single CPU core (4k random reads). On my machine (AMD EPYC 7702P, 2x Intel P4510 NVMe SSD, separate 3rd SSD for the OS), I can't even get close to those numbers.

Each of my SSDs can handle about 560k IOPS when running t/io_uring. Now, when I launch the benchmark with both SSDs, I still only get about 580k IOPS, of which each SSD gets about 300k IOPS. When I launch two separate t/io_uring instances, I get the full 560k IOPS on each device. To me, this sounds like the benchmark is CPU bound. Given that the CPU is quite decent, I am surprised that I only get half of the single-threaded IOPS that my SSDs could handle (and 1/3 of what Axboe got).

I am limited to using Linux 5.4.0 (Ubuntu 20.04) currently but the numbers from Axboe above are from 2019, when 5.4 was released. So while I don't expect to achieve insane numbers like Axboe in a more recent measurement [4], 580k seems way less than it should be. Does someone have an idea what could cause this significant difference? You can find some more measurement outputs below, for reference.

Best regards
Hans-Peter Lehmann

= Measurements =

Performance:
# t/io_uring -b 4096 /dev/nvme0n1 /dev/nvme1n1
i 3, argc 5
Added file /dev/nvme0n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f9643d92000
sqes ptr    = 0x0x7f9643d90000
cq_ring ptr = 0x0x7f9643d8e000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=1207502
IOPS=578400, IOS/call=32/31, inflight=102 (64, 38)
IOPS=582784, IOS/call=32/32, inflight=95 (31, 64)
IOPS=583040, IOS/call=32/31, inflight=125 (61, 64)
IOPS=584665, IOS/call=31/32, inflight=114 (64, 50)

Scheduler for both SSDs disabled:
# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

Most time is spent in the kernel:
# time t/io_uring -b 4096 /dev/nvme0n1 /dev/nvme1n1
[...]
real    0m8.770s
user    0m0.156s
sys     0m8.514s

Call graph:
# perf report
- 93.90% io_ring_submit
   - [...]
     - 75.32% io_read
         - 67.13% blkdev_read_iter
           - 65.65% generic_file_read_iter
               - 63.20% blkdev_direct_IO
                 - 61.17% __blkdev_direct_IO
                     - 45.49% submit_bio
                       - 43.95% generic_make_request
                           - 33.30% blk_mq_make_request
                             + 8.52% blk_mq_get_request
                             + 8.02% blk_attempt_plug_merge
                             + 5.80% blk_flush_plug_list
                             + 1.48% __blk_queue_split
                             + 1.14% __blk_mq_sched_bio_merge
                             + [...]
                           + 7.90% generic_make_request_checks
                         0.62% blk_mq_make_request
                     + 8.50% bio_alloc_bioset

= References =

[1]: https://kernel.dk/io_uring.pdf
[2]: https://github.com/axboe/fio/issues/579#issuecomment-384345234
[3]: https://twitter.com/axboe/status/1174777844313911296
[4]: https://lore.kernel.org/io-uring/4af91b50-4a9c-8a16-9470-a51430bd7733@kernel.dk/T/#u



* Re: Question: t/io_uring performance
  2021-08-25 15:57 Question: t/io_uring performance Hans-Peter Lehmann
@ 2021-08-26  7:27 ` Erwan Velu
  2021-08-26 15:57   ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-08-26  7:27 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio


On 25/08/2021 at 17:57, Hans-Peter Lehmann wrote:
>
> Hello,
>
> I am currently trying to run the t/io_uring benchmark but I am unable 
> to achieve the IOPS that I would expect. In 2019, Axboe achieved 1.6M 
> IOPS [3] or 1.7M IOPS [1] using a single CPU core (4k random reads). 
> On my machine (AMD EPYC 7702P, 2x Intel P4510 NVMe SSD, separate 3rd 
> SSD for the OS), I can't even get close to those numbers.
>
> Each of my SSDs can handle about 560k IOPS when running t/io_uring. 
> Now, when I launch the benchmark with both SSDs, I still only get 
> about 580k IOPS, of which each SSD gets about 300k IOPS. When I 
> launch two separate t/io_uring instances, I get the full 560k IOPS on 
> each device. To me, this sounds like the benchmark is CPU bound. Given 
> that the CPU is quite decent, I am surprised that I only get half of 
> the single-threaded IOPS that my SSDs could handle (and 1/3 of what 
> Axboe got).
>
A few considerations here about your hardware.

You didn't mention the size of your P4510, and that's important as it 
strongly defines the max you can achieve on this SSD. The 1TB model is 
limited to 465K random reads, while the larger sizes reach nearly 640K.

These numbers are given for a QD of 64 with 4 workers.

So either way, don't expect to reach what Jens did ;)


Did you check how your NVMes are connected via their PCIe lanes?

It's obvious here that you need multiple PCIe Gen3 x4 links to reach 
1.6M IOPS (I'd say two).

So if your disks are sharing the same link, then you'll have no chance 
of getting more than a single PCIe Gen3 x4 link's worth, even with 2 NVMes.


Then, considering the EPYC processor, what's your current NUMA 
configuration? Are you NPS=1? 2? 4? (lscpu would give the answer)
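
For example (nvme0 is just an example controller name):

# lscpu | grep -i numa
# cat /sys/class/nvme/nvme0/device/numa_node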

If you want to run a single-core benchmark, you should also check how 
the IRQs are pinned across the cores and NUMA domains (even if it's a 
single-socket CPU).
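
A rough sketch of how to check and pin them (the IRQ number 65 is only 
an example, take it from /proc/interrupts; kernel-managed IRQs may 
refuse the write):

# grep nvme /proc/interrupts
# cat /proc/irq/65/smp_affinity_list
# echo 0 > /proc/irq/65/smp_affinity_list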

Depending on your server vendor, you should also consider tweaking the 
BIOS if you want to get the most out of it. I'm especially thinking of 
the DRAM & IO-die power management, which are often set to 
powersaving/dynamic even if the CPU governor is set to performance. This 
could influence the final result, but that's not your main problem here.





* Re: Question: t/io_uring performance
  2021-08-26  7:27 ` Erwan Velu
@ 2021-08-26 15:57   ` Hans-Peter Lehmann
  2021-08-27  7:20     ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-08-26 15:57 UTC (permalink / raw)
  To: Erwan Velu, fio


Thank you very much for your reply.

> You didn't mention the size of your P4510

Sorry, the P4510 SSDs each have 2 TB.

> Did you check how your NVMes are connected via their PCIe lanes? It's obvious here that you need multiple PCIe Gen3 x4 links to reach 1.6M IOPS (I'd say two).

If I understand the lspci output (listed below) correctly, the SSDs are connected directly to the same PCIe root complex, each of them getting their maximum of x4 lanes. Given that I can saturate the SSDs when using 2 t/io_uring instances, I think the hardware-side connection should not be the limitation - or am I missing something?

> Then considering the EPYC processor, what's your current Numa configuration? 

The processor was configured to use a single NUMA node (NPS=1). I just tried to switch to NPS=4 and ran the benchmark on a core belonging to the SSDs' NUMA node (using numactl). It brought the IOPS from 580k to 590k. That's still nowhere near the values that Jens got.
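
(For reference, the pinned run looked roughly like this - core 48 is 
just an example of a core in the SSDs' NUMA node:)

# numactl -C 48 t/io_uring -b 4096 /dev/nvme0n1 /dev/nvme1n1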

> If you want to run a single-core benchmark, you should also check how the IRQs are pinned across the cores and NUMA domains (even if it's a single-socket CPU).

Is IRQ pinning the "big thing" that will double the IOPS? To me, it sounds like there must be something else that is wrong. I will definitely try it, though.


= Details =

# lspci -tv
-+-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
  |           +- [...]
  +-[0000:80]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
  |           +- [...]
  +-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
  |           +- [...]
  \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
              +- [...]
              +-03.1-[01]----00.0  Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
              +-03.2-[02]----00.0  Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]

# lspci -vv
01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
         Subsystem: Intel Corporation NVMe Datacenter SSD [3DNAND] SE 2.5" U.2 (P4510)
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 65
         NUMA node: 0
         [...]
         Capabilities: [60] Express (v2) Endpoint, MSI 00
                 LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                 [...]
02:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
         Subsystem: Intel Corporation NVMe Datacenter SSD [3DNAND] SE 2.5" U.2 (P4510)
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 67
         NUMA node: 0
         [...]
         Capabilities: [60] Express (v2) Endpoint, MSI 00
                 LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                 [...]



* Re: Question: t/io_uring performance
  2021-08-26 15:57   ` Hans-Peter Lehmann
@ 2021-08-27  7:20     ` Erwan Velu
  2021-09-01 10:36       ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-08-27  7:20 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio


On 26/08/2021 at 17:57, Hans-Peter Lehmann wrote:
>
> [...]
> Sorry, the P4510 SSDs each have 2 TB.

Ok so we could expect 640K each.

Please note that Jens was using Optane disks, which have a lower latency 
than a P4510, but this doesn't explain your issue.

>
>> Did you check how your NVMes are connected via their PCIe lanes? 
>> It's obvious here that you need multiple PCIe Gen3 x4 links to reach 
>> 1.6M IOPS (I'd say two).
>
> If I understand the lspci output (listed below) correctly, the SSDs 
> are connected directly to the same PCIe root complex, each of them 
> getting their maximum of x4 lanes. Given that I can saturate the SSDs 
> when using 2 t/io_uring instances, I think the hardware-side 
> connection should not be the limitation - or am I missing something?

You are right but this question was important to sort out to ensure your 
setup was compatible with your expectations.


>
>> Then considering the EPYC processor, what's your current Numa 
>> configuration? 
>
> The processor was configured to use one single Numa node (NPS=1). I 
> just tried to switch to NPS=4 and ran the benchmark on a core 
> belonging to the SSDs' Numa node (using numactl). It brought the IOPS 
> from 580k to 590k. That's still nowhere near the values that Jens got.
>
>> If you want to run a single core benchmark, you should also ensure 
>> how the IRQs are pinned over the Cores and NUMA domains (even if it's 
>> a single socket CPU).
>
> Is IRQ pinning the "big thing" that will double the IOPS? To me, it 
> sounds like there must be something else that is wrong. I will 
> definitely try it, though.

I didn't say it was the big thing, I said it should be considered as 
part of a full optimization ;)


Stupid question: what if you run two benchmarks, one per disk?




* Re: Question: t/io_uring performance
  2021-08-27  7:20     ` Erwan Velu
@ 2021-09-01 10:36       ` Hans-Peter Lehmann
  2021-09-01 13:17         ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-01 10:36 UTC (permalink / raw)
  To: Erwan Velu, fio

Sorry for the late reply.

> Stupid question: what if you run two benchmarks, one per disk?

I did a few measurements with different configurations below. (The numbers come from "iostat -hy 1 1" because t/io_uring only shows per-process numbers. The iostat numbers are the same as what t/io_uring shows when running only one instance.)

Single t/io_uring process with one disk
==> 570k IOPS total (SSD1 = 570k IOPS, SSD2 = 0 IOPS)
Single t/io_uring process with both disks
==> 570k IOPS total (SSD1 = 290k IOPS, SSD2 = 280k IOPS)

Two t/io_uring processes, both on the same disk
==> 785k IOPS total (SSD1 = 785k IOPS, SSD2 = 0 IOPS)
Two t/io_uring processes, each on both disks
==> 1135k IOPS total (SSD1 = 570k IOPS, SSD2 = 565k IOPS)
Two t/io_uring processes, one per disk
==> 1130k IOPS total (SSD1 = 565k IOPS, SSD2 = 565k IOPS)

Three t/io_uring processes, each on both disks
==> 1570k IOPS total (SSD1 = 785k IOPS, SSD2 = 785k IOPS)
Four t/io_uring processes, each on both disks
==> 1570k IOPS total (SSD1 = 785k IOPS, SSD2 = 785k IOPS)

So apparently, I need at least 3 cores to fully saturate the SSDs, while Jens can get similar total IOPS using only a single core. I couldn't find details about Jens' processor frequency but I would be surprised if he had ~3 times the frequency of ours (2.0 GHz base, 3.2 GHz boost).

> If you want to run a single-core benchmark, you should also check how the IRQs are pinned across the cores and NUMA domains (even if it's a single-socket CPU). 

I pinned the interrupts of nvme0q0 and nvme1q0 to the core that runs t/io_uring, but that does not change the IOPS. Assigning the other nvme-related interrupts (like nvme1q42, listed in /proc/interrupts) fails. I think that happens because the kernel uses IRQD_AFFINITY_MANAGED and I would need to recompile the kernel to change that. t/io_uring uses polled IO by default, so are the interrupts actually relevant in that case?

As a next step I will try upgrading the kernel, after all (even though I hoped to be able to reproduce Jens' measurements with the same kernel).

Thanks again
Hans-Peter Lehmann

On 27.08.21 at 09:20, Erwan Velu wrote:
> 
> On 26/08/2021 at 17:57, Hans-Peter Lehmann wrote:
>>
>> [...]
>> Sorry, the P4510 SSDs each have 2 TB.
> 
> Ok so we could expect 640K each.
> 
> Please note that jens was using optane disks that have a lower latency than a 4510 but this doesn't explain your issue.
> 
>>
>>> Did you checked how your NVMEs are connected via their PCI lanes? It's obvious here that you need multiple PCI-GEN3 lanes to reach 1.6M IOPS (I'd say two).
>>
>> If I understand the lspci output (listed below) correctly, the SSDs are connected directly to the same PCIe root complex, each of them getting their maximum of x4 lanes. Given that I can saturate the SSDs when using 2 t/io_uring instances, I think the hardware-side connection should not be the limitation - or am I missing something?
> 
> You are right but this question was important to sort out to ensure your setup was compatible with your expectations.
> 
> 
>>
>>> Then considering the EPYC processor, what's your current Numa configuration? 
>>
>> The processor was configured to use one single Numa node (NPS=1). I just tried to switch to NPS=4 and ran the benchmark on a core belonging to the SSDs' Numa node (using numactl). It brought the IOPS from 580k to 590k. That's still nowhere near the values that Jens got.
>>
>>> If you want to run a single core benchmark, you should also ensure how the IRQs are pinned over the Cores and NUMA domains (even if it's a single socket CPU).
>>
>> Is IRQ pinning the "big thing" that will double the IOPS? To me, it sounds like there must be something else that is wrong. I will definitely try it, though.
> 
> I didn't say it was the big thing, said it was to be considered to do a full optmization ;)
> 
> 
> Stupid question : what if you run two benchmarks, one per disk ?
> 



* Re: Question: t/io_uring performance
  2021-09-01 10:36       ` Hans-Peter Lehmann
@ 2021-09-01 13:17         ` Erwan Velu
  2021-09-01 14:02           ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-01 13:17 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio

On 01/09/2021 at 12:36, Hans-Peter Lehmann wrote:
> I couldn't find details about Jens' processor frequency but I would be 
> surprised if he had ~3 times the frequency of ours (2.0 GHz base, 3.2 
> GHz boost). 


Can you check how the processor performance behaves during the run?

This would help ensure the processor clocks up properly during your run.

To make this easier, you can use the exec module I committed recently in 
fio: https://github.com/axboe/fio/blob/master/examples/exec.fio

There is an example there of how to dump the CPU performance in parallel 
with a particular workload.

If you don't want to use this, you can just run the turbostat command 
line I'm sharing in that example.


Cheers,

Erwan,




* Re: Question: t/io_uring performance
  2021-09-01 13:17         ` Erwan Velu
@ 2021-09-01 14:02           ` Hans-Peter Lehmann
  2021-09-01 14:05             ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-01 14:02 UTC (permalink / raw)
  To: Erwan Velu, fio

> Can you check how the processor performance behaves during the run?

I just executed turbostat manually, like in your example config. It looks like the processor performance works as expected - the output clearly shows the point where I started/stopped a single-threaded t/io_uring run. The machine has 128 logical cores, so a Busy% of 0.78 is exactly one core.

# turbostat -c package -qS --interval 5 -s Busy%,Bzy_MHz,Avg_MHz,CorWatt,PkgWatt,RAMWatt,PkgTmp
Avg_MHz Busy%   Bzy_MHz CorWatt PkgWatt
0       0.01    1500    0.01    68.91
0       0.01    1641    0.01    68.90
0       0.01    1621    0.01    68.91
0       0.03    1561    0.03    68.91
0       0.03    1547    0.02    68.93
0       0.01    1500    0.01    68.91
0       0.01    1657    0.01    68.89
26      0.78    3322    3.58    74.58
26      0.79    3328    3.63    74.72
26      0.79    3331    3.63    74.79
26      0.79    3321    3.66    74.87
26      0.79    3329    3.64    74.88
26      0.79    3328    3.65    74.92
26      0.79    3334    3.62    74.92
26      0.79    3329    3.62    74.94
23      0.69    3331    3.17    74.25
0       0.01    1664    0.00    69.12
0       0.01    1501    0.01    69.12
0       0.02    1566    0.01    69.10
0       0.00    1499    0.00    69.07
1       0.06    1558    0.04    69.10

Best regards,
Hans-Peter



* Re: Question: t/io_uring performance
  2021-09-01 14:02           ` Hans-Peter Lehmann
@ 2021-09-01 14:05             ` Erwan Velu
  2021-09-01 14:17               ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-01 14:05 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio


On 01/09/2021 at 16:02, Hans-Peter Lehmann wrote:
>> Can you check how the processor performance goes during the run ?
>
> I just executed turbostat manually, like it was in your config file. 
> Looks like the processor performance works as expected - the output 
> clearly shows the point where I started/stopped a single-threaded 
> t/io_uring run. The machine has 128 cores, so a Busy% of 0.78 is 
> exactly one core.
>
> # turbostat -c package -qS --interval 5 -s 
> Busy%,Bzy_MHz,Avg_MHz,CorWatt,PkgWatt,RAMWatt,PkgTmp
> Avg_MHz Busy%   Bzy_MHz CorWatt PkgWatt
> 0       0.01    1500    0.01    68.91
> 0       0.01    1641    0.01    68.90
> 0       0.01    1621    0.01    68.91
> 0       0.03    1561    0.03    68.91
> 0       0.03    1547    0.02    68.93
> 0       0.01    1500    0.01    68.91
> 0       0.01    1657    0.01    68.89
> 26      0.78    3322    3.58    74.58
> 26      0.79    3328    3.63    74.72
> 26      0.79    3331    3.63    74.79
> 26      0.79    3321    3.66    74.87
> 26      0.79    3329    3.64    74.88
> 26      0.79    3328    3.65    74.92
> 26      0.79    3334    3.62    74.92
> 26      0.79    3329    3.62    74.94
> 23      0.69    3331    3.17    74.25
> 0       0.01    1664    0.00    69.12
> 0       0.01    1501    0.01    69.12
> 0       0.02    1566    0.01    69.10
> 0       0.00    1499    0.00    69.07
> 1       0.06    1558    0.04    69.10
>

These numbers are perfect.

I rebuilt fio on a 5.4 kernel with a high-end NVMe that is capable of 1M 
read IOPS and got stuck at the same number as you, i.e. 580K IOPS.

So I'd agree with you, something is curious here.




* Re: Question: t/io_uring performance
  2021-09-01 14:05             ` Erwan Velu
@ 2021-09-01 14:17               ` Erwan Velu
  2021-09-06 14:26                 ` Hans-Peter Lehmann
  2021-09-08 12:33                 ` Jens Axboe
  0 siblings, 2 replies; 26+ messages in thread
From: Erwan Velu @ 2021-09-01 14:17 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio


On 01/09/2021 at 16:05, Erwan Velu wrote:
>
> [...]
> These numbers are perfect.
>
> I rebuild fio on a 5.4 kernel on a high-end nvme which is 1M iops read 
> capable and got stuck at the same number as you aka 580K iops.
>
> So, I'd agree with you something is curious here.
>
Same on a 5.10.35.

Jens, did we miss something here?

Erwan,




* Re: Question: t/io_uring performance
  2021-09-01 14:17               ` Erwan Velu
@ 2021-09-06 14:26                 ` Hans-Peter Lehmann
  2021-09-06 14:41                   ` Erwan Velu
  2021-09-08 11:53                   ` Sitsofe Wheeler
  2021-09-08 12:33                 ` Jens Axboe
  1 sibling, 2 replies; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-06 14:26 UTC (permalink / raw)
  To: Erwan Velu, fio

Hi Jens,

not sure if you have read the emails in this thread - now trying to address you directly. Both Erwan and I are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottleneck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?

Best regards
Hans-Peter

(I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)


* Re: Question: t/io_uring performance
  2021-09-06 14:26                 ` Hans-Peter Lehmann
@ 2021-09-06 14:41                   ` Erwan Velu
  2021-09-08 11:53                   ` Sitsofe Wheeler
  1 sibling, 0 replies; 26+ messages in thread
From: Erwan Velu @ 2021-09-06 14:41 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio


On 06/09/2021 at 16:26, Hans-Peter Lehmann wrote:
> Hi Jens,
>
> not sure if you have read the emails in this thread - now trying to 
> address you directly. Both Erwan and me are unable to reproduce your 
> single-threaded IOPS measurements - we don't even get close to your 
> numbers. The bottle-neck seems to be the CPU, not the SSDs. Did you 
> use some special configuration for your benchmarks?
>
> Best regards
> Hans-Peter
>
> (I have also reproduced the behavior with an Intel processor now - the 
> single-threaded throughput is also capped at around 580k IOPS, even 
> though the SSDs can handle more than that when using multiple threads)


I wonder if that's not linked to the Optane gen2 setup.



* Re: Question: t/io_uring performance
  2021-09-06 14:26                 ` Hans-Peter Lehmann
  2021-09-06 14:41                   ` Erwan Velu
@ 2021-09-08 11:53                   ` Sitsofe Wheeler
  2021-09-08 12:22                     ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Sitsofe Wheeler @ 2021-09-08 11:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Erwan Velu, fio, Hans-Peter Lehmann

(CC'ing Jens directly in case he missed the previous messages)

On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
<hans-peter.lehmann@kit.edu> wrote:
>
> Hi Jens,
>
> not sure if you have read the emails in this thread - now trying to address you directly. Both Erwan and me are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottle-neck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>
> Best regards
> Hans-Peter
>
> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)



-- 
Sitsofe


* Re: Question: t/io_uring performance
  2021-09-08 11:53                   ` Sitsofe Wheeler
@ 2021-09-08 12:22                     ` Jens Axboe
  2021-09-08 12:41                       ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-09-08 12:22 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Erwan Velu, fio, Hans-Peter Lehmann

[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]

On 9/8/21 5:53 AM, Sitsofe Wheeler wrote:
> (CC'ing Jens directly in case he missed the previous messages)
> 
> On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
> <hans-peter.lehmann@kit.edu> wrote:
>>
>> Hi Jens,
>>
>> not sure if you have read the emails in this thread - now trying to address you directly. Both Erwan and me are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottle-neck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>>
>> Best regards
>> Hans-Peter
>>
>> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)

Thanks for CC'ing me, I don't always see the messages otherwise. 580K is
very low, but without having access to the system and being able to run
some basic profiling, hard for me to say what you're running into. I may
miss some details in the below, so please do ask followups if things are
missing/unclear.

1) I'm using a 3970X with a desktop board + box for my peak testing,
   specs on that can be found online.

2) Yes I do run a custom configuration on my kernel, I do kernel
   development after all :-). I'm attaching the one I'm using. This
   hasn't changed in a long time. I do turn off various things that I
   don't need and some of them do impact performance.

3) The options I run t/io_uring with have been posted multiple times,
   it's this one:

   taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme3n1

   which is QD=128, 32/32 submit/complete batching, polled IO,
   registered files and buffers. Note that you'll need to configure NVMe
   to properly use polling. I use 32 poll queues, number isn't really
   that important for single core testing, as long as there's enough to
   have a poll queue local to CPU being tested on. You'll see this in
   dmesg:

   nvme nvme3: 64/0/32 default/read/poll queues

4) Make sure your nvme device is using 'none' as the IO scheduler. I
   think this is a no-brainer, but mentioning it just in case.

5) I turn off iostats and merging for the device. iostats is the most
   important, depending on the platform getting accurate time stamps can
   be expensive:

   echo 0 > /sys/block/nvme3n1/queue/iostats
   echo 2 > /sys/block/nvme3n1/queue/nomerges

6) I do no special CPU frequency tuning. It's running stock settings,
   and the system is not overclocked or anything like that.
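
Going back to point 3, a minimal sketch of the polling setup (nvme3n1 
and the Ubuntu-style initramfs rebuild are just examples; the option can 
also go on the kernel command line as nvme.poll_queues=32):

# echo "options nvme poll_queues=32" > /etc/modprobe.d/nvme-poll.conf
# update-initramfs -u && reboot
# dmesg | grep "poll queues"
# cat /sys/block/nvme3n1/queue/io_poll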

I think that's about it. The above gets me 3.5M+ per core using polled
IO and the current kernel, and around 2.3M per core if using IRQ driven
IO. Note that the current kernel is important here, we've improved
things a lot over the last year.

That said, 580K is crazy low, and I bet there's something basic that's
preventing it running faster. Is this a gen2 optane? One thing that
might be useful is to run my t/io_uring from above, it'll tell you what
the IO thread pid is:

[...]
submitter=2900332
[...]

and then run

# perf record -g -p 2900332 -- sleep 3

and afterwards do:

# perf report -g --no-children > output

and gzip the output and attach it here. With performance that low,
should be pretty trivial to figure out what is going on here.

-- 
Jens Axboe


[-- Attachment #2: amd-config.txt.gz --]
[-- Type: application/gzip, Size: 31774 bytes --]


* Re: Question: t/io_uring performance
  2021-09-01 14:17               ` Erwan Velu
  2021-09-06 14:26                 ` Hans-Peter Lehmann
@ 2021-09-08 12:33                 ` Jens Axboe
  2021-09-08 17:11                   ` Erwan Velu
  1 sibling, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-09-08 12:33 UTC (permalink / raw)
  To: Erwan Velu, Hans-Peter Lehmann, fio

On 9/1/21 8:17 AM, Erwan Velu wrote:
> 
> On 01/09/2021 at 16:05, Erwan Velu wrote:
>>
>> [...]
>> These numbers are perfect.
>>
>> I rebuild fio on a 5.4 kernel on a high-end nvme which is 1M iops read 
>> capable and got stuck at the same number as you aka 580K iops.
>>
>> So, I'd agree with you something is curious here.
>>
> Same on a 5.10.35.
> 
> Jens, did we miss something here?

Read through the actual thread. It seriously sounds like the system is
hitting bandwidth limits, which is very odd. 580K is not very much. Just
curious, does it change anything if you use 512b blocks instead?

-- 
Jens Axboe



* Re: Question: t/io_uring performance
  2021-09-08 12:22                     ` Jens Axboe
@ 2021-09-08 12:41                       ` Jens Axboe
  2021-09-08 16:12                         ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-09-08 12:41 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Erwan Velu, fio, Hans-Peter Lehmann

On 9/8/21 6:22 AM, Jens Axboe wrote:
> On 9/8/21 5:53 AM, Sitsofe Wheeler wrote:
>> (CC'ing Jens directly in case he missed the previous messages)
>>
>> On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
>> <hans-peter.lehmann@kit.edu> wrote:
>>>
>>> Hi Jens,
>>>
>>> not sure if you have read the emails in this thread - now trying to address you directly. Both Erwan and me are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottle-neck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>>>
>>> Best regards
>>> Hans-Peter
>>>
>>> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)
> 
> Thanks for CC'ing me, I don't always see the messages otherwise. 580K is
> very low, but without having access to the system and being able to run
> some basic profiling, hard for me to say what you're running into. I may
> miss some details in the below, so please do ask followups if things are
> missing/unclear.
> 
> 1) I'm using a 3970X with a desktop board + box for my peak testing,
>    specs on that can be found online.
> 
> 2) Yes I do run a custom configuration on my kernel, I do kernel
>    development after all :-). I'm attaching the one I'm using. This
>    hasn't changed in a long time. I do turn off various things that I
>    don't need and some of them do impact performance.
> 
> 3) The options I run t/io_uring with have been posted multiple times,
>    it's this one:
> 
>    taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme3n1
> 
>    which is QD=128, 32/32 submit/complete batching, polled IO,
>    registered files and buffers. Note that you'll need to configure NVMe
>    to properly use polling. I use 32 poll queues, number isn't really
>    that important for single core testing, as long as there's enough to
>    have a poll queue local to CPU being tested on. You'll see this in
>    dmesg:
> 
>    nvme nvme3: 64/0/32 default/read/poll queues
> 
> 4) Make sure your nvme device is using 'none' as the IO scheduler. I
>    think this is a no-brainer, but mentioning it just in case.
> 
> 5) I turn off iostats and merging for the device. iostats is the most
>    important, depending on the platform getting accurate time stamps can
>    be expensive:
> 
>    echo 0 > /sys/block/nvme3n1/queue/iostats
>    echo 2 > /sys/block/nvme3n1/queue/nomerges
> 
> 6) I do no special CPU frequency tuning. It's running stock settings,
>    and the system is not overclocked or anything like that.
> 
> I think that's about it. The above gets me 3.5M+ per core using polled
> IO and the current kernel, and around 2.3M per core if using IRQ driven
> IO. Note that the current kernel is important here, we've improved
> things a lot over the last year.
> 
> That said, 580K is crazy low, and I bet there's something basic that's
> preventing it running faster. Is this a gen2 optane? One thing that
> might be useful is to run my t/io_uring from above, it'll tell you what
> the IO thread pid is:
> 
> [...]
> submitter=2900332
> [...]
> 
> and then run
> 
> # perf record -g -p 2900332 -- sleep 3
> 
> and afterwards do:
> 
> # perf report -g --no-children > output
> 
> and gzip the output and attach it here. With performance that low,
> should be pretty trivial to figure out what is going on here.

Followup - the below is specific to my peak-per-core testing, for
running much lower IOPS most of it isn't going to be required. For
example, polled IO is not going to be that useful at ~500K iops.

For the original poster, and I think this was already asked, but please
run the perf as indicated and also do a run with two threads:

taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -n2 /dev/nvmeXn1 /dev/nvmeYn1

just to see what happens. t/io_uring doesn't work very well with
driving polled IO for multiple devices, it's just a simple little
IO generator, nothing advanced.

I picked CPU0 and 1 here, but depending on the number of queues on your
device, you might be more limited and you should pick something that
causes a nice spread on your setup. /sys/kernel/debug/block/<dev> will
have information on how queues are spread out.
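
For example, something like this lists which CPUs map to each hardware 
queue (debugfs must be mounted; nvme1n1 and hctx0 are just examples):

# ls /sys/kernel/debug/block/nvme1n1/
# ls -d /sys/kernel/debug/block/nvme1n1/hctx0/cpu*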

Again, not really something that should be needed at these kinds of
rates, unless the device is severely queue starved and you have a lot of
cores in your system. Regardless, strictly affinitizing the workload
helps with variance between runs and is always a good idea.

-- 
Jens Axboe



* Re: Question: t/io_uring performance
  2021-09-08 12:41                       ` Jens Axboe
@ 2021-09-08 16:12                         ` Hans-Peter Lehmann
  2021-09-08 16:20                           ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-08 16:12 UTC (permalink / raw)
  To: axboe, fio

[-- Attachment #1: Type: text/plain, Size: 3018 bytes --]

Hi Jens,

thank you for your reply. Given that you read the thread after sending your first reply, I think some of the questions from your first email are no longer relevant. I still answered them at the bottom for completeness, but I will answer the more interesting ones first.

> I turn off iostats and merging for the device.



Doing this helped quite a bit. The 512b reads went from 715K to 800K. The 4096b reads went from 570K to 630K.

> Note that you'll need to configure NVMe
> to properly use polling. I use 32 poll queues, number isn't really
> that important for single core testing, as long as there's enough to
> have a poll queue local to CPU being tested on.

My SSD was configured to use 128/0/0 default/read/poll queues. I added "nvme.poll_queues=32" to GRUB and rebooted, which changed it to 96/0/32. I now get 1.0M IOPS (512b blocks) and 790K IOPS (4096b blocks) using a single core. Thank you very much, this probably was the main bottleneck. Launching the benchmark two times with 512b blocks, I get 1.4M IOPS total.
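
(For reference, a sketch of that change on Ubuntu - the existing 
contents of the variable are just an example:)

# grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme.poll_queues=32"
# update-grub && reboot
# dmesg | grep "default/read/poll queues"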

Starting single-threaded t/io_uring with two SSDs still achieves "only" 1.0M IOPS, independently of the block size. In your benchmarks from 2019 [0] when Linux 5.4 (which I am using) was current, you achieved 1.6M IOPS (4096b blocks) using a single core. I get the full 1.6M IOPS for saturating both SSDs (4096b blocks) only when running t/io_uring with two threads. This makes me think that there is still another configuration option that I am missing. Most time is spent in the kernel.

# time taskset -c 48 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
i 8, argc 10
Added file /dev/nvme0n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f78fb740000
sqes ptr    = 0x0x7f78fb73e000
cq_ring ptr = 0x0x7f78fb73c000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=2336
IOPS=1014252, IOS/call=31/31, inflight=102 (38, 64)
IOPS=1017984, IOS/call=31/31, inflight=123 (64, 59)
IOPS=1018220, IOS/call=31/31, inflight=102 (38, 64)
[...]
real    0m7.898s
user    0m0.144s
sys     0m7.661s

I attached a perf output to the email. It was generated using the same parameters as above (getting 1.0M IOPS).

Thank you very much for your help. I am looking forward to hearing from you again so that I can fully reproduce your measurements soon.
Hans-Peter


=== Answers to (I think) no longer relevant questions ===

> The options I run t/io_uring with have been posted multiple times, it's this one

This is the same configuration that I also ran (I did not explicitly specify the parameters that are the same as the default).

> Make sure your nvme device is using 'none' as the IO scheduler.

The scheduler is set to 'none'.

> Is this a gen2 optane?

It is not an Optane disk, but I also do not expect to get insanely high numbers like in your recent benchmarks - just something closer to the old benchmarks, but using two SSDs.


=== References ===

[0]: https://twitter.com/axboe/status/1174777844313911296

[-- Attachment #2: perf-output.gz --]
[-- Type: application/gzip, Size: 2529 bytes --]


* Re: Question: t/io_uring performance
  2021-09-08 16:12                         ` Hans-Peter Lehmann
@ 2021-09-08 16:20                           ` Jens Axboe
  2021-09-08 21:24                             ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-09-08 16:20 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio

On 9/8/21 10:12 AM, Hans-Peter Lehmann wrote:
> Hi Jens,
> 
> thank you for your reply. Given that you have read the thread after the first reply, I think some of the questions of your first email are no longer relevant. I still answered them at the bottom for completeness, but I will answer the more interesting ones first.
> 
>> I turn off iostats and merging for the device.
> 
> 
> 
> Doing this helped quite a bit. The 512b reads went from 715K to 800K. The 4096b reads went from 570K to 630K.
> 
>> Note that you'll need to configure NVMe
>   to properly use polling. I use 32 poll queues, number isn't really
>   that important for single core testing, as long as there's enough to
>   have a poll queue local to CPU being tested on.
> 
> My SSD was configured to use 128/0/0 default/read/poll queues. I added
> "nvme.poll_queues=32" to GRUB and rebooted, which changed it to
> 96/0/32. I now get 1.0M IOPS (512b blocks) and 790K IOPS (4096b
> blocks) using a single core. Thank you very much, this probably was
> the main bottleneck. Launching the benchmark two times with 512b
> blocks, I get 1.4M IOPS total.

Sounds like IRQs are expensive on your box, it does vary quite a bit
between systems.

What's the advertised peak random read performance of the devices you
are using?

> Starting single-threaded t/io_uring with two SSDs still achieves "only" 1.0M IOPS, independently of the block size. In your benchmarks from 2019 [0] when Linux 5.4 (which I am using) was current, you achieved 1.6M IOPS (4096b blocks) using a single core. I get the full 1.6M IOPS for saturating both SSDs (4096b blocks) only when running t/io_uring with two threads. This makes me think that there is still another configuration option that I am missing. Most time is spent in the kernel.
> 
> # time taskset -c 48 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
> i 8, argc 10
> Added file /dev/nvme0n1 (submitter 0)
> Added file /dev/nvme1n1 (submitter 0)
> sq_ring ptr = 0x0x7f78fb740000
> sqes ptr    = 0x0x7f78fb73e000
> cq_ring ptr = 0x0x7f78fb73c000
> polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
> submitter=2336
> IOPS=1014252, IOS/call=31/31, inflight=102 (38, 64)
> IOPS=1017984, IOS/call=31/31, inflight=123 (64, 59)
> IOPS=1018220, IOS/call=31/31, inflight=102 (38, 64)
> [...]
> real    0m7.898s
> user    0m0.144s
> sys     0m7.661s
> 
> I attached a perf output to the email. It was generated using the same parameters as above (getting 1.0M IOPS).

Looking at the perf trace, it looks pretty apparent:

     7.54%  io_uring  [kernel.kallsyms]  [k] read_tsc                           

which means you're spending ~8% of the time of the workload just reading
time stamps. As is often the case once you get near core limits,
realistically that'll cut more than 8% of your perf. Did you turn off
iostats? If so, then there's a few things in the kernel config that can
cause this. One is BLK_CGROUP_IOCOST, is that enabled? Might be more if
you're still on that old kernel.
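
A quick way to check (the config path is the usual distro location, 
adjust as needed):

# grep BLK_CGROUP_IOCOST /boot/config-$(uname -r)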

Would be handy to have -g enabled for your perf record and report, since
that would show us exactly who's calling the expensive bits. The next
one is memset(), which also looks suspect. But may be related to:

https://git.kernel.dk/cgit/linux-block/commit/block/bio.c?id=da521626ac620d8719d674a48b8ec3620eefd42a

-- 
Jens Axboe



* Re: Question: t/io_uring performance
  2021-09-08 12:33                 ` Jens Axboe
@ 2021-09-08 17:11                   ` Erwan Velu
  2021-09-08 22:37                     ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-08 17:11 UTC (permalink / raw)
  To: Jens Axboe, Hans-Peter Lehmann, fio


On 08/09/2021 at 14:33, Jens Axboe wrote:
> [..]
>> Same on a 5.10.35.
>>
>> Jens, did we missed something here ?
> Read through the actual thread. It seriously sounds like the system is
> hitting bandwidth limits, which is very odd. 580K is not very much. Just
> curious, does it change anything if you use 512b blocks instead?
>
That's clearly way better!


With default kernel config:

1 device: 512b: 674K
1 device: 512b + nomerge + nostats: 745K
1 device: 512b + nomerge + nostats + nopoll: 770K
2 devices, 2 threads + nopoll: taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme3n1: 1.57M

Enabling 32 poll queues:

1 device: 512b: 884K
1 device: 512b + nostats: 915K
1 device: 512b + nomerge + nostats + nopoll: 715K
2 devices, 2 threads, nostats: taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme3n1: 1.84M
3 devices, 2 threads, nostats: 2.2M
3 devices, 3 threads, nostats: 2.8M
4 devices, 3 threads, nostats: 3.1M
4 devices, 4 threads, nostats: 3.8M


Thanks for the insights!

Erwan,




* Re: Question: t/io_uring performance
  2021-09-08 16:20                           ` Jens Axboe
@ 2021-09-08 21:24                             ` Hans-Peter Lehmann
  2021-09-08 21:34                               ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-08 21:24 UTC (permalink / raw)
  To: axboe, fio

[-- Attachment #1: Type: text/plain, Size: 2227 bytes --]

> What's the advertised peak random read performance of the devices you are using?

I use 2x Intel P4510 (2 TB) for the experiments (and a third SSD for the OS). The SSDs are advertised at 640k IOPS each (4k random reads). So when I get 1.6M IOPS using 2 threads, I already get a lot more than advertised. Still, I wonder why I cannot get that (or at least something like 1.3M IOPS) using a single core. With 512b blocks, it should also be possible to achieve a bit more than 1.0M IOPS.

> Sounds like IRQs are expensive on your box, it does vary quite a bit between systems.

That could definitely be the case, as the processor (EPYC 7702P) seems to have some NUMA characteristics even when configured as a single node. With NPS=1, I still see a difference of about 10K-50K IOPS when I use cores that would belong to a different NUMA domain than the SSDs. In the measurements above, the interrupts and the benchmark are pinned to a core "near" the SSDs, though.

> Did you turn off iostats? If so, then there's a few things in the kernel config that can cause this. One is BLK_CGROUP_IOCOST, is that enabled?

Yes, I did turn off iostats for both drives but BLK_CGROUP_IOCOST is enabled.

> Might be more if you're still on that old kernel.

I'm on an old kernel but I am also comparing my results with results that you got on the same kernel back in 2019 (my target is ~1.6M like in [0], not something like the insane 2.5M you got recently [1]). I know that it's not a 100% fair comparison because of the different hardware but I still fear that there is some configuration option that I am missing.

> Would be handy to have -g enabled for your perf record and report, since that would show us exactly who's calling the expensive bits.



I did run it with -g (copied the commands from your previous email and just exchanged the pid). You also had the "--no-children" parameter in that command and I guess you were looking for the output without it. You can find the output from a simple "perf report -g" attached.

Thank you again for your help and have a nice day
Hans-Peter

[0]: https://twitter.com/axboe/status/1174777844313911296
[1]: https://lore.kernel.org/io-uring/4af91b50-4a9c-8a16-9470-a51430bd7733@kernel.dk

[-- Attachment #2: output.gz --]
[-- Type: application/gzip, Size: 28245 bytes --]


* Re: Question: t/io_uring performance
  2021-09-08 21:24                             ` Hans-Peter Lehmann
@ 2021-09-08 21:34                               ` Jens Axboe
  2021-09-10 11:25                                 ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2021-09-08 21:34 UTC (permalink / raw)
  To: Hans-Peter Lehmann, fio

On 9/8/21 3:24 PM, Hans-Peter Lehmann wrote:
>> What's the advertised peak random read performance of the devices you are using?
> 
> I use 2x Intel P4510 (2 TB) for the experiments (and a third SSD for
> the OS). The SSDs are advertised to have 640k IOPS (4k random reads).
> So when I get 1.6M IOPS using 2 threads, I already get a lot more than
> advertised. Still, I wonder why I cannot get that (or at least
> something like 1.3M IOPS) using a single core.

You probably could, if t/io_uring was improved to better handle multiple
files. But this is pure speculation, it's definitely more expensive to
drive two drives vs one for these kinds of tests. Just trying to manage
expectations :-)

That said, on my box, 1 drive vs 2, both are core limited:

sudo taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f687f94d000
sqes ptr    = 0x0x7f687f94b000
cq_ring ptr = 0x0x7f687f949000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=2535
IOPS=3478176, IOS/call=32/31, inflight=(128)
IOPS=3491488, IOS/call=32/32, inflight=(128)
IOPS=3476224, IOS/call=32/32, inflight=(128)

and 2 drives, still using just one core:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme3n1 (submitter 0)
[...]
IOPS=3203648, IOS/call=32/31, inflight=(27 64)
IOPS=3173856, IOS/call=32/31, inflight=(64 53)
IOPS=3233344, IOS/call=32/31, inflight=(60 64)

vs using 2 files, but it's really the same drive:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
[...]
IOPS=3439776, IOS/call=32/31, inflight=(64 0)
IOPS=3444704, IOS/call=32/31, inflight=(51 64)
IOPS=3447776, IOS/call=32/31, inflight=(64 64)

That might change without polling, but it does show extra overhead for
polling 2 drives vs just one.

> Using 512b blocks should also be able to achieve a bit more than 1.0M
> IOPS.

Not necessarily, various controllers have different IOPS and bandwidth
limits. I don't have these particular drives myself, so cannot verify
unfortunately.

>> Sounds like IRQs are expensive on your box, it does vary quite a bit between systems.
> 
> That could definitely be the case, as the processor (EPYC 7702P) seems to have some Numa characteristics even when configuring it to be a single node. With NPS=1, I still get a difference of about 10K-50K IOPS when I use the cores that would belong to different Numa domains than the SSDs. In the measurements above, the interrupts and the benchmark are pinned to a core "near" the SSDs, though.
> 
>> Did you turn off iostats? If so, then there's a few things in the kernel config that can cause this. One is BLK_CGROUP_IOCOST, is that enabled?
> 
> Yes, I did turn off iostats for both drives but BLK_CGROUP_IOCOST is enabled.
> 
>> Might be more if you're still on that old kernel.
> 
> I'm on an old kernel but I am also comparing my results with results
> that you got on the same kernel back in 2019 (my target is ~1.6M like
> in [0], not something like the insane 2.5M you got recently [1]). I
> know that it's not a 100% fair comparison because of the different
> hardware but I still fear that there is some configuration option that
> I am missing.

No, you're running something from around that same time, not what I was
running. It'd be the difference between my custom kernel and a similarly
versioned distro kernel.

There's a bit of work to do to ensure that the standard options don't
add too much overhead, or at least that you can work-around it at
runtime.

>> Would be handy to have -g enabled for your perf record and report, since that would show us exactly who's calling the expensive bits.

> I did run it with -g (copied the commands from your previous email and
> just exchanged the pid). You also had the "--no-children" parameter in
> that command and I guess you were looking for the output without it.
> You can find the output from a simple "perf report -g" attached.

I really did want --no-children, the default is pretty useless imho...
But the callgraphs are a must!

-- 
Jens Axboe



* Re: Question: t/io_uring performance
  2021-09-08 17:11                   ` Erwan Velu
@ 2021-09-08 22:37                     ` Erwan Velu
  2021-09-16 21:18                       ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-08 22:37 UTC (permalink / raw)
  To: Jens Axboe, Hans-Peter Lehmann, fio


On 08/09/2021 at 19:11, Erwan Velu wrote:
> [..]
> Enabling 32 poll queues:
> [...]
>
> 2 devices, 2 thread nostats : taskset -c 0,1 t/io_uring -b512 -d128 
> -c32 -s32 -p1 -F1 -B1 -n2 /dev/nvme2n1 /dev/nvme3n1 : 1.84M


Just realized something here: I'm using 2 devices with 2 cores for a 
good 1.84M, but cores 0 and 1 are not on the same physical core.

But I could also use 2 logical cores located on the same physical core.

Thanks to the good Zen2 hyper-threading, this should give me nice numbers.


So I gave it a try!


2 devices on a single physical core bring me up to 1.23M (that's 67% of 
the previous result on two physical cores)

3 devices on two physical cores bring me up to 1.85M

4 devices on two physical cores bring me up to 2.48M

4 devices on three physical cores bring me up to 3.77M


Those are really great numbers ...




* Re: Question: t/io_uring performance
  2021-09-08 21:34                               ` Jens Axboe
@ 2021-09-10 11:25                                 ` Hans-Peter Lehmann
  2021-09-10 11:45                                   ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-10 11:25 UTC (permalink / raw)
  To: axboe, fio

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

> it's definitely more expensive to drive two drives vs one for these kinds of tests. [...] it does show extra overhead for polling 2 drives vs just one.

Oh, right. Sorry. Thanks a lot for the interesting measurements. Also, I somehow missed that you already are at 3.5M IOPS. Crazy.

> No, you're running something from around that same time, not what I was running

I upgraded the Kernel to 5.10.32 (that's the most recent one provided by Canonical for Ubuntu 20.04) and tried again. I'm still stuck at 1.0M IOPS.

> I really did want --no-children, the default is pretty useless imho... But the callgraphs are a must!

I think I now managed to export what you expected (somehow "-g" was not enough and I needed to do "-g graph", even though that should actually be the default). You can find the perf output attached. It was recorded on Kernel 5.10.32.
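
(Concretely, something like this - the pid 2336 is just an example:)

# perf record -g -p 2336 -- sleep 3
# perf report -g graph --no-children > output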

Hans-Peter

[-- Attachment #2: output.gz --]
[-- Type: application/gzip, Size: 4728 bytes --]


* Re: Question: t/io_uring performance
  2021-09-10 11:25                                 ` Hans-Peter Lehmann
@ 2021-09-10 11:45                                   ` Erwan Velu
  0 siblings, 0 replies; 26+ messages in thread
From: Erwan Velu @ 2021-09-10 11:45 UTC (permalink / raw)
  To: Hans-Peter Lehmann, axboe, fio


On 10/09/2021 at 13:25, Hans-Peter Lehmann wrote:
>> it's definitely more expensive to drive two drives vs one for these 
>> kinds of tests. [...] it does show extra overhead for polling 2 
>> drives vs just one.
>
> Oh, right. Sorry. Thanks a lot for the interesting measurements. Also, 
> I somehow missed that you already are at 3.5M IOPS. Crazy.
>
Spoiler: it went up to 6M/core with 3 Optane drives ...

https://twitter.com/axboe/status/1435741761897254916



* Re: Question: t/io_uring performance
  2021-09-08 22:37                     ` Erwan Velu
@ 2021-09-16 21:18                       ` Erwan Velu
  2021-09-21  7:05                         ` Erwan Velu
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-16 21:18 UTC (permalink / raw)
  To: Jens Axboe, Hans-Peter Lehmann, fio


On 09/09/2021 00:37, Erwan Velu wrote:
> [...]
> So I made a try !
>
>
> 2 devices on a single physical core brings me up to 1.23M (that's 67% 
> of the previous result on two physical cores)
>
> 3 devices on a two physical cores bring me up to 1.85M
>
> 4 devices on a two physical cores bring me up to 2.48M
>
> 4 devices on a three physical cores bring me up to 3.77M
>
To make it easier to reproduce, I created a script 
(https://github.com/axboe/fio/pull/1272) that automates all this and 
fixes the usual config mistakes.

Feel free to test it!




* Re: Question: t/io_uring performance
  2021-09-16 21:18                       ` Erwan Velu
@ 2021-09-21  7:05                         ` Erwan Velu
  2021-09-22 14:45                           ` Hans-Peter Lehmann
  0 siblings, 1 reply; 26+ messages in thread
From: Erwan Velu @ 2021-09-21  7:05 UTC (permalink / raw)
  To: Jens Axboe, Hans-Peter Lehmann, fio


On 16/09/2021 at 23:18, Erwan Velu wrote:
>
> On 09/09/2021 00:37, Erwan Velu wrote:
>> [...]
>> So I made a try !
>>
>>
>> 2 devices on a single physical core brings me up to 1.23M (that's 67% 
>> of the previous result on two physical cores)
>>
>> 3 devices on a two physical cores bring me up to 1.85M
>>
>> 4 devices on a two physical cores bring me up to 2.48M
>>
>> 4 devices on a three physical cores bring me up to 3.77M
>>
> To make it easier to reproduce, I created a script 
> (https://github.com/axboe/fio/pull/1272) that automates all this and 
> fixes the usual config mistakes.
>
> Feel free to test it !

Just merged by Jens: https://github.com/axboe/fio/commit/0c6c0ebb82504da41ff2aa7a382923d9713b40ab

Feel free to use t/one-core-peak.sh to test your setup, and feel free to 
comment & patch.
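
(Something like this should do it - the device paths are just examples:)

# t/one-core-peak.sh /dev/nvme0n1 /dev/nvme1n1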

Erwan,



* Re: Question: t/io_uring performance
  2021-09-21  7:05                         ` Erwan Velu
@ 2021-09-22 14:45                           ` Hans-Peter Lehmann
  0 siblings, 0 replies; 26+ messages in thread
From: Hans-Peter Lehmann @ 2021-09-22 14:45 UTC (permalink / raw)
  To: Erwan Velu, Jens Axboe, fio


Thank you for your help with reproducing the benchmarks, Jens and Erwan!

> feel free to comment & patch.

I just submitted a little PR :)

Hans-Peter


Thread overview: 26+ messages
2021-08-25 15:57 Question: t/io_uring performance Hans-Peter Lehmann
2021-08-26  7:27 ` Erwan Velu
2021-08-26 15:57   ` Hans-Peter Lehmann
2021-08-27  7:20     ` Erwan Velu
2021-09-01 10:36       ` Hans-Peter Lehmann
2021-09-01 13:17         ` Erwan Velu
2021-09-01 14:02           ` Hans-Peter Lehmann
2021-09-01 14:05             ` Erwan Velu
2021-09-01 14:17               ` Erwan Velu
2021-09-06 14:26                 ` Hans-Peter Lehmann
2021-09-06 14:41                   ` Erwan Velu
2021-09-08 11:53                   ` Sitsofe Wheeler
2021-09-08 12:22                     ` Jens Axboe
2021-09-08 12:41                       ` Jens Axboe
2021-09-08 16:12                         ` Hans-Peter Lehmann
2021-09-08 16:20                           ` Jens Axboe
2021-09-08 21:24                             ` Hans-Peter Lehmann
2021-09-08 21:34                               ` Jens Axboe
2021-09-10 11:25                                 ` Hans-Peter Lehmann
2021-09-10 11:45                                   ` Erwan Velu
2021-09-08 12:33                 ` Jens Axboe
2021-09-08 17:11                   ` Erwan Velu
2021-09-08 22:37                     ` Erwan Velu
2021-09-16 21:18                       ` Erwan Velu
2021-09-21  7:05                         ` Erwan Velu
2021-09-22 14:45                           ` Hans-Peter Lehmann
