From: Hans-Peter Lehmann <hans-peter.lehmann@kit.edu>
To: <axboe@kernel.dk>, "fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: Question: t/io_uring performance
Date: Wed, 8 Sep 2021 18:12:44 +0200	[thread overview]
Message-ID: <1cf066bb-aa71-1403-c80c-454ea87a9502@kit.edu> (raw)
In-Reply-To: <8d6acc34-5078-c023-fcc8-cb34b63e5112@kernel.dk>


Hi Jens,

thank you for your reply. Given that you have read the thread past the first reply, I think some of the questions from your first email are no longer relevant. I have still answered them at the bottom for completeness, but I will address the more interesting ones first.

> I turn off iostats and merging for the device.

Doing this helped quite a bit: the 512b reads went from 715K to 800K IOPS, and the 4096b reads went from 570K to 630K IOPS.
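
For reference, this is what I set (paths assume the device is /dev/nvme0n1): "iostats=0" turns off per-device accounting, and "nomerges=2" disables all merge attempts.

# echo 0 > /sys/block/nvme0n1/queue/iostats
# echo 2 > /sys/block/nvme0n1/queue/nomerges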

> Note that you'll need to configure NVMe to properly use polling. I use
> 32 poll queues, number isn't really that important for single core
> testing, as long as there's enough to have a poll queue local to CPU
> being tested on.

My SSD was configured to use 128/0/0 default/read/poll queues. I added "nvme.poll_queues=32" to the kernel command line via GRUB and rebooted, which changed it to 96/0/32. I now get 1.0M IOPS (512b blocks) and 790K IOPS (4096b blocks) using a single core. Thank you very much, this probably was the main bottleneck. Launching two instances of the benchmark in parallel with 512b blocks, I get 1.4M IOPS in total.
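
In case it helps others: I added the parameter to the kernel command line in /etc/default/grub (keeping whatever was already there), regenerated the config (update-grub on Debian/Ubuntu; grub2-mkconfig elsewhere), and verified the queue split after the reboot:

# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... nvme.poll_queues=32"
# update-grub && reboot
[...]
# dmesg | grep 'default/read/poll'
nvme nvme0: 96/0/32 default/read/poll queues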

Starting a single-threaded t/io_uring on two SSDs still achieves "only" 1.0M IOPS, regardless of block size. In your benchmarks from 2019 [0], when Linux 5.4 (which I am using) was current, you achieved 1.6M IOPS (4096b blocks) using a single core. I only get the full 1.6M IOPS saturating both SSDs (4096b blocks) when running t/io_uring with two threads. This makes me think that there is still a configuration option I am missing. Most of the time is spent in the kernel.
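
For clarity, "two threads" here means two separate t/io_uring processes pinned to different cores, one device each (core 49 is just an example; both cores should be local to the SSDs' NUMA node):

# taskset -c 48 t/io_uring -b4096 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 &
# taskset -c 49 t/io_uring -b4096 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme1n1 &

The single-process run over both devices looks like this: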

# time taskset -c 48 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
i 8, argc 10
Added file /dev/nvme0n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f78fb740000
sqes ptr    = 0x0x7f78fb73e000
cq_ring ptr = 0x0x7f78fb73c000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=2336
IOPS=1014252, IOS/call=31/31, inflight=102 (38, 64)
IOPS=1017984, IOS/call=31/31, inflight=123 (64, 59)
IOPS=1018220, IOS/call=31/31, inflight=102 (38, 64)
[...]
real    0m7.898s
user    0m0.144s
sys     0m7.661s

I attached a perf output to the email. It was generated using the same parameters as above (getting 1.0M IOPS).
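
Roughly, the capture was done with something like the following while the benchmark was running, recording only the pinned core:

# perf record -g -C 48 -- sleep 5
# perf report --stdio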

Thank you very much for your help. I am looking forward to hearing from you again so that I can fully reproduce your measurements soon.
Hans-Peter


=== Answers to (I think) no longer relevant questions ===

> The options I run t/io_uring with have been posted multiple times, it's this one

This is the same configuration that I ran (I just did not explicitly specify the parameters that match the defaults).

> Make sure your nvme device is using 'none' as the IO scheduler.

The scheduler is set to 'none'.
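
Verified directly via sysfs for both devices; the bracketed entry is the active scheduler:

# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline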

> Is this a gen2 optane?

It is not an Optane disk, but I also do not expect to get insanely high numbers like in your recent benchmarks. I just want to get closer to the old benchmarks, using two SSDs.


=== References ===

[0]: https://twitter.com/axboe/status/1174777844313911296

[-- Attachment #2: perf-output.gz --]
[-- Type: application/gzip, Size: 2529 bytes --]
