From: Jens Axboe <axboe@kernel.dk>
To: Hans-Peter Lehmann <hans-peter.lehmann@kit.edu>,
	"fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: Question: t/io_uring performance
Date: Wed, 8 Sep 2021 10:20:27 -0600	[thread overview]
Message-ID: <4ce4addd-a7c7-f35f-ef3b-b0bf9966e224@kernel.dk> (raw)
In-Reply-To: <1cf066bb-aa71-1403-c80c-454ea87a9502@kit.edu>

On 9/8/21 10:12 AM, Hans-Peter Lehmann wrote:
> Hi Jens,
> 
> thank you for your reply. Given that you read the rest of the thread after your first reply, I think some of the questions from your first email are no longer relevant. I still answered them at the bottom for completeness, but I will address the more interesting ones first.
> 
>> I turn off iostats and merging for the device.
> 
> Doing this helped quite a bit. The 512b reads went from 715K to 800K. The 4096b reads went from 570K to 630K.
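
For anyone following along: both of those are plain sysfs toggles. A
minimal sketch, assuming the device is nvme0n1:

  # turn off per-IO accounting, and disable all merge attempts
  echo 0 > /sys/block/nvme0n1/queue/iostats
  echo 2 > /sys/block/nvme0n1/queue/nomerges
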
> 
>> Note that you'll need to configure NVMe
>> to properly use polling. I use 32 poll queues, number isn't really
>> that important for single core testing, as long as there's enough to
>> have a poll queue local to CPU being tested on.
> 
> My SSD was configured to use 128/0/0 default/read/poll queues. I added
> "nvme.poll_queues=32" to GRUB and rebooted, which changed it to
> 96/0/32. I now get 1.0M IOPS (512b blocks) and 790K IOPS (4096b
> blocks) using a single core. Thank you very much, this probably was
> the main bottleneck. Launching the benchmark two times with 512b
> blocks, I get 1.4M IOPS total.
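
For reference, a sketch of that setup, assuming a Debian-style GRUB
(the config path and update command vary by distro):

  # /etc/default/grub: append the module parameter to the kernel cmdline
  GRUB_CMDLINE_LINUX_DEFAULT="... nvme.poll_queues=32"

  # regenerate the grub config and reboot
  update-grub && reboot

  # after boot, verify the default/read/poll queue split
  dmesg | grep 'default/read/poll'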

Sounds like IRQs are expensive on your box; it does vary quite a bit
between systems.

What's the advertised peak random read performance of the devices you
are using?

> Starting a single-threaded t/io_uring with two SSDs still achieves "only" 1.0M IOPS, regardless of the block size. In your benchmarks from 2019 [0], when Linux 5.4 (which I am using) was current, you achieved 1.6M IOPS (4096b blocks) using a single core. I get the full 1.6M IOPS that saturate both SSDs (4096b blocks) only when running t/io_uring with two threads. This makes me think that there is still another configuration option that I am missing. Most of the time is spent in the kernel.
> 
> # time taskset -c 48 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
> i 8, argc 10
> Added file /dev/nvme0n1 (submitter 0)
> Added file /dev/nvme1n1 (submitter 0)
> sq_ring ptr = 0x0x7f78fb740000
> sqes ptr    = 0x0x7f78fb73e000
> cq_ring ptr = 0x0x7f78fb73c000
> polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
> submitter=2336
> IOPS=1014252, IOS/call=31/31, inflight=102 (38, 64)
> IOPS=1017984, IOS/call=31/31, inflight=123 (64, 59)
> IOPS=1018220, IOS/call=31/31, inflight=102 (38, 64)
> [...]
> real    0m7.898s
> user    0m0.144s
> sys     0m7.661s
> 
> I attached a perf output to the email. It was generated using the same parameters as above (getting 1.0M IOPS).
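
(Decoding the flags, per t/io_uring's usage: -b is the block size, -d
the queue depth, -s/-c the per-call submit/complete batch sizes, -p1
polled I/O, -F1 registered files, and -B1 registered (fixed) buffers.)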

Looking at the perf trace, one issue is pretty apparent:

     7.54%  io_uring  [kernel.kallsyms]  [k] read_tsc                           

which means you're spending ~8% of the workload's time just reading
time stamps. As is often the case once you get near core limits,
that'll realistically cut more than 8% off your perf. Did you turn off
iostats? If so, then there are a few things in the kernel config that
can cause this. One is BLK_CGROUP_IOCOST - is that enabled? There might
be more if you're still on that old kernel.
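
A quick way to check (the /boot path is an assumption, distros differ;
/proc/config.gz needs CONFIG_IKCONFIG_PROC):

  grep BLK_CGROUP_IOCOST /boot/config-$(uname -r)
  # or, if the running kernel exposes its config:
  zgrep BLK_CGROUP_IOCOST /proc/config.gz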

It would be handy to have -g enabled for your perf record and report,
since that would show us exactly who's calling the expensive bits. The
next suspect is memset(), which also looks expensive, but that may be
related to:

https://git.kernel.dk/cgit/linux-block/commit/block/bio.c?id=da521626ac620d8719d674a48b8ec3620eefd42a
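
For the -g capture, something along these lines would do (a sketch; the
CPU number is taken from your taskset invocation above):

  perf record -g -C 48 -- sleep 10
  perf report -g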

-- 
Jens Axboe

