From: Jens Axboe <axboe@kernel.dk>
To: Hans-Peter Lehmann <hans-peter.lehmann@kit.edu>,
	"fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: Question: t/io_uring performance
Date: Wed, 8 Sep 2021 15:34:27 -0600
Message-ID: <5668f23f-49b3-1c37-1029-dabe996f7bd0@kernel.dk>
In-Reply-To: <47a5597f-4b7a-fcfc-b57d-2b46c86c0817@kit.edu>

On 9/8/21 3:24 PM, Hans-Peter Lehmann wrote:
>> What's the advertised peak random read performance of the devices you are using?
> 
> I use 2x Intel P4510 (2 TB) for the experiments (and a third SSD for
> the OS). The SSDs are advertised to have 640k IOPS (4k random reads).
> So when I get 1.6M IOPS using 2 threads, I already get a lot more than
> advertised. Still, I wonder why I cannot get that (or at least
> something like 1.3M IOPS) using a single core.

You probably could, if t/io_uring were improved to better handle multiple
files. But that is pure speculation; it's definitely more expensive to
drive two drives vs one for these kinds of tests. Just trying to manage
expectations :-)

That said, on my box, 1 drive vs 2, both are core limited:

sudo taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f687f94d000
sqes ptr    = 0x0x7f687f94b000
cq_ring ptr = 0x0x7f687f949000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=2535
IOPS=3478176, IOS/call=32/31, inflight=(128)
IOPS=3491488, IOS/call=32/32, inflight=(128)
IOPS=3476224, IOS/call=32/32, inflight=(128)
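
For reference, a rough decoding of those flags (worth double-checking
against t/io_uring in your fio tree, the options have moved around over
time):

  -b512   512-byte block size
  -d128   queue depth of 128
  -s32    submit in batches of up to 32 sqes
  -c32    reap completions in batches of up to 32
  -p1     polled IO
  -F1     registered (fixed) files
  -B1     registered (fixed) buffers
  -n1     a single submitter thread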

and 2 drives, still using just one core:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme3n1 (submitter 0)
[...]
IOPS=3203648, IOS/call=32/31, inflight=(27 64)
IOPS=3173856, IOS/call=32/31, inflight=(64 53)
IOPS=3233344, IOS/call=32/31, inflight=(60 64)

vs using 2 files, but it's really the same drive:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
[...]
IOPS=3439776, IOS/call=32/31, inflight=(64 0)
IOPS=3444704, IOS/call=32/31, inflight=(51 64)
IOPS=3447776, IOS/call=32/31, inflight=(64 64)

That might change without polling, but it does show extra overhead for
polling 2 drives vs just one.
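
If you want the IRQ-driven numbers for comparison, the same run with
polling turned off would be something along these lines (untested here,
and IRQ affinity matters a lot more in that mode):

sudo taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -n1 /dev/nvme1n1 /dev/nvme3n1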

> Using 512b blocks should also be able to achieve a bit more than 1.0M
> IOPS.

Not necessarily; various controllers have different IOPS and bandwidth
limits. I don't have these particular drives myself, so I can't verify,
unfortunately.

>> Sounds like IRQs are expensive on your box, it does vary quite a bit between systems.
> 
> That could definitely be the case, as the processor (EPYC 7702P) seems
> to have some NUMA characteristics even when configured as a single
> node. With NPS=1, I still get a difference of about 10K-50K IOPS when
> I use the cores that would belong to different NUMA domains than the
> SSDs. In the measurements above, the interrupts and the benchmark are
> pinned to a core "near" the SSDs, though.
> 
>> Did you turn off iostats? If so, then there's a few things in the kernel config that can cause this. One is BLK_CGROUP_IOCOST, is that enabled?
> 
> Yes, I did turn off iostats for both drives but BLK_CGROUP_IOCOST is enabled.
> 
>> Might be more if you're still on that old kernel.
> 
> I'm on an old kernel but I am also comparing my results with results
> that you got on the same kernel back in 2019 (my target is ~1.6M like
> in [0], not something like the insane 2.5M you got recently [1]). I
> know that it's not a 100% fair comparison because of the different
> hardware but I still fear that there is some configuration option that
> I am missing.

No, you're running something from around that same time, not what I was
running. It'd be the difference between my custom kernel and a similarly
versioned distro kernel.

There's a bit of work to do to ensure that the standard options don't
add too much overhead, or at least that you can work around it at
runtime.
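
Roughly the kind of runtime checks I mean, with example device names
that you'd need to adjust for your box:

cat /sys/block/nvme1n1/queue/iostats             # 1 = iostats still enabled
echo 0 | sudo tee /sys/block/nvme1n1/queue/iostats
grep BLK_CGROUP_IOCOST /boot/config-$(uname -r)  # or zgrep the same symbol in /proc/config.gz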

>> Would be handy to have -g enabled for your perf record and report, since that would show us exactly who's calling the expensive bits.

> I did run it with -g (copied the commands from your previous email and
> just exchanged the pid). You also had the "--no-children" parameter in
> that command and I guess you were looking for the output without it.
> You can find the output from a simple "perf report -g" attached.

I really did want --no-children; the default is pretty useless imho...
But the callgraphs are a must!
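
Something along these lines, substituting the pid of the running
t/io_uring process (the exact sampling duration doesn't matter much):

perf record -g -p <pid> -- sleep 3
perf report -g --no-children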

-- 
Jens Axboe

