From: Jens Axboe <axboe@kernel.dk>
To: Hans-Peter Lehmann <hans-peter.lehmann@kit.edu>,
	"fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: Question: t/io_uring performance
Date: Wed, 8 Sep 2021 15:34:27 -0600
Message-ID: <5668f23f-49b3-1c37-1029-dabe996f7bd0@kernel.dk>
In-Reply-To: <47a5597f-4b7a-fcfc-b57d-2b46c86c0817@kit.edu>

On 9/8/21 3:24 PM, Hans-Peter Lehmann wrote:
>> What's the advertised peak random read performance of the devices you are using?
> 
> I use 2x Intel P4510 (2 TB) for the experiments (and a third SSD for
> the OS). The SSDs are advertised to have 640k IOPS (4k random reads).
> So when I get 1.6M IOPS using 2 threads, I already get a lot more than
> advertised. Still, I wonder why I cannot get that (or at least
> something like 1.3M IOPS) using a single core.

You probably could, if t/io_uring were improved to better handle multiple
files. But that is pure speculation; it's definitely more expensive to
drive two drives vs one for these kinds of tests. Just trying to manage
expectations :-)

That said, on my box, 1 drive vs 2, both are core limited:

sudo taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 /dev/nvme1n1
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f687f94d000
sqes ptr    = 0x0x7f687f94b000
cq_ring ptr = 0x0x7f687f949000
polled=1, fixedbufs=1, register_files=1, buffered=0 QD=128, sq_ring=128, cq_ring=256
submitter=2535
IOPS=3478176, IOS/call=32/31, inflight=(128)
IOPS=3491488, IOS/call=32/32, inflight=(128)
IOPS=3476224, IOS/call=32/32, inflight=(128)
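
For reference, a rough decoding of those flags (worth double-checking
against t/io_uring in your fio tree, the options have moved around over
time):

  -b512   512-byte block size
  -d128   queue depth of 128
  -s32    submit in batches of up to 32 sqes
  -c32    reap completions in batches of up to 32
  -p1     polled IO
  -F1     registered (fixed) files
  -B1     registered (fixed) buffers
  -n1     a single submitter thread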

and 2 drives, still using just one core:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme3n1 (submitter 0)
[...]
IOPS=3203648, IOS/call=32/31, inflight=(27 64)
IOPS=3173856, IOS/call=32/31, inflight=(64 53)
IOPS=3233344, IOS/call=32/31, inflight=(60 64)

vs using 2 files, but it's really the same drive:

Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
[...]
IOPS=3439776, IOS/call=32/31, inflight=(64 0)
IOPS=3444704, IOS/call=32/31, inflight=(51 64)
IOPS=3447776, IOS/call=32/31, inflight=(64 64)

That might change without polling, but it does show extra overhead for
polling 2 drives vs just one.
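
If you want the IRQ-driven numbers for comparison, the same run with
polling turned off would be something along these lines (untested here,
and IRQ affinity matters a lot more in that mode):

sudo taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -n1 /dev/nvme1n1 /dev/nvme3n1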

> Using 512b blocks should also be able to achieve a bit more than 1.0M
> IOPS.

Not necessarily; various controllers have different IOPS and bandwidth
limits. I don't have these particular drives myself, so I can't verify,
unfortunately.

>> Sounds like IRQs are expensive on your box, it does vary quite a bit between systems.
> 
> That could definitely be the case, as the processor (EPYC 7702P) seems
> to have some NUMA characteristics even when configured as a single
> node. With NPS=1, I still get a difference of about 10K-50K IOPS when
> I use the cores that would belong to different NUMA domains than the
> SSDs. In the measurements above, the interrupts and the benchmark are
> pinned to a core "near" the SSDs, though.
> 
>> Did you turn off iostats? If so, then there's a few things in the kernel config that can cause this. One is BLK_CGROUP_IOCOST, is that enabled?
> 
> Yes, I did turn off iostats for both drives but BLK_CGROUP_IOCOST is enabled.
> 
>> Might be more if you're still on that old kernel.
> 
> I'm on an old kernel but I am also comparing my results with results
> that you got on the same kernel back in 2019 (my target is ~1.6M like
> in [0], not something like the insane 2.5M you got recently [1]). I
> know that it's not a 100% fair comparison because of the different
> hardware but I still fear that there is some configuration option that
> I am missing.

No, you're running something from around that same time, not what I was
running. It'd be the difference between my custom kernel and a similarly
versioned distro kernel.

There's a bit of work to do to ensure that the standard options don't
add too much overhead, or at least that you can work around it at
runtime.
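
Roughly the kind of runtime checks I mean, with example device names
that you'd need to adjust for your box:

cat /sys/block/nvme1n1/queue/iostats             # 1 = iostats still enabled
echo 0 | sudo tee /sys/block/nvme1n1/queue/iostats
grep BLK_CGROUP_IOCOST /boot/config-$(uname -r)  # or zgrep the same symbol in /proc/config.gz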

>> Would be handy to have -g enabled for your perf record and report, since that would show us exactly who's calling the expensive bits.

> I did run it with -g (copied the commands from your previous email and
> just exchanged the pid). You also had the "--no-children" parameter in
> that command and I guess you were looking for the output without it.
> You can find the output from a simple "perf report -g" attached.

I really did want --no-children; the default is pretty useless imho...
But the callgraphs are a must!
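
Something along these lines, substituting the pid of the running
t/io_uring process (the exact sampling duration doesn't matter much):

perf record -g -p <pid> -- sleep 3
perf report -g --no-children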

-- 
Jens Axboe

