From: Jens Axboe <axboe@kernel.dk>
To: io-uring <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: io_uring and Optane2
Date: Fri, 21 Aug 2020 09:58:57 -0600
Message-ID: <4af91b50-4a9c-8a16-9470-a51430bd7733@kernel.dk>

Hi,

One of the key elements for eking out the very last bit of performance
with io_uring is being able to test your design and improvements. I had
a bit of help on that front since Intel got me some Gen2 Optane SSD
samples a while back, and I've been using those to guide improvements -
and, conversely, to catch changes that end up being detrimental to
latency or scalability. I haven't been able to share any numbers on
that until now. So without further ado, here's some insight into what is
possible with io_uring, and the Linux IO stack, today.

Test setup:

Kernel: 5.9.0-rc1
System: Intel Ice Lake-SP Next-Gen Xeon (https://www.servethehome.com/intel-ice-lake-sp-next-gen-xeon-architecture-at-hc32/)
Storage device: Single Gen2 Optane SSD (https://blocksandfiles.com/2020/08/14/intel-gen-2-optane-details/)
Benchmark: t/io_uring from fio
Workload: Single thread random 512b O_DIRECT reads
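
For reference, the QD128 run was invoked with something along these
lines (a sketch, not the verbatim command - the flags are from fio's
t/io_uring as of mid-2020, and the batch sizes and device path are
illustrative):

  taskset -c 0 t/io_uring -d128 -s32 -c32 -b512 -p1 -B1 -F1 /dev/nvme0n1

-d sets the queue depth, -s/-c the submit/complete batch sizes, -b the
block size, -p1 enables polled IO, and -B1/-F1 turn on registered
buffers and files. taskset pins the run to a single core.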

Note that this is utilizing a single core in the system, out of the many
available. t/io_uring is used for low-overhead IO generation, with
io_uring in polled IO mode and using registered buffers and files.
512b IOs are used to keep us well below the bandwidth ceiling.
Throughput is easy, IOPS and latency are harder. My goal here was to
demonstrate what is possible today with io_uring in terms of efficiency.
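
For those wondering what polled IO with registered buffers and files
looks like at the API level, below is a minimal liburing sketch of a
single polled 512b read. This is illustrative only - it is not the
t/io_uring source, /dev/nvme0n1 is a placeholder device, and error
handling is kept to a minimum. Build with -luring:

#define _GNU_SOURCE		/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd, ret;

	/* O_DIRECT is required for polled IO */
	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* IOPOLL ring: completions are reaped by polling the device
	 * rather than waiting for an interrupt. Needs poll queues on
	 * the device (e.g. the nvme poll_queues module parameter). */
	ret = io_uring_queue_init(128, &ring, IORING_SETUP_IOPOLL);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/* Register the buffer and file up front - the pages are pinned
	 * and the file reference held once, instead of per IO. */
	if (posix_memalign(&iov.iov_base, 4096, 512))
		return 1;
	iov.iov_len = 512;
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (!ret)
		ret = io_uring_register_files(&ring, &fd, 1);
	if (ret < 0) {
		fprintf(stderr, "register: %s\n", strerror(-ret));
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	/* with IOSQE_FIXED_FILE, the fd argument is an index into the
	 * registered file table; the last argument indexes the
	 * registered buffer */
	io_uring_prep_read_fixed(sqe, 0, iov.iov_base, 512, 0, 0);
	sqe->flags |= IOSQE_FIXED_FILE;

	io_uring_submit(&ring);
	/* with IOPOLL set, this polls for the completion */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret)
		printf("read: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}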

Results
-----------------------------------------------------------
QD128 :		2.58M IOPS per core (34.9 usec avg latency)
QD16  :		2.06M IOPS per core ( 6.9 usec avg latency)
QD1   :		 290K IOPS per core ( 3.4 usec avg latency)

Outside of showing what's possible with io_uring today, these results
are also a testament to the efficiency of the Linux IO stack in
general. The introduction of blk-mq was as much about general
efficiency as it was about scalability - that was a design criterion
for both blk-mq and io_uring from day 1. Even if you aren't driving
millions of IOPS, or using tons of threads/cores, you still care about
getting your work done in the shortest amount of time, with the fewest
wasted cycles.

More to come...

-- 
Jens Axboe

