linux-block.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] Towards more useful nvme passthrough
       [not found] <CGME20220228093018epcas5p137f53cb05ce95fed2ac173b8fddf2eee@epcas5p1.samsung.com>
@ 2022-02-28  9:25 ` Kanchan Joshi
  2022-03-02 23:40   ` Luis Chamberlain
  0 siblings, 1 reply; 2+ messages in thread
From: Kanchan Joshi @ 2022-02-28  9:25 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-nvme, linux-block

Background & Objective:
-----------------------
New storage interfaces/features, especially in NVMe, are emerging
fast. NVMe now has three command sets (NVM, ZNS and KV), and the list
is only going to grow (e.g. computational storage). Many of these new
commands do not fit well into the existing block abstraction and/or
syscalls. Be it a somewhat specialized operation, or even a new way of
doing classical read/write (e.g. zone-append, the copy command), it
takes a good deal of consensus and time for a new device interface to
climb the ladders of kernel abstractions and become available for
user-space consumption. This presents challenges for early adopters,
and at times leads to kernel-bypass.

The passthrough interface cuts through these abstractions and lets
applications issue arbitrary nvme commands readily, similar to
kernel-bypass solutions. But passthrough does not scale, as it travels
via the synchronous ioctl interface, which is particularly painful for
fast/parallel NVMe storage.

The objective is to revamp the existing passthru interface and turn
it into something that applications can readily use to exploit
new/emerging NVMe features.

Current state of work:
----------------------
1. The block interface is, of course, subject to compatibility
constraints. But nvme now also exposes a generic char interface
(/dev/ng) which is not subject to those conditions [1]. When passthru
is combined with this generic char interface, applications get a
sure-fire way to drive an nvme device with any current/future command
set. This settles the availability problem.

2. For the scalability problem, we are discussing the new "uring-cmd"
facility that Jens proposed in io_uring [2]. It enables using io_uring
for any arbitrary command (ioctl, fsctl, etc.) exposed by the
underlying component (driver, FS, etc.).

3. I have posted patches combining nvme-passthru with uring-cmd [3].
This new uring-passthru path enables a bunch of capabilities: async
transport, fixed buffers, async polling, bio-cache, etc. It scales
well. Below are 512b randread KIOPS, comparing uring-passthru-over-char
(/dev/ng0n1) with uring-over-block (/dev/nvme0n1); pt denotes passthru:

QD    uring    pt    uring-poll    pt-poll
8      538     589      831         902
64     967     1131     1351        1378
256    1043    1230     1376        1429

Discussion points:
------------------
I'd like to propose a session to go over:

- What are the issues in having the above work (uring-cmd and new nvme
passthru) merged?

- What other useful things should be added to nvme-passthru? For
example, the lack of vectored I/O for passthru was one such missing
piece; that is covered from kernel 5.18 onwards [4]. But are there
other things that user-space would need before it starts treating this
path as a good alternative to kernel-bypass?

- Despite the numbers above, nvme passthru has more room for
efficiency. For example, unlike regular I/O, we do a copy_to_user to
fetch the command and a put_user to return the result. Eliminating
some of this may require a new ioctl. There may be other opinions on
what else needs an overhaul in this path.

- What would be a good way to upstream the tests? Nvme-cli may not be
very useful. Should it be similar to fio's sg ioengine? But unlike sg,
here we are combining ng with io_uring, and one would want to retain
all the tunables of io_uring (registered/fixed buffers, sqpoll, etc.).
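A fio-style job for such an ioengine might look like the sketch below.
The engine name and the nvme-specific options are hypothetical
placeholders to illustrate the tunables worth retaining, not an
existing fio interface:

```ini
; Hypothetical uring-passthru fio job; engine name and options are
; placeholders, not an existing fio interface.
[uring-pt-randread]
ioengine=io_uring_cmd   ; assumed engine name
filename=/dev/ng0n1     ; generic char node, not the block device
rw=randread
bs=512
iodepth=64
fixedbufs=1             ; io_uring registered buffers
hipri=1                 ; completion polling
sqthread_poll=1         ; SQPOLL, as in the regular io_uring engine
```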

- All the above is for 2.0 passthru, which essentially forms a direct
path between io_uring and nvme; the io_uring and nvme programming
models share many similarities. For 3.0 passthru, would it be crazy to
trim the path further by eliminating the block layer and doing I/O
without "struct request"? There is some interest in developing
user-space block devices [5] and filesystems anyway.

[1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
[2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
[3] https://lore.kernel.org/linux-nvme/20211220141734.12206-1-joshi.k@samsung.com/
[4] https://lore.kernel.org/linux-nvme/20220216080208.GD10554@lst.de/
[5] https://lore.kernel.org/linux-block/87tucsf0sr.fsf@collabora.com/




* Re: [LSF/MM/BPF TOPIC] Towards more useful nvme passthrough
  2022-02-28  9:25 ` [LSF/MM/BPF TOPIC] Towards more useful nvme passthrough Kanchan Joshi
@ 2022-03-02 23:40   ` Luis Chamberlain
  0 siblings, 0 replies; 2+ messages in thread
From: Luis Chamberlain @ 2022-03-02 23:40 UTC (permalink / raw)
  To: Kanchan Joshi, Dave Jones; +Cc: lsf-pc, linux-nvme, linux-block

On Mon, Feb 28, 2022 at 02:55:11PM +0530, Kanchan Joshi wrote:
> I'd like to propose a session to go over:
> 
> - What are the issues in having the above work (uring-cmd and new nvme
> passthru) merged?

It sounds like we just needed to settle on the formats, plus a few
more eyeballs / Reviewed-by's. No? And it sounds like Jens is about to
post a new series :)

> - What other useful things should be added to nvme-passthru? For
> example, the lack of vectored I/O for passthru was one such missing
> piece; that is covered from kernel 5.18 onwards [4]. But are there
> other things that user-space would need before it starts treating this
> path as a good alternative to kernel-bypass?

I think it would be good to split this into two parts:

 * io-uring cmd extensions
 * what can be extended for nvme

io-uring cmd is not even upstream yet, so I don't think folks widely
realize its potential. It's a bit too early to tell here, so we should
go out and preach at things like Plumbers and other conferences with a
few nice demos of what can be done; nvme is one use case, but I think
it would help to get other users active so this isn't just vaporware.

The problem I'm seeing with this effort, too, is that it relies too
heavily on nvme passthrough as the only use case so far, and that's a
bit too involved. So I'd like to encourage other, simpler users to
consider helping here.

Granted, this is like looking for a nail when you're a hammer. And
the only way to avoid that is to aim smaller: a simple, real demo of
something useful. I don't know... I'd think something like trinity
might have a field day with this.

> - Despite the numbers above, nvme passthru has more room for
> efficiency. For example, unlike regular I/O, we do a copy_to_user to
> fetch the command and a put_user to return the result. Eliminating
> some of this may require a new ioctl. There may be other opinions on
> what else needs an overhaul in this path.

I think we are being too hard on ourselves. Start small, get some
basic stuff up, and allow flexibility for improvement. I think at this
point we have more than a proof of concept now; something tangible?

> - What would be a good way to upstream the tests? Nvme-cli may not be
> very useful. Should it be similar to fio's sg ioengine? But unlike sg,
> here we are combining ng with io_uring, and one would want to retain
> all the tunables of io_uring (registered/fixed buffers, sqpoll, etc.).

If the goal is to help open the door for unsupported commands, then as
far as upstream is concerned, shouldn't we only care about the generic
plumbing? I.e., specific commands which might not yet be baked for
general consumption (like zone append) are left up to implementors to
figure out where they test. Let's use zone append as an example:
without a raw block interface to it, we can use this framework,
ideally... but yeah, how do we test? Are vendors all going to agree to
use microbenchmarks with io-uring cmd?

> - All the above is for 2.0 passthru, which essentially forms a direct
> path between io_uring and nvme; the io_uring and nvme programming
> models share many similarities. For 3.0 passthru, would it be crazy to
> trim the path further by eliminating the block layer and doing I/O
> without "struct request"? There is some interest in developing
> user-space block devices [5] and filesystems anyway.

I failed to capture where 2.0 and 3.0 are defined. Can you elaborate?

  Luis

