All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Bernd Schubert <bschubert@ddn.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Miklos Szeredi <mszeredi@redhat.com>,
	ZiyangZhang <ZiyangZhang@linux.alibaba.com>,
	Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
Date: Sat, 1 Apr 2023 21:19:11 +0800	[thread overview]
Message-ID: <ZCgvT5sHax67zlZu@ovpn-8-18.pek2.redhat.com> (raw)
In-Reply-To: <bffb98ef-ff50-92f5-7854-7a5245b9ab63@ddn.com>

On Fri, Mar 31, 2023 at 07:13:21PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> > Apply io_uring fused command for supporting zero copy:
> > 
> > 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> > in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> > retrieve it from concurrent fused command.
> > 
> > 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> > fused cmd(zero copy) buffer
> > 
> > 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> > request and submit secondary request; meantime setup complete callback via
> > this API, once secondary request is completed, the complete callback is
> > called for freeing the buffer and completing the fused command
> > 
> > Also request reference is held during fused command lifetime, and this way
> > guarantees that request buffer won't be freed until all inflight fused
> > commands are completed.
> > 
> > userspace(only implement sqe128 fused command):
> > 
> > 	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> > 
> > liburing test(only implement normal sqe fused command: two 64byte SQEs)
> > 
> > 	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
> >   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
> >   include/uapi/linux/ublk_cmd.h |   6 +-
> >   3 files changed, 303 insertions(+), 21 deletions(-)
> > 
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 1713b2890abb..7b7aa24e9729 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -297,18 +297,126 @@ with specified IO tag in the command data:
> >     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> >     the server buffer (pages) read to the IO request pages.
> >   
> > -Future development
> > -==================
> > +- ``UBLK_IO_FUSED_SUBMIT_IO``
> > +
> > +  Used for implementing zero copy feature.
> > +
> > +  It has to been the primary command of io_uring fused command. This command
> 
> s/been/be/ ?

Yeah.

> 
> > +  submits the generic secondary IO request with io buffer provided by our primary
> > +  command, and won't be completed until the secondary request is done.
> > +
> > +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> > +  actually ublk request buffer's reference, and the reference is shared &
> > +  read-only, so the generic secondary request can retrieve any part of the buffer
> > +  by passing buffer offset & length.
> >   
> >   Zero copy
> > ----------
> > +=========
> > +
> > +What is zero copy?
> > +------------------
> > +
> > +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> > +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> > +for submitting data to ublk driver, and all kinds of these buffers are
> > +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> > +zero copy, data in these buffers has to be copied to ublk server userspace
> > +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> > +server can handle IO for ``/dev/ublkb*`` with the copied data.
> > +
> > +The extra copy between ublk request buffer and ublk server userspace buffer
> > +not only increases CPU utilization(such as pinning pages, copy data), but
> > +also consumes memory bandwidth, and the cost could be very big when IO size
> > +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> > +extra copy can be avoided.
> > +
> > +So zero copy is very important for supporting high performance block device
> > +in userspace.
> > +
> > +Technical requirements
> > +----------------------
> > +
> > +- ublk request buffer use
> > +
> > +ublk request buffer is represented by bio/bvec, which is immutable, so do
> > +not try to change bvec via buffer reference; data can be read from or
> > +written to the buffer according to buffer direction, but bvec can't be
> > +changed
> > +
> > +- buffer lifetime
> > +
> > +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> > +buffer reference is used. Reference can't outlive the referent buffer. That
> > +means all request buffer references have to be released by ublk server
> > +before ublk driver completes this request, when request buffer ownership
> > +is transferred to upper layer(FS, application, ...).
> > +
> > +Also after ublk request is completed, any page belonging to this ublk
> > +request can not be written or read any more from ublk server since it is
> > +one block device from kernel viewpoint.
> > +
> > +- buffer direction
> > +
> > +For ublk WRITE request, ublk request buffer should only be accessed as data
> > +source, and the buffer can't be written by ublk server
> > +
> > +For ublk READ request, ublk request buffer should only be accessed as data
> > +destination, and the buffer can't be read by ublk server, otherwise kernel
> > +data is leaked to ublk server, which can be unprivileged application.
> > +
> > +- arbitrary size sub-buffer needs to be retrieved from ublk server
> > +
> > +ublk is one generic framework for implementing block device in userspace,
> > +and typical requirements include logical volume manager(mirror, stripped, ...),
> > +distributed network storage, compressed target, ...
> > +
> > +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> > +ublk server needs to submit IOs with these sub-buffer(s). That also means
> > +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> > +
> > +Any sub-buffer is actually one reference of ublk request buffer, which
> > +ownership can't be transferred to upper layer if any reference is held
> > +by ublk server.
> > +
> > +Why slice isn't good for ublk zero copy
> 
> slice

Good catch.

> 
> > +---------------------------------------
> > +
> > +- spliced page from ->splice_read() can't be written
> > +
> > +ublk READ request can't be handled because spliced page can't be written to, and
> > +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> > +
> > +- it is very hard to meet above requirements  wrt. request buffer lifetime
> > +
> > +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> > +attention to ublk request buffer lifetime. If is very inefficient to respect
> > +request buffer lifetime by using all pipe buffer's ->release() which requires
> > +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> > +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> > +provided buffer, and the pipe needs to be populated with pages in ublk request
> > +buffer.
> > +
> > +
> > +io_uring fused command based zero copy
> > +--------------------------------------
> > +
> > +io_uring fused command includes one primary command(uring command) and one
> > +generic secondary request. The primary command is responsible for submitting
> > +secondary request with provided buffer from ublk request, and primary command
> > +won't be completed until the secondary request is completed.
> > +
> > +Typical ublk IO handling includes network and FS IO, so it is usual enough
> > +for io_uring net & fs to support IO with provided buffer from primary command.
> >   
> > -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> 
> Maybe replace 'requirement' with "very useful"? It works without zc, 
> just with limited performance.

OK, also it has been observed that the extra copy affects performance a log for
big IO size(64K+). 

> 
> > -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> > -can't be remapped any more in kernel with existing mm interfaces. This can
> > -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> > -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> > +Once primary command is submitted successfully, ublk driver guarantees that
> > +the ublk request buffer won't be gone away since secondary request actually
> > +grabs the buffer's reference. This way also guarantees that multiple
> > +concurrent fused commands associated with same request buffer works fine,
> > +as the provided buffer reference is shared & read-only.
> >   
> > +Also buffer usage direction flag is passed to primary command from userspace,
> > +so ublk driver can validate if it is legal to use buffer with requested
> > +direction.
> >   
> >   References
> >   ==========
> > @@ -323,4 +431,4 @@ References
> >   
> >   .. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
> >   
> > -.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
> > +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
> 
> 
> Any chance you could add a few code path into this document? For example 
> as in Documentation/filesystems/fuse.rst,
> section "Kernel - userspace interface"

OK, just it isn't handy to describe the flow by pure text.

BTW, you can also see the code path in the link:

https://github.com/ming1/ubdsrv/blob/master/doc/ublk_intro.pdf


thanks,
Ming


  reply	other threads:[~2023-04-01 13:20 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
2023-03-30 11:36 ` [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes Ming Lei
2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-01 14:35   ` Ming Lei
2023-03-30 11:36 ` [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request Ming Lei
2023-03-30 11:36 ` [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
2023-03-30 11:36 ` [PATCH V6 07/17] block: ublk_drv: add common exit handling Ming Lei
2023-03-30 11:36 ` [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
2023-03-30 11:36 ` [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
2023-03-30 11:36 ` [PATCH V6 10/17] block: ublk_drv: clean up several helpers Ming Lei
2023-03-30 11:36 ` [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
2023-03-31 16:22   ` Bernd Schubert
2023-03-30 11:36 ` [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
2023-03-30 11:36 ` [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
2023-03-30 11:36 ` [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
2023-03-30 11:36 ` [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
2023-03-31 19:13   ` Bernd Schubert
2023-04-01 13:19     ` Ming Lei [this message]
2023-03-31 19:55   ` Bernd Schubert
2023-04-01 13:22     ` Ming Lei
2023-04-03  9:25     ` Bernd Schubert
2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-03  1:24   ` Jens Axboe
2023-04-04  7:48     ` Ming Lei
2023-04-03  1:23 ` (subset) " Jens Axboe
2023-04-18 19:38 ` Bernd Schubert
2023-04-19  1:51   ` Ming Lei
2023-04-19  9:56     ` Bernd Schubert
2023-04-19 11:19       ` Ming Lei
2023-04-19 15:42         ` Bernd Schubert
2023-04-20  1:18           ` Pavel Begunkov
2023-04-20  1:38           ` Ming Lei
2023-04-21 22:38             ` Bernd Schubert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZCgvT5sHax67zlZu@ovpn-8-18.pek2.redhat.com \
    --to=ming.lei@redhat.com \
    --cc=ZiyangZhang@linux.alibaba.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bschubert@ddn.com \
    --cc=dan.j.williams@intel.com \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.