Re: Follow up on UBD discussion

From: Ming Lei <ming.lei@redhat.com>
To: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>,
	Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
	joseph.qi@linux.alibaba.com, linux-block@vger.kernel.org
Subject: Re: Follow up on UBD discussion
Date: Sun, 8 May 2022 01:13:02 +0800	[thread overview]
Message-ID: <YnaonsoDjQjrutRb@T590> (raw)
In-Reply-To: <8a52ed85-3ffa-44a4-3e28-e13cdc793732@linux.alibaba.com>

On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
> On 2022/5/3 16:02, Ming Lei wrote:
> > Hello Gabriel,
> > 
> > CC linux-block and hope you don't mind, :-)
> > 
> > On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> >>
> >> Hi Ming,
> >>
> >> First of all, I hope I didn't put you on the spot too much during the
> >> discussion.  My original proposal was to propose my design, but your
> >> implementation quite solved the questions I had. :)
> > 
> > I think that is open source, then we can put efforts together to make things
> > better.
> > 
> >>
> >> I'd like to follow up to restart the communication and see
> >> where I can help more with UBD.  As I said during the talk, I've
> >> done some fio runs and I was able to saturate NBD much faster than UBD:
> >>
> >> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> > 
> > Yeah, that is true since NBD has extra socket communication cost which
> > can't be efficient as io_uring.
> > 
> >>
> >> I've also wrote some fixes to the initialization path, which I
> >> planned to send to you as soon as you published your code, but I think
> >> you might want to take a look already and see if you want to just squash
> >> it into your code base.
> >>
> >> I pushed those fixes here:
> >>
> >>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> > 
> > I have added the 1st fix and 3rd patch into my tree:
> > 
> > https://github.com/ming1/linux/commits/v5.17-ubd-dev
> > 
> > The added check in 2nd patch is done lockless, which may not be reliable
> > enough, so I didn't add it. Also adding device is in slow path, and no
> > necessary to improve in that code path.
> > 
> > I also cleaned up ubd driver a bit: debug code cleanup, remove zero copy
> > code, remove command of UBD_IO_GET_DATA and always make ubd driver
> > builtin.
> > 
> > ubdsrv part has been cleaned up too:
> > 
> > https://github.com/ming1/ubdsrv
> > 
> >>
> >> I'm looking into adding support for multiple driver queues next, and
> >> should be able to share some patches on that shortly.
> > 
> > OK, please post them on linux-block so that more eyes can look at the
> > code, meantime the ubdsrv side needs to handle MQ too.
> > 
> > Sooner or later, the single ubdsrv task may be saturated by copying data and
> > io_uring command communication only, which can be shown by running io on
> > ubd-null target. In my lattop, the ubdsrv cpu utilization is close to
> > 90% when IOPS is > 500K. So MQ may help some fast backing cases.
> > 
> > 
> > Thanks,
> > Ming
> 
> Hi Ming,
> 
> Now I am learning your userspace block driver(UBD) [1][2] and we plan to
> replace TCMU by UBD as a new choice for implementing userspace bdev for
> its high performance and simplicity.
> 
> First, we have conducted some tests by fio and perf to evaluate UBD.
> 
> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
>      the complicated SCSI layer and does not support multiqueue. However
> UBD is simply using io_uring passthrough and may support multiqueue in
> the future.(Note that even with a single queue now , UBD outperforms TCMU)

MQ isn't hard to support, and it is basically workable now:

https://github.com/ming1/ubdsrv/commits/devel
https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel

Just the affinity of pthread for each queue isn't setup yet.

> 
> 2) Some functions in UBD result in high CPU utilization and we guess
> they also lower throughput. For example, ubdsrv_submit_fetch_commands()
> frequently iterates on the array of UBD IOs and wastes CPU when no IO is
> ready to be submitted. Besides,  ubd_copy_pages() asks CPU to copy data
> between bio vectors and UBD internal buffers while handling write and
> read requests and it could be eliminated by supporting zero-copy.

copy itself doesn't take much cpu, see the following trace:

-   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0                               ▒
   - 30.63% ubd_copy_pages.isra.0                                                                                            ▒
      - 23.86% internal_get_user_pages_fast                                                                                  ▒
         + 21.14% get_user_pages_unlocked                                                                                    ▒
         + 2.62% lockless_pages_from_mm                                                                                      ▒
        6.42% ubd_release_pages.constprop.0

And we may provide option to allow to pin pages in the disk lifetime for avoiding
the cost in _get_user_pages_fast().

zero-copy has to touch page table, and its cost may be expensive too.
The big problem is that MM doesn't provide mechanism to support generic
remapping kernel pages to userspace.

> 
> Second, I'd like to share some ideas on UBD. I'm not sure if they are
> reasonable so please figure out my mistakes.
> 
> 1) UBD issues one sqe to commit last completed request and fetch a new
> one. Then, blk-mq's queue_rq() issues a new UBD IO request and completes
> one cqe for the fetch command. We have evaluated that io_submit_sqes()
> costs some CPU and steps of building a new sqe may lower throughput.
> Here I'd like to give a new solution: never submit sqe but trump up a
> cqe(with information of new UBD IO request) when calling queue_rq(). I
> am inspired by one io_uring flag: IORING_POLL_ADD_MULTI, with which a
> user issues only one sqe for polling an fd and repeatedly gets multiple
> cqes when new events occur. Dose this solution break the architecture of
> UBD?

But each cqe has to be associated with one sqe, if I understand
correctly.

I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
And yes, batching is really important for UBD's performance.

> 
> 2) UBDSRV(the userspace part) should not allocate data buffers itself.
> When an application configs many queues with bigger iodepth, UBDSRV has
> to preallocate more buffers(size = 256KiB) and results in heavy memory
> overhead. I think data buffers should be allocated by applications

That is just virtual memory, and pages can be reclaimed after IO is
done.

> themselves and passed to UBDSRV. In this way UBD offers more
> flexibility. However, while handling a write request, the control flow
> returns to the kernel part again to set buf addr and copy data from bio
> vectors. Is ioctl helpful by setting buf addr and copying write data to
> app buf?

It is pretty easy to pass application buffer to UBD_IO_FETCH_REQ or
UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr which
is sent to ubd driver via ubdsrv_io_cmd->addr.

No need any ioctl, and io_uring command can handle everything.

I think the idea is good, and we can provide one option for using
pre-allocated buffer or application buffer.

But the application buffer has to be in same process VM space with ubdsrv
daemon, otherwise it becomes slower to pin these application
buffers/pages.

> 
> 3) ubd_fetch_and_submit() frequently iterates on the array of ubd IOs
> and wastes CPU when no IO is ready to be submitted. I think it can be
> optimized by adding a new array storing UBD IOs that are ready to be
> commit back to the kernel part. Then we could batch these IOs and avoid
> unnecessary iterations on IOs which are not ready(fetching or handling
> by targets).

That should be easy to avoid the whole queue iteration, but my perf
trace doesn't show ubd_fetch_and_submit() consumes too much CPU.

> 
> 4) Zero-copy support is important and we are trying to implement it now.

I talked with Xiaoguang wrt. zero-copy support, and looks it isn't ready
as one generic approach. If it is ready, it is easy to integrate to UBD.

> 
> 5) Currently, UBD only support the loop target with io_uirng and all
> works(1.get one cqe 2.issue target io_uring IO 3.get target io_uring IO
> completion 4.prepare one sqe) are done in one thread. As far as I know,

loop is one example, and it provides similar function with kernel loop by
< 200 lines of userspace code.

>  some applications such as SPDK, network fs and customized distribution
> systems do not support io_uring well.  I think we should separate target
> IO handling from the UBDSRV loop and allow applications handle target
> IOs themselves. Is this suggestion reasonable? (Or UBD should focus on
> io_uring-supported targets?)

UBD provides one framework for implementing userspace block driver, you
can do everything for handling the IO in userspace. The target code just
needs to implement callbacks defined in ubdsrv_tgt_type, so it has been
separated from ubd loop already. But UBD is still in early stage,
and the interface will continue to improve or re-design. Or can you
explain your ideas in a bit details? It could be very helpful if you
can provide some application background.

Reason why I suggested to use io_uring is that io_uring is very efficient, also
async IO has been proved as very efficient approach for handling io.

Thanks
Ming