* Re: Follow up on UBD discussion
       [not found] <874k27rfwm.fsf@collabora.com>
@ 2022-05-03  8:02 ` Ming Lei
  2022-05-07  4:20   ` ZiyangZhang
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-05-03  8:02 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: linux-block

Hello Gabriel,

CC linux-block and hope you don't mind, :-)

On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> 
> Hi Ming,
> 
> First of all, I hope I didn't put you on the spot too much during the
> discussion.  My original plan was to propose my design, but your
> implementation largely answered the questions I had. :)

I think that is the open source way: we can put our efforts together to
make things better.

> 
> I'd like to follow up to restart the communication and see
> where I can help more with UBD.  As I said during the talk, I've
> done some fio runs and I was able to saturate NBD much faster than UBD:
> 
> https://people.collabora.com/~krisman/mingl-ubd/bw.png

Yeah, that is true, since NBD has extra socket communication cost and
can't be as efficient as io_uring.

> 
> I've also written some fixes to the initialization path, which I
> planned to send to you as soon as you published your code, but I think
> you might want to take a look already and see if you want to just squash
> it into your code base.
> 
> I pushed those fixes here:
> 
>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd

I have added the 1st fix and 3rd patch into my tree:

https://github.com/ming1/linux/commits/v5.17-ubd-dev

The added check in the 2nd patch is done locklessly, which may not be
reliable enough, so I didn't add it. Also, adding a device is a slow
path, so there is no need to optimize that code path.

I also cleaned up the ubd driver a bit: cleaned up debug code, removed
the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
ubd driver always built-in.

ubdsrv part has been cleaned up too:

https://github.com/ming1/ubdsrv

> 
> I'm looking into adding support for multiple driver queues next, and
> should be able to share some patches on that shortly.

OK, please post them on linux-block so that more eyes can look at the
code; meanwhile the ubdsrv side needs to handle MQ too.

Sooner or later, the single ubdsrv task may be saturated just by copying
data and handling io_uring command communication, which can be shown by
running IO against the ubd-null target. On my laptop, the ubdsrv CPU
utilization is close to 90% when IOPS is > 500K. So MQ may help some
fast backing cases.


Thanks,
Ming



* Re: Follow up on UBD discussion
  2022-05-03  8:02 ` Follow up on UBD discussion Ming Lei
@ 2022-05-07  4:20   ` ZiyangZhang
  2022-05-07 17:13     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: ZiyangZhang @ 2022-05-07  4:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On 2022/5/3 16:02, Ming Lei wrote:
> Hello Gabriel,
> 
> CC linux-block and hope you don't mind, :-)
> 
> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
>>
>> Hi Ming,
>>
>> First of all, I hope I didn't put you on the spot too much during the
>> discussion.  My original plan was to propose my design, but your
>> implementation largely answered the questions I had. :)
> 
> I think that is the open source way: we can put our efforts together to
> make things better.
> 
>>
>> I'd like to follow up to restart the communication and see
>> where I can help more with UBD.  As I said during the talk, I've
>> done some fio runs and I was able to saturate NBD much faster than UBD:
>>
>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> 
> Yeah, that is true, since NBD has extra socket communication cost and
> can't be as efficient as io_uring.
> 
>>
>> I've also written some fixes to the initialization path, which I
>> planned to send to you as soon as you published your code, but I think
>> you might want to take a look already and see if you want to just squash
>> it into your code base.
>>
>> I pushed those fixes here:
>>
>>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> 
> I have added the 1st fix and 3rd patch into my tree:
> 
> https://github.com/ming1/linux/commits/v5.17-ubd-dev
> 
> The added check in the 2nd patch is done locklessly, which may not be
> reliable enough, so I didn't add it. Also, adding a device is a slow
> path, so there is no need to optimize that code path.
> 
> I also cleaned up the ubd driver a bit: cleaned up debug code, removed
> the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
> ubd driver always built-in.
> 
> ubdsrv part has been cleaned up too:
> 
> https://github.com/ming1/ubdsrv
> 
>>
>> I'm looking into adding support for multiple driver queues next, and
>> should be able to share some patches on that shortly.
> 
> OK, please post them on linux-block so that more eyes can look at the
> code; meanwhile the ubdsrv side needs to handle MQ too.
> 
> Sooner or later, the single ubdsrv task may be saturated just by copying
> data and handling io_uring command communication, which can be shown by
> running IO against the ubd-null target. On my laptop, the ubdsrv CPU
> utilization is close to 90% when IOPS is > 500K. So MQ may help some
> fast backing cases.
> 
> 
> Thanks,
> Ming

Hi Ming,

Now I am studying your userspace block driver (UBD) [1][2], and we plan
to replace TCMU with UBD as a new choice for implementing userspace
block devices, given its high performance and simplicity.

First, we have conducted some tests with fio and perf to evaluate UBD.

1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
the complicated SCSI layer and does not support multiqueue, while UBD
simply uses io_uring passthrough and may support multiqueue in the
future. (Note that even with a single queue, UBD already outperforms
TCMU.)

2) Some functions in UBD show high CPU utilization, and we guess they
also lower throughput. For example, ubdsrv_submit_fetch_commands()
frequently iterates over the array of UBD IOs and wastes CPU when no IO
is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
data between bio vectors and UBD internal buffers while handling write
and read requests; this could be eliminated by supporting zero-copy.

Second, I'd like to share some ideas on UBD. I'm not sure if they are
reasonable, so please point out my mistakes.

1) UBD issues one sqe to commit the last completed request and fetch a
new one. Then blk-mq's queue_rq() issues a new UBD IO request and
completes one cqe for the fetch command. We have measured that
io_submit_sqes() costs some CPU, and the steps of building a new sqe may
lower throughput. Here I'd like to propose a new solution: never submit
an sqe, but fabricate a cqe (carrying the information of the new UBD IO
request) when queue_rq() is called. I am inspired by one io_uring flag,
IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
an fd and repeatedly gets multiple cqes when new events occur. Does this
solution break the architecture of UBD?

2) UBDSRV (the userspace part) should not allocate data buffers itself.
When an application configures many queues with a large iodepth, UBDSRV
has to preallocate many buffers (size = 256 KiB each), which results in
heavy memory overhead. I think data buffers should be allocated by
applications themselves and passed to UBDSRV; in this way UBD offers
more flexibility. However, while handling a write request, the control
flow returns to the kernel part again to set the buffer address and copy
data from bio vectors. Would an ioctl help, to set the buffer address
and copy the write data to the app buffer?

3) ubd_fetch_and_submit() frequently iterates over the array of ubd IOs
and wastes CPU when no IO is ready to be submitted. I think it can be
optimized by adding a new array storing the UBD IOs that are ready to be
committed back to the kernel part. Then we could batch these IOs and
avoid unnecessary iterations over IOs which are not ready (still being
fetched or handled by targets). A rough sketch of this idea follows
after 5) below.

4) Zero-copy support is important and we are trying to implement it now.

5) Currently, UBD only supports the loop target with io_uring, and all
the work (1. get one cqe; 2. issue the target io_uring IO; 3. get the
target io_uring IO completion; 4. prepare one sqe) is done in one
thread. As far as I know, some applications such as SPDK, network
filesystems and customized distributed systems do not support io_uring
well. I think we should separate target IO handling from the UBDSRV loop
and allow applications to handle target IOs themselves. Is this
suggestion reasonable? (Or should UBD focus on io_uring-supported
targets?)
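
For 3), a rough sketch of what I mean (untested; UBD_QUEUE_DEPTH, the
struct and queue_commit_and_fetch_sqe() are made-up names, not from the
current ubdsrv code):

struct ubd_ready_list {
        unsigned short tags[UBD_QUEUE_DEPTH];   /* tags ready to commit */
        unsigned int nr_ready;
};

/* called when the target finishes one IO */
static void mark_io_ready(struct ubd_ready_list *rl, unsigned short tag)
{
        rl->tags[rl->nr_ready++] = tag;
}

/* called from the queue's event loop: only ready IOs are touched */
static void submit_ready_ios(struct ubd_ready_list *rl)
{
        for (unsigned int i = 0; i < rl->nr_ready; i++)
                queue_commit_and_fetch_sqe(rl->tags[i]);
        rl->nr_ready = 0;       /* one batched io_uring_submit() follows */
}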

Regards,
Zhang

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/ming1/linux/commits/v5.17-ubd-dev


* Re: Follow up on UBD discussion
  2022-05-07  4:20   ` ZiyangZhang
@ 2022-05-07 17:13     ` Ming Lei
  2022-05-09  4:05       ` Xiaoguang Wang
                         ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ming Lei @ 2022-05-07 17:13 UTC (permalink / raw)
  To: ZiyangZhang
  Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
> On 2022/5/3 16:02, Ming Lei wrote:
> > Hello Gabriel,
> > 
> > CC linux-block and hope you don't mind, :-)
> > 
> > On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> >>
> >> Hi Ming,
> >>
> >> First of all, I hope I didn't put you on the spot too much during the
> >> discussion.  My original plan was to propose my design, but your
> >> implementation largely answered the questions I had. :)
> > 
> > I think that is the open source way: we can put our efforts together to
> > make things better.
> > 
> >>
> >> I'd like to follow up to restart the communication and see
> >> where I can help more with UBD.  As I said during the talk, I've
> >> done some fio runs and I was able to saturate NBD much faster than UBD:
> >>
> >> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> > 
> > Yeah, that is true, since NBD has extra socket communication cost and
> > can't be as efficient as io_uring.
> > 
> >>
> >> I've also written some fixes to the initialization path, which I
> >> planned to send to you as soon as you published your code, but I think
> >> you might want to take a look already and see if you want to just squash
> >> it into your code base.
> >>
> >> I pushed those fixes here:
> >>
> >>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> > 
> > I have added the 1st fix and 3rd patch into my tree:
> > 
> > https://github.com/ming1/linux/commits/v5.17-ubd-dev
> > 
> > The added check in the 2nd patch is done locklessly, which may not be
> > reliable enough, so I didn't add it. Also, adding a device is a slow
> > path, so there is no need to optimize that code path.
> > 
> > I also cleaned up the ubd driver a bit: cleaned up debug code, removed
> > the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
> > ubd driver always built-in.
> > 
> > ubdsrv part has been cleaned up too:
> > 
> > https://github.com/ming1/ubdsrv
> > 
> >>
> >> I'm looking into adding support for multiple driver queues next, and
> >> should be able to share some patches on that shortly.
> > 
> > OK, please post them on linux-block so that more eyes can look at the
> > code; meanwhile the ubdsrv side needs to handle MQ too.
> > 
> > Sooner or later, the single ubdsrv task may be saturated just by copying
> > data and handling io_uring command communication, which can be shown by
> > running IO against the ubd-null target. On my laptop, the ubdsrv CPU
> > utilization is close to 90% when IOPS is > 500K. So MQ may help some
> > fast backing cases.
> > 
> > 
> > Thanks,
> > Ming
> 
> Hi Ming,
> 
> Now I am studying your userspace block driver (UBD) [1][2], and we plan
> to replace TCMU with UBD as a new choice for implementing userspace
> block devices, given its high performance and simplicity.
>
> First, we have conducted some tests with fio and perf to evaluate UBD.
>
> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
> the complicated SCSI layer and does not support multiqueue, while UBD
> simply uses io_uring passthrough and may support multiqueue in the
> future. (Note that even with a single queue, UBD already outperforms
> TCMU.)

MQ isn't hard to support, and it is basically workable now:

https://github.com/ming1/ubdsrv/commits/devel
https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel

Just the pthread affinity for each queue isn't set up yet.

> 
> 2) Some functions in UBD show high CPU utilization, and we guess they
> also lower throughput. For example, ubdsrv_submit_fetch_commands()
> frequently iterates over the array of UBD IOs and wastes CPU when no IO
> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
> data between bio vectors and UBD internal buffers while handling write
> and read requests; this could be eliminated by supporting zero-copy.

Copying itself doesn't take much CPU; see the following trace:

-   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0
   - 30.63% ubd_copy_pages.isra.0
      - 23.86% internal_get_user_pages_fast
         + 21.14% get_user_pages_unlocked
         + 2.62% lockless_pages_from_mm
        6.42% ubd_release_pages.constprop.0

And we may provide an option to pin pages for the disk's lifetime, to
avoid the cost in _get_user_pages_fast().

Zero-copy has to touch the page table, and its cost may be high too. The
big problem is that MM doesn't provide a mechanism for generically
remapping kernel pages to userspace.

> 
> Second, I'd like to share some ideas on UBD. I'm not sure if they are
> reasonable, so please point out my mistakes.
> 
> 1) UBD issues one sqe to commit the last completed request and fetch a
> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
> completes one cqe for the fetch command. We have measured that
> io_submit_sqes() costs some CPU, and the steps of building a new sqe may
> lower throughput. Here I'd like to propose a new solution: never submit
> an sqe, but fabricate a cqe (carrying the information of the new UBD IO
> request) when queue_rq() is called. I am inspired by one io_uring flag,
> IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
> an fd and repeatedly gets multiple cqes when new events occur. Does this
> solution break the architecture of UBD?

But each cqe has to be associated with one sqe, if I understand
correctly.

I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
And yes, batching is really important for UBD's performance.
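
For reference, a minimal liburing sketch of how IORING_POLL_ADD_MULTI
behaves (untested; needs a liburing with io_uring_prep_poll_multishot()
and kernel >= 5.13):

#include <poll.h>
#include <liburing.h>

/* Arm one multishot poll on fd: the single sqe keeps producing one cqe
 * per POLLIN event until it terminates (IORING_CQE_F_MORE cleared). */
static void poll_multishot(struct io_uring *ring, int fd)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;

        io_uring_prep_poll_multishot(sqe, fd, POLLIN);
        sqe->user_data = 0x1234;        /* cookie to match the cqes */
        io_uring_submit(ring);

        for (;;) {
                if (io_uring_wait_cqe(ring, &cqe))
                        break;
                /* cqe->res carries the returned poll mask */
                int more = cqe->flags & IORING_CQE_F_MORE;
                io_uring_cqe_seen(ring, cqe);
                if (!more)
                        break;  /* terminated; re-arm if still needed */
        }
}

So one sqe can indeed produce many cqes; whether that model fits the
commit-and-fetch cycle is exactly the open question.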

> 
> 2) UBDSRV (the userspace part) should not allocate data buffers itself.
> When an application configures many queues with a large iodepth, UBDSRV
> has to preallocate many buffers (size = 256 KiB each), which results in
> heavy memory overhead. I think data buffers should be allocated by
> applications

That is just virtual memory, and pages can be reclaimed after IO is
done.

> themselves and passed to UBDSRV; in this way UBD offers more
> flexibility. However, while handling a write request, the control flow
> returns to the kernel part again to set the buffer address and copy data
> from bio vectors. Would an ioctl help, to set the buffer address and
> copy the write data to the app buffer?

It is pretty easy to pass an application buffer to UBD_IO_FETCH_REQ or
UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr, which
is sent to the ubd driver via ubdsrv_io_cmd->addr.

No ioctl is needed; the io_uring command can handle everything.

I think the idea is good, and we can provide an option for using either
a pre-allocated buffer or an application buffer.

But the application buffer has to be in the same process VM space as the
ubdsrv daemon, otherwise pinning these application buffers/pages becomes
slower.
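
In ubdsrv that would look roughly like this (sketch only;
app_get_buffer() is a made-up application hook, and the exact field
layout may differ):

/* before committing the result for this tag, point the io slot at an
 * application-owned buffer; the address reaches the ubd driver via
 * ubdsrv_io_cmd->addr */
void *app_buf = app_get_buffer(q_id, tag, io_size);

q->ios[tag].buf_addr = (unsigned long)app_buf;
/* the next UBD_IO_COMMIT_AND_FETCH_REQ sqe for this tag now carries
 * app_buf, so the driver copies bio data to/from the app buffer */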

> 
> 3) ubd_fetch_and_submit() frequently iterates over the array of ubd IOs
> and wastes CPU when no IO is ready to be submitted. I think it can be
> optimized by adding a new array storing the UBD IOs that are ready to be
> committed back to the kernel part. Then we could batch these IOs and
> avoid unnecessary iterations over IOs which are not ready (still being
> fetched or handled by targets).

It should be easy to avoid iterating over the whole queue, but my perf
trace doesn't show ubd_fetch_and_submit() consuming much CPU.

> 
> 4) Zero-copy support is important and we are trying to implement it now.

I talked with Xiaoguang about zero-copy support, and it looks like it
isn't ready as a generic approach. Once it is ready, it will be easy to
integrate into UBD.

> 
> 5) Currently, UBD only supports the loop target with io_uring, and all
> the work (1. get one cqe; 2. issue the target io_uring IO; 3. get the
> target io_uring IO completion; 4. prepare one sqe) is done in one
> thread. As far as I know,

loop is just one example; it provides similar functionality to the
kernel loop driver in < 200 lines of userspace code.

> some applications such as SPDK, network filesystems and customized
> distributed systems do not support io_uring well. I think we should
> separate target IO handling from the UBDSRV loop and allow applications
> to handle target IOs themselves. Is this suggestion reasonable? (Or
> should UBD focus on io_uring-supported targets?)

UBD provides a framework for implementing userspace block drivers; you
can do everything for handling the IO in userspace. The target code just
needs to implement the callbacks defined in ubdsrv_tgt_type, so it is
already separated from the ubd loop. But UBD is still at an early stage,
and the interface will continue to be improved or redesigned. Can you
explain your ideas in a bit more detail? It would be very helpful if you
could provide some application background.

The reason I suggested using io_uring is that io_uring is very
efficient; async IO has also proven to be a very efficient approach for
handling IO.


Thanks
Ming



* Re: Follow up on UBD discussion
  2022-05-07 17:13     ` Ming Lei
@ 2022-05-09  4:05       ` Xiaoguang Wang
  2022-05-09  7:36         ` Ming Lei
  2022-05-09  8:11       ` Ziyang Zhang
  2022-05-10  4:55       ` Ziyang Zhang
  2 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-05-09  4:05 UTC (permalink / raw)
  To: Ming Lei, ZiyangZhang
  Cc: Gabriel Krisman Bertazi, joseph.qi, linux-block, Jens Axboe

hi,

> On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
>> On 2022/5/3 16:02, Ming Lei wrote:
>>> Hello Gabriel,
>>>
>>> CC linux-block and hope you don't mind, :-)
>>>
>>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
>>>> Hi Ming,
>>>>
>>>> First of all, I hope I didn't put you on the spot too much during the
>>>> discussion.  My original plan was to propose my design, but your
>>>> implementation largely answered the questions I had. :)
>>> I think that is the open source way: we can put our efforts together to
>>> make things better.
>>>
>>>> I'd like to follow up to restart the communication and see
>>>> where I can help more with UBD.  As I said during the talk, I've
>>>> done some fio runs and I was able to saturate NBD much faster than UBD:
>>>>
>>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
>>> Yeah, that is true, since NBD has extra socket communication cost and
>>> can't be as efficient as io_uring.
>>>
>>>> I've also written some fixes to the initialization path, which I
>>>> planned to send to you as soon as you published your code, but I think
>>>> you might want to take a look already and see if you want to just squash
>>>> it into your code base.
>>>>
>>>> I pushed those fixes here:
>>>>
>>>>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
>>> I have added the 1st fix and 3rd patch into my tree:
>>>
>>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
>>>
>>> The added check in the 2nd patch is done locklessly, which may not be
>>> reliable enough, so I didn't add it. Also, adding a device is a slow
>>> path, so there is no need to optimize that code path.
>>>
>>> I also cleaned up the ubd driver a bit: cleaned up debug code, removed
>>> the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
>>> ubd driver always built-in.
>>>
>>> ubdsrv part has been cleaned up too:
>>>
>>> https://github.com/ming1/ubdsrv
>>>
>>>> I'm looking into adding support for multiple driver queues next, and
>>>> should be able to share some patches on that shortly.
>>> OK, please post them on linux-block so that more eyes can look at the
>>> code; meanwhile the ubdsrv side needs to handle MQ too.
>>>
>>> Sooner or later, the single ubdsrv task may be saturated just by copying
>>> data and handling io_uring command communication, which can be shown by
>>> running IO against the ubd-null target. On my laptop, the ubdsrv CPU
>>> utilization is close to 90% when IOPS is > 500K. So MQ may help some
>>> fast backing cases.
>>>
>>>
>>> Thanks,
>>> Ming
>> Hi Ming,
>>
>> Now I am studying your userspace block driver (UBD) [1][2], and we plan
>> to replace TCMU with UBD as a new choice for implementing userspace
>> block devices, given its high performance and simplicity.
>>
>> First, we have conducted some tests with fio and perf to evaluate UBD.
>>
>> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
>> the complicated SCSI layer and does not support multiqueue, while UBD
>> simply uses io_uring passthrough and may support multiqueue in the
>> future. (Note that even with a single queue, UBD already outperforms
>> TCMU.)
> MQ isn't hard to support, and it is basically workable now:
>
> https://github.com/ming1/ubdsrv/commits/devel
> https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
>
> Just the pthread affinity for each queue isn't set up yet.
>
>> 2) Some functions in UBD show high CPU utilization, and we guess they
>> also lower throughput. For example, ubdsrv_submit_fetch_commands()
>> frequently iterates over the array of UBD IOs and wastes CPU when no IO
>> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
>> data between bio vectors and UBD internal buffers while handling write
>> and read requests; this could be eliminated by supporting zero-copy.
> Copying itself doesn't take much CPU; see the following trace:
>
> -   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0
>    - 30.63% ubd_copy_pages.isra.0
>       - 23.86% internal_get_user_pages_fast
>          + 21.14% get_user_pages_unlocked
>          + 2.62% lockless_pages_from_mm
>         6.42% ubd_release_pages.constprop.0
>
> And we may provide an option to pin pages for the disk's lifetime, to
> avoid the cost in _get_user_pages_fast().
>
> Zero-copy has to touch the page table, and its cost may be high too. The
> big problem is that MM doesn't provide a mechanism for generically
> remapping kernel pages to userspace.
>
>> Second, I'd like to share some ideas on UBD. I'm not sure if they are
>> reasonable, so please point out my mistakes.
>>
>> 1) UBD issues one sqe to commit the last completed request and fetch a
>> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
>> completes one cqe for the fetch command. We have measured that
>> io_submit_sqes() costs some CPU, and the steps of building a new sqe may
>> lower throughput. Here I'd like to propose a new solution: never submit
>> an sqe, but fabricate a cqe (carrying the information of the new UBD IO
>> request) when queue_rq() is called. I am inspired by one io_uring flag,
>> IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
>> an fd and repeatedly gets multiple cqes when new events occur. Does this
>> solution break the architecture of UBD?
> But each cqe has to be associated with one sqe, if I understand
> correctly.
Yeah, for the current io_uring implementation, it is. But suppose
io_uring offered a helper like the one below:
void io_gen_cqe_direct(struct file *file, u64 user_data, s32 res, u32 cflags)
{
        struct io_ring_ctx *ctx;
        ctx = file->private_data;

        spin_lock(&ctx->completion_lock);
        __io_fill_cqe(ctx, user_data, res, cflags);
        io_commit_cqring(ctx);
        spin_unlock(&ctx->completion_lock);
        io_cqring_ev_posted(ctx);
}

Then in the ubd driver:
1) device setup stage:
We attach io_uring files and user_data to every ubd hardware queue.

2) when blk-mq's queue_rq() is called:
io_gen_cqe_direct() is called in ubd's queue_rq(), and we put the ubd IO
request's qid and tag info into the cqe's res field; then we don't need
to issue an sqe to fetch io cmds.
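
For 2) it could look roughly like this (sketch only; the ubd_queue
fields are invented for illustration):

static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
        struct ubd_queue *ubq = hctx->driver_data;
        struct request *req = bd->rq;

        blk_mq_start_request(req);
        /* encode qid + tag into cqe->res; no sqe is ever submitted */
        io_gen_cqe_direct(ubq->ring_file, ubq->cmd_user_data,
                          (ubq->q_id << 16) | req->tag, 0);
        return BLK_STS_OK;
}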

Regards,
Xiaoguang Wang
>
> I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
> And yes, batching is really important for UBD's performance.
>
>> 2) UBDSRV (the userspace part) should not allocate data buffers itself.
>> When an application configures many queues with a large iodepth, UBDSRV
>> has to preallocate many buffers (size = 256 KiB each), which results in
>> heavy memory overhead. I think data buffers should be allocated by
>> applications
> That is just virtual memory, and pages can be reclaimed after IO is
> done.
>
>> themselves and passed to UBDSRV; in this way UBD offers more
>> flexibility. However, while handling a write request, the control flow
>> returns to the kernel part again to set the buffer address and copy data
>> from bio vectors. Would an ioctl help, to set the buffer address and
>> copy the write data to the app buffer?
> It is pretty easy to pass an application buffer to UBD_IO_FETCH_REQ or
> UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr, which
> is sent to the ubd driver via ubdsrv_io_cmd->addr.
>
> No ioctl is needed; the io_uring command can handle everything.
>
> I think the idea is good, and we can provide an option for using either
> a pre-allocated buffer or an application buffer.
>
> But the application buffer has to be in the same process VM space as the
> ubdsrv daemon, otherwise pinning these application buffers/pages becomes
> slower.
>
>> 3) ubd_fetch_and_submit() frequently iterates over the array of ubd IOs
>> and wastes CPU when no IO is ready to be submitted. I think it can be
>> optimized by adding a new array storing the UBD IOs that are ready to be
>> committed back to the kernel part. Then we could batch these IOs and
>> avoid unnecessary iterations over IOs which are not ready (still being
>> fetched or handled by targets).
> It should be easy to avoid iterating over the whole queue, but my perf
> trace doesn't show ubd_fetch_and_submit() consuming much CPU.
>
>> 4) Zero-copy support is important and we are trying to implement it now.
> I talked with Xiaoguang about zero-copy support, and it looks like it
> isn't ready as a generic approach. Once it is ready, it will be easy to
> integrate into UBD.
>
>> 5) Currently, UBD only supports the loop target with io_uring, and all
>> the work (1. get one cqe; 2. issue the target io_uring IO; 3. get the
>> target io_uring IO completion; 4. prepare one sqe) is done in one
>> thread. As far as I know,
> loop is just one example; it provides similar functionality to the
> kernel loop driver in < 200 lines of userspace code.
>
>> some applications such as SPDK, network filesystems and customized
>> distributed systems do not support io_uring well. I think we should
>> separate target IO handling from the UBDSRV loop and allow applications
>> to handle target IOs themselves. Is this suggestion reasonable? (Or
>> should UBD focus on io_uring-supported targets?)
> UBD provides a framework for implementing userspace block drivers; you
> can do everything for handling the IO in userspace. The target code just
> needs to implement the callbacks defined in ubdsrv_tgt_type, so it is
> already separated from the ubd loop. But UBD is still at an early stage,
> and the interface will continue to be improved or redesigned. Can you
> explain your ideas in a bit more detail? It would be very helpful if you
> could provide some application background.
>
> The reason I suggested using io_uring is that io_uring is very
> efficient; async IO has also proven to be a very efficient approach for
> handling IO.
>
>
> Thanks
> Ming



* Re: Follow up on UBD discussion
  2022-05-09  4:05       ` Xiaoguang Wang
@ 2022-05-09  7:36         ` Ming Lei
  2022-05-09 11:53           ` Xiaoguang Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-05-09  7:36 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: ZiyangZhang, Gabriel Krisman Bertazi, joseph.qi, linux-block, Jens Axboe

On Mon, May 09, 2022 at 12:05:54PM +0800, Xiaoguang Wang wrote:
> hi,
> 
> > On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
> >> On 2022/5/3 16:02, Ming Lei wrote:
> >>> Hello Gabriel,
> >>>
> >>> CC linux-block and hope you don't mind, :-)
> >>>
> >>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> >>>> Hi Ming,
> >>>>
> >>>> First of all, I hope I didn't put you on the spot too much during the
> >>>> discussion.  My original plan was to propose my design, but your
> >>>> implementation largely answered the questions I had. :)
> >>> I think that is the open source way: we can put our efforts together to
> >>> make things better.
> >>>
> >>>> I'd like to follow up to restart the communication and see
> >>>> where I can help more with UBD.  As I said during the talk, I've
> >>>> done some fio runs and I was able to saturate NBD much faster than UBD:
> >>>>
> >>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> >>> Yeah, that is true, since NBD has extra socket communication cost and
> >>> can't be as efficient as io_uring.
> >>>
> >>>> I've also written some fixes to the initialization path, which I
> >>>> planned to send to you as soon as you published your code, but I think
> >>>> you might want to take a look already and see if you want to just squash
> >>>> it into your code base.
> >>>>
> >>>> I pushed those fixes here:
> >>>>
> >>>>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> >>> I have added the 1st fix and 3rd patch into my tree:
> >>>
> >>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
> >>>
> >>> The added check in the 2nd patch is done locklessly, which may not be
> >>> reliable enough, so I didn't add it. Also, adding a device is a slow
> >>> path, so there is no need to optimize that code path.
> >>>
> >>> I also cleaned up the ubd driver a bit: cleaned up debug code, removed
> >>> the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
> >>> ubd driver always built-in.
> >>>
> >>> ubdsrv part has been cleaned up too:
> >>>
> >>> https://github.com/ming1/ubdsrv
> >>>
> >>>> I'm looking into adding support for multiple driver queues next, and
> >>>> should be able to share some patches on that shortly.
> >>> OK, please post them on linux-block so that more eyes can look at the
> >>> code; meanwhile the ubdsrv side needs to handle MQ too.
> >>>
> >>> Sooner or later, the single ubdsrv task may be saturated just by copying
> >>> data and handling io_uring command communication, which can be shown by
> >>> running IO against the ubd-null target. On my laptop, the ubdsrv CPU
> >>> utilization is close to 90% when IOPS is > 500K. So MQ may help some
> >>> fast backing cases.
> >>>
> >>>
> >>> Thanks,
> >>> Ming
> >> Hi Ming,
> >>
> >> Now I am studying your userspace block driver (UBD) [1][2], and we plan
> >> to replace TCMU with UBD as a new choice for implementing userspace
> >> block devices, given its high performance and simplicity.
> >>
> >> First, we have conducted some tests with fio and perf to evaluate UBD.
> >>
> >> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
> >> the complicated SCSI layer and does not support multiqueue, while UBD
> >> simply uses io_uring passthrough and may support multiqueue in the
> >> future. (Note that even with a single queue, UBD already outperforms
> >> TCMU.)
> > MQ isn't hard to support, and it is basically workable now:
> >
> > https://github.com/ming1/ubdsrv/commits/devel
> > https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
> >
> > Just the pthread affinity for each queue isn't set up yet.
> >
> >> 2) Some functions in UBD show high CPU utilization, and we guess they
> >> also lower throughput. For example, ubdsrv_submit_fetch_commands()
> >> frequently iterates over the array of UBD IOs and wastes CPU when no IO
> >> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
> >> data between bio vectors and UBD internal buffers while handling write
> >> and read requests; this could be eliminated by supporting zero-copy.
> > Copying itself doesn't take much CPU; see the following trace:
> >
> > -   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0
> >    - 30.63% ubd_copy_pages.isra.0
> >       - 23.86% internal_get_user_pages_fast
> >          + 21.14% get_user_pages_unlocked
> >          + 2.62% lockless_pages_from_mm
> >         6.42% ubd_release_pages.constprop.0
> >
> > And we may provide an option to pin pages for the disk's lifetime, to
> > avoid the cost in _get_user_pages_fast().
> >
> > Zero-copy has to touch the page table, and its cost may be high too. The
> > big problem is that MM doesn't provide a mechanism for generically
> > remapping kernel pages to userspace.
> >
> >> Second, I'd like to share some ideas on UBD. I'm not sure if they are
> >> reasonable, so please point out my mistakes.
> >>
> >> 1) UBD issues one sqe to commit the last completed request and fetch a
> >> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
> >> completes one cqe for the fetch command. We have measured that
> >> io_submit_sqes() costs some CPU, and the steps of building a new sqe may
> >> lower throughput. Here I'd like to propose a new solution: never submit
> >> an sqe, but fabricate a cqe (carrying the information of the new UBD IO
> >> request) when queue_rq() is called. I am inspired by one io_uring flag,
> >> IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
> >> an fd and repeatedly gets multiple cqes when new events occur. Does this
> >> solution break the architecture of UBD?
> > But each cqe has to be associated with one sqe, if I understand
> > correctly.
> Yeah, for the current io_uring implementation, it is. But suppose
> io_uring offered a helper like the one below:
> void io_gen_cqe_direct(struct file *file, u64 user_data, s32 res, u32 cflags)
> {
>         struct io_ring_ctx *ctx;
>         ctx = file->private_data;
> 
>         spin_lock(&ctx->completion_lock);
>         __io_fill_cqe(ctx, user_data, res, cflags);
>         io_commit_cqring(ctx);
>         spin_unlock(&ctx->completion_lock);
>         io_cqring_ev_posted(ctx);
> }
> 
> Then in the ubd driver:
> 1) device setup stage:
> We attach io_uring files and user_data to every ubd hardware queue.
>
> 2) when blk-mq's queue_rq() is called:
> io_gen_cqe_direct() is called in ubd's queue_rq(), and we put the ubd IO
> request's qid and tag info into the cqe's res field; then we don't need
> to issue an sqe to fetch io cmds.

The above approach actually goes against io_uring's design, and I don't
think it would improve much, since submitting UBD_IO_COMMIT_AND_FETCH_REQ
is pretty lightweight.

Also, without re-submitting the UBD_IO_COMMIT_AND_FETCH_REQ command, how
can you commit the IO handling result from the ubd server and ask the
ubd driver to complete the IO request?


Thanks, 
Ming



* Re: Follow up on UBD discussion
  2022-05-07 17:13     ` Ming Lei
  2022-05-09  4:05       ` Xiaoguang Wang
@ 2022-05-09  8:11       ` Ziyang Zhang
  2022-05-09 11:24         ` Ming Lei
  2022-05-10  4:55       ` Ziyang Zhang
  2 siblings, 1 reply; 13+ messages in thread
From: Ziyang Zhang @ 2022-05-09  8:11 UTC (permalink / raw)
  To: Ming Lei; +Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On 2022/5/8 01:13, Ming Lei wrote:
> On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
>> On 2022/5/3 16:02, Ming Lei wrote:
>>> Hello Gabriel,
>>>
>>> CC linux-block and hope you don't mind, :-)
>>>
>>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
>>>>
>>>> Hi Ming,
>>>>
>>>> First of all, I hope I didn't put you on the spot too much during the
>>>> discussion.  My original plan was to propose my design, but your
>>>> implementation largely answered the questions I had. :)
>>>
>>> I think that is the open source way: we can put our efforts together to
>>> make things better.
>>>
>>>>
>>>> I'd like to follow up to restart the communication and see
>>>> where I can help more with UBD.  As I said during the talk, I've
>>>> done some fio runs and I was able to saturate NBD much faster than UBD:
>>>>
>>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
>>>
>>> Yeah, that is true, since NBD has extra socket communication cost and
>>> can't be as efficient as io_uring.
>>>
>>>>
>>>> I've also written some fixes to the initialization path, which I
>>>> planned to send to you as soon as you published your code, but I think
>>>> you might want to take a look already and see if you want to just squash
>>>> it into your code base.
>>>>
>>>> I pushed those fixes here:
>>>>
>>>>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
>>>
>>> I have added the 1st fix and 3rd patch into my tree:
>>>
>>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
>>>
>>> The added check in the 2nd patch is done locklessly, which may not be
>>> reliable enough, so I didn't add it. Also, adding a device is a slow
>>> path, so there is no need to optimize that code path.
>>>
>>> I also cleaned up the ubd driver a bit: cleaned up debug code, removed
>>> the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
>>> ubd driver always built-in.
>>>
>>> ubdsrv part has been cleaned up too:
>>>
>>> https://github.com/ming1/ubdsrv
>>>
>>>>
>>>> I'm looking into adding support for multiple driver queues next, and
>>>> should be able to share some patches on that shortly.
>>>
>>> OK, please post them on linux-block so that more eyes can look at the
>>> code; meanwhile the ubdsrv side needs to handle MQ too.
>>>
>>> Sooner or later, the single ubdsrv task may be saturated just by copying
>>> data and handling io_uring command communication, which can be shown by
>>> running IO against the ubd-null target. On my laptop, the ubdsrv CPU
>>> utilization is close to 90% when IOPS is > 500K. So MQ may help some
>>> fast backing cases.
>>>
>>>
>>> Thanks,
>>> Ming
>>
>> Hi Ming,
>>
>> Now I am studying your userspace block driver (UBD) [1][2], and we plan
>> to replace TCMU with UBD as a new choice for implementing userspace
>> block devices, given its high performance and simplicity.
>>
>> First, we have conducted some tests with fio and perf to evaluate UBD.
>>
>> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
>> the complicated SCSI layer and does not support multiqueue, while UBD
>> simply uses io_uring passthrough and may support multiqueue in the
>> future. (Note that even with a single queue, UBD already outperforms
>> TCMU.)
> 
> MQ isn't hard to support, and it is basically workable now:
> 
> https://github.com/ming1/ubdsrv/commits/devel
> https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
> 
> Just the pthread affinity for each queue isn't set up yet.

Thanks Ming, I will try your new code.

> 
>>
>> 2) Some functions in UBD show high CPU utilization, and we guess they
>> also lower throughput. For example, ubdsrv_submit_fetch_commands()
>> frequently iterates over the array of UBD IOs and wastes CPU when no IO
>> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
>> data between bio vectors and UBD internal buffers while handling write
>> and read requests; this could be eliminated by supporting zero-copy.
> 
> Copying itself doesn't take much CPU; see the following trace:
>
> -   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0
>    - 30.63% ubd_copy_pages.isra.0
>       - 23.86% internal_get_user_pages_fast
>          + 21.14% get_user_pages_unlocked
>          + 2.62% lockless_pages_from_mm
>         6.42% ubd_release_pages.constprop.0
I got the following trace:
-   26.49%   2.26%  ubd    [ubd]    [k] ubd_commit_completion
   - 24.24% ubd_commit_completion
      - 24.00% ubd_copy_pages.isra.0
         - 10.73% internal_get_user_pages_fast
            - 9.05% get_user_pages_unlocked
               - 8.92% __get_user_pages
                  - 5.61% follow_page_pte
                     - 1.83% folio_mark_accessed
                          0.70% __lru_cache_activate_folio
                     - 1.25% _raw_spin_lock
                          0.61% preempt_count_add
                       0.91% try_grab_page
                    1.02% follow_pmd_mask.isra.0
                    0.58% follow_page_mask
                    0.52% follow_pud_mask
            - 1.35% gup_pud_range.constprop.0
                 1.22% gup_pmd_range.constprop.0
         - 7.71% memcpy_erms
            - 0.68% asm_common_interrupt
               - 0.67% common_interrupt
                  - 0.62% __common_interrupt
                     - 0.60% handle_edge_irq
                        - 0.53% handle_irq_event
                           - 0.51% __handle_irq_event_percpu
                                vring_interrupt
           4.07% ubd_release_pages.constprop.0


> 
> And we may provide an option to pin pages for the disk's lifetime, to
> avoid the cost in _get_user_pages_fast().
>
> Zero-copy has to touch the page table, and its cost may be high too. The
> big problem is that MM doesn't provide a mechanism for generically
> remapping kernel pages to userspace.
> 
>>
>> Second, I'd like to share some ideas on UBD. I'm not sure if they are
>> reasonable, so please point out my mistakes.
>>
>> 1) UBD issues one sqe to commit the last completed request and fetch a
>> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
>> completes one cqe for the fetch command. We have measured that
>> io_submit_sqes() costs some CPU, and the steps of building a new sqe may
>> lower throughput. Here I'd like to propose a new solution: never submit
>> an sqe, but fabricate a cqe (carrying the information of the new UBD IO
>> request) when queue_rq() is called. I am inspired by one io_uring flag,
>> IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
>> an fd and repeatedly gets multiple cqes when new events occur. Does this
>> solution break the architecture of UBD?
> 
> But each cqe has to be associated with one sqe, if I understand
> correctly.
> 
> I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
> And yes, batching is really important for UBD's performance.
> 
>>
>> 2) UBDSRV (the userspace part) should not allocate data buffers itself.
>> When an application configures many queues with a large iodepth, UBDSRV
>> has to preallocate many buffers (size = 256 KiB each), which results in
>> heavy memory overhead. I think data buffers should be allocated by
>> applications
> 
> That is just virtual memory, and pages can be reclaimed after IO is
> done.
> 
>> themselves and passed to UBDSRV; in this way UBD offers more
>> flexibility. However, while handling a write request, the control flow
>> returns to the kernel part again to set the buffer address and copy data
>> from bio vectors. Would an ioctl help, to set the buffer address and
>> copy the write data to the app buffer?
> 
> It is pretty easy to pass an application buffer to UBD_IO_FETCH_REQ or
> UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr, which
> is sent to the ubd driver via ubdsrv_io_cmd->addr.

Maybe an app does not allocate a data buffer until it gets a UBD IO
request, because it does not know the data size and pre-allocation is
forbidden. In that case, UBD_IO_FETCH_REQ or UBD_IO_COMMIT_AND_FETCH_REQ
are not helpful.

> 
> No ioctl is needed; the io_uring command can handle everything.
>
> I think the idea is good, and we can provide an option for using either
> a pre-allocated buffer or an application buffer.
>
> But the application buffer has to be in the same process VM space as the
> ubdsrv daemon, otherwise pinning these application buffers/pages becomes
> slower.
> 
>>
>> 3) ubd_fetch_and_submit() frequently iterates over the array of ubd IOs
>> and wastes CPU when no IO is ready to be submitted. I think it can be
>> optimized by adding a new array storing the UBD IOs that are ready to be
>> committed back to the kernel part. Then we could batch these IOs and
>> avoid unnecessary iterations over IOs which are not ready (still being
>> fetched or handled by targets).
> 
> It should be easy to avoid iterating over the whole queue, but my perf
> trace doesn't show ubd_fetch_and_submit() consuming much CPU.

My trace:
-    2.63%     1.71%  ubd     ubd    [.] ubdsrv_submit_fetch_commands
   - 1.71% 0x495641000034d33d
        __libc_start_main
        main
        cmd_dev_add
        ubdsrv_start_dev
        ubdsrv_start_io_daemon
      - start_daemon
         - 1.70% ubdsrv_io_handler
              ubdsrv_submit_fetch_commands
     0.92% ubdsrv_submit_fetch_commands

> 
>>
>> 4) Zero-copy support is important and we are trying to implement it now.
> 
> I talked with Xiaoguang about zero-copy support, and it looks like it
> isn't ready as a generic approach. Once it is ready, it will be easy to
> integrate into UBD.
> 
>>
>> 5) Currently, UBD only supports the loop target with io_uring, and all
>> the work (1. get one cqe; 2. issue the target io_uring IO; 3. get the
>> target io_uring IO completion; 4. prepare one sqe) is done in one
>> thread. As far as I know,
> 
> loop is just one example; it provides similar functionality to the
> kernel loop driver in < 200 lines of userspace code.
> 
>> some applications such as SPDK, network filesystems and customized
>> distributed systems do not support io_uring well. I think we should
>> separate target IO handling from the UBDSRV loop and allow applications
>> to handle target IOs themselves. Is this suggestion reasonable? (Or
>> should UBD focus on io_uring-supported targets?)
> 
> UBD provides a framework for implementing userspace block drivers; you
> can do everything for handling the IO in userspace. The target code just
> needs to implement the callbacks defined in ubdsrv_tgt_type, so it is
> already separated from the ubd loop. But UBD is still at an early stage,
> and the interface will continue to be improved or redesigned. Can you
> explain your ideas in a bit more detail? It would be very helpful if you
> could provide some application background.

The app:
1) A separate worker thread per queue (not the UBD daemon thread per
queue) gets one new UBD IO (from blk-mq), with io_size, io_off and tag,
via an API provided by UBDSRV such as ubdsrv_get_new_req.

2) It handles the new IO (e.g. app_write_req_handle,
app_read_req_handle) in one of the worker threads (not in the UBD
daemon thread).

3) Once an IO completes, the worker thread notifies the UBD daemon
thread. (A rough sketch of this flow follows below.)
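
A rough sketch of that flow (all the ubdsrv_* APIs and structs here are
imagined, not existing ones; an eventfd is just one possible way to wake
the daemon):

#include <sys/eventfd.h>

static void *app_worker(void *arg)
{
        struct app_queue *aq = arg;
        struct ubd_req r;

        for (;;) {
                if (ubdsrv_get_new_req(aq->q, &r) < 0)  /* step 1) */
                        break;
                app_handle_io(&r);                      /* step 2) */
                ubdsrv_complete_req(aq->q, r.tag);      /* step 3)... */
                eventfd_write(aq->efd, 1);              /* ...wake the daemon */
        }
        return NULL;
}

The daemon thread would then pick up the completed tags and issue
UBD_IO_COMMIT_AND_FETCH_REQ for them.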

Currently, I find:
	if (cqe->user_data & (1ULL << 63)) {
		ubdsrv_handle_tgt_cqe(dev, q, cqe);
		return;
	}
in ubdsrv.c:ubdsrv_handle_cqe. This assumes a target uses io_uring to
handle IO requests, but the app above does not support io_uring. Maybe
we should re-design the IO handling logic.

> 
> The reason I suggested using io_uring is that io_uring is very
> efficient; async IO has also proven to be a very efficient approach for
> handling IO.
> 
> 
> Thanks
> Ming


* Re: Follow up on UBD discussion
  2022-05-09  8:11       ` Ziyang Zhang
@ 2022-05-09 11:24         ` Ming Lei
  2022-05-10  4:46           ` Ziyang Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-05-09 11:24 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On Mon, May 09, 2022 at 04:11:47PM +0800, Ziyang Zhang wrote:
> On 2022/5/8 01:13, Ming Lei wrote:
> > On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
> >> On 2022/5/3 16:02, Ming Lei wrote:
> >>> Hello Gabriel,
> >>>
> >>> CC linux-block and hope you don't mind, :-)
> >>>
> >>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> >>>>
> >>>> Hi Ming,
> >>>>
> >>>> First of all, I hope I didn't put you on the spot too much during the
> >>>> discussion.  My original plan was to propose my design, but your
> >>>> implementation largely answered the questions I had. :)
> >>>
> >>> I think that is the open source way: we can put our efforts together to
> >>> make things better.
> >>>
> >>>>
> >>>> I'd like to follow up to restart the communication and see
> >>>> where I can help more with UBD.  As I said during the talk, I've
> >>>> done some fio runs and I was able to saturate NBD much faster than UBD:
> >>>>
> >>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> >>>
> >>> Yeah, that is true, since NBD has extra socket communication cost and
> >>> can't be as efficient as io_uring.
> >>>
> >>>>
> >>>> I've also written some fixes to the initialization path, which I
> >>>> planned to send to you as soon as you published your code, but I think
> >>>> you might want to take a look already and see if you want to just squash
> >>>> it into your code base.
> >>>>
> >>>> I pushed those fixes here:
> >>>>
> >>>>   https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> >>>
> >>> I have added the 1st fix and 3rd patch into my tree:
> >>>
> >>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
> >>>
> >>> The added check in the 2nd patch is done locklessly, which may not be
> >>> reliable enough, so I didn't add it. Also, adding a device is a slow
> >>> path, so there is no need to optimize that code path.
> >>>
> >>> I also cleaned up the ubd driver a bit: cleaned up debug code, removed
> >>> the zero-copy code, removed the UBD_IO_GET_DATA command, and made the
> >>> ubd driver always built-in.
> >>>
> >>> ubdsrv part has been cleaned up too:
> >>>
> >>> https://github.com/ming1/ubdsrv
> >>>
> >>>>
> >>>> I'm looking into adding support for multiple driver queues next, and
> >>>> should be able to share some patches on that shortly.
> >>>
> >>> OK, please post them on linux-block so that more eyes can look at the
> >>> code; meanwhile the ubdsrv side needs to handle MQ too.
> >>>
> >>> Sooner or later, the single ubdsrv task may be saturated just by copying
> >>> data and handling io_uring command communication, which can be shown by
> >>> running IO against the ubd-null target. On my laptop, the ubdsrv CPU
> >>> utilization is close to 90% when IOPS is > 500K. So MQ may help some
> >>> fast backing cases.
> >>>
> >>>
> >>> Thanks,
> >>> Ming
> >>
> >> Hi Ming,
> >>
> >> Now I am studying your userspace block driver (UBD) [1][2], and we plan
> >> to replace TCMU with UBD as a new choice for implementing userspace
> >> block devices, given its high performance and simplicity.
> >>
> >> First, we have conducted some tests with fio and perf to evaluate UBD.
> >>
> >> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
> >> the complicated SCSI layer and does not support multiqueue, while UBD
> >> simply uses io_uring passthrough and may support multiqueue in the
> >> future. (Note that even with a single queue, UBD already outperforms
> >> TCMU.)
> > 
> > MQ isn't hard to support, and it is basically workable now:
> > 
> > https://github.com/ming1/ubdsrv/commits/devel
> > https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
> > 
> > Just the pthread affinity for each queue isn't set up yet.
> 
> Thanks Ming, I will try your new code.
> 
> > 
> >>
> >> 2) Some functions in UBD show high CPU utilization, and we guess they
> >> also lower throughput. For example, ubdsrv_submit_fetch_commands()
> >> frequently iterates over the array of UBD IOs and wastes CPU when no IO
> >> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
> >> data between bio vectors and UBD internal buffers while handling write
> >> and read requests; this could be eliminated by supporting zero-copy.
> > 
> > Copying itself doesn't take much CPU; see the following trace:
> >
> > -   34.36%     3.73%  ubd              [kernel.kallsyms]             [k] ubd_copy_pages.isra.0
> >    - 30.63% ubd_copy_pages.isra.0
> >       - 23.86% internal_get_user_pages_fast
> >          + 21.14% get_user_pages_unlocked
> >          + 2.62% lockless_pages_from_mm
> >         6.42% ubd_release_pages.constprop.0
> I got the following trace:
> -   26.49%   2.26%  ubd    [ubd]    [k] ubd_commit_completion
>    - 24.24% ubd_commit_completion
>       - 24.00% ubd_copy_pages.isra.0
>          - 10.73% internal_get_user_pages_fast
>             - 9.05% get_user_pages_unlocked
>                - 8.92% __get_user_pages
>                   - 5.61% follow_page_pte
>                      - 1.83% folio_mark_accessed
>                           0.70% __lru_cache_activate_folio
>                      - 1.25% _raw_spin_lock
>                           0.61% preempt_count_add
>                        0.91% try_grab_page
>                     1.02% follow_pmd_mask.isra.0
>                     0.58% follow_page_mask
>                     0.52% follow_pud_mask
>             - 1.35% gup_pud_range.constprop.0
>                  1.22% gup_pmd_range.constprop.0
>          - 7.71% memcpy_erms
>             - 0.68% asm_common_interrupt
>                - 0.67% common_interrupt
>                   - 0.62% __common_interrupt
>                      - 0.60% handle_edge_irq
>                         - 0.53% handle_irq_event
>                            - 0.51% __handle_irq_event_percpu
>                                 vring_interrupt
>            4.07% ubd_release_pages.constprop.0

OK, but internal_get_user_pages_fast and ubd_release_pages together
still take ~15%, and memcpy_erms() is 7.7%. But MM zero-copy isn't
ready, so the memcpy can't be avoided; also, zero-copy has other costs,
which may be significant too.

One approach I thought of for reducing the cost of pinning pages is to
release pages lazily: for example, release a request's pages only when
the request has been idle long enough, and meanwhile free pages in LRU
order when the total number of pinned pages goes beyond the maximum
allowed. IMO, this approach should bring a good improvement, but it
needs pre-allocation of the userspace buffer, and the implementation
shouldn't be very hard. (A rough sketch follows below.)
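
Roughly like this (sketch only, with hypothetical structures;
unpin_user_pages() is the real kernel API for dropping pinned pages):

struct ubd_pinned_buf {
        struct page **pages;
        unsigned int nr_pages;
        unsigned long last_used;        /* jiffies of the last IO on this slot */
        struct list_head lru;           /* oldest-first LRU of pinned slots */
};

static void ubd_reclaim_pins(struct list_head *lru_list,
                             unsigned long *total, unsigned long max_pinned,
                             unsigned long idle_jiffies)
{
        struct ubd_pinned_buf *b, *tmp;

        list_for_each_entry_safe(b, tmp, lru_list, lru) {
                /* stop once we are under the cap and the oldest entry
                 * is not idle yet; everything after it is newer */
                if (*total <= max_pinned &&
                    time_before(jiffies, b->last_used + idle_jiffies))
                        break;
                unpin_user_pages(b->pages, b->nr_pages);
                *total -= b->nr_pages;
                list_del_init(&b->lru);
        }
}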

> 
> 
> > 
> > And we may provide an option to pin pages for the disk's lifetime, to
> > avoid the cost in _get_user_pages_fast().
> >
> > Zero-copy has to touch the page table, and its cost may be high too. The
> > big problem is that MM doesn't provide a mechanism for generically
> > remapping kernel pages to userspace.
> > 
> >>
> >> Second, I'd like to share some ideas on UBD. I'm not sure if they are
> >> reasonable, so please point out my mistakes.
> >>
> >> 1) UBD issues one sqe to commit the last completed request and fetch a
> >> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
> >> completes one cqe for the fetch command. We have measured that
> >> io_submit_sqes() costs some CPU, and the steps of building a new sqe may
> >> lower throughput. Here I'd like to propose a new solution: never submit
> >> an sqe, but fabricate a cqe (carrying the information of the new UBD IO
> >> request) when queue_rq() is called. I am inspired by one io_uring flag,
> >> IORING_POLL_ADD_MULTI, with which a user issues only one sqe for polling
> >> an fd and repeatedly gets multiple cqes when new events occur. Does this
> >> solution break the architecture of UBD?
> > 
> > But each cqe has to be associated with one sqe, if I understand
> > correctly.
> > 
> > I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
> > And yes, batching is really important for UBD's performance.
> > 
> >>
> >> 2) UBDSRV(the userspace part) should not allocate data buffers itself.
> >> When an application configs many queues with bigger iodepth, UBDSRV has
> >> to preallocate more buffers(size = 256KiB) and results in heavy memory
> >> overhead. I think data buffers should be allocated by applications
> > 
> > That is just virtual memory, and pages can be reclaimed after IO is
> > done.
> > 
> >> themselves and passed to UBDSRV. In this way UBD offers more
> >> flexibility. However, while handling a write request, the control flow
> >> returns to the kernel part again to set buf addr and copy data from bio
> >> vectors. Is ioctl helpful by setting buf addr and copying write data to
> >> app buf?
> > 
> > It is pretty easy to pass application buffer to UBD_IO_FETCH_REQ or
> > UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr which
> > is sent to ubd driver via ubdsrv_io_cmd->addr.
> 
> Maybe one app does not allocate a data buffer until it gets a UBD IO request
> because it does not know the data size and pre-allocation is forbidden.
> In this way, UBD_IO_FETCH_REQ or UBD_IO_COMMIT_AND_FETCH_REQ are not
> helpful.

The userspace buffer from the app can be seen by the ubd driver when
UBD_IO_COMMIT_AND_FETCH_REQ is sent to it.
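
For illustration, a tiny sketch of that override on the ubdsrv side;
ios[i].buf_addr and ubdsrv_io_cmd->addr come from the discussion above, while
the helper name and queue fields here are made up:

/* Point one tag at an application buffer before queuing the command;
 * buf_addr is later carried to the driver via ubdsrv_io_cmd->addr. */
static void set_app_buffer(struct ubdsrv_queue *q, int tag, void *app_buf)
{
	q->ios[tag].buf_addr = (unsigned long)app_buf;
}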

> [...]
> >>
> >> 3) ubd_fetch_and_submit() frequently iterates on the array of ubd IOs
> >> and wastes CPU when no IO is ready to be submitted. I think it can be
> >> optimized by adding a new array storing UBD IOs that are ready to be
> >> commit back to the kernel part. Then we could batch these IOs and avoid
> >> unnecessary iterations on IOs which are not ready(fetching or handling
> >> by targets).
> > 
> > That should be easy to avoid the whole queue iteration, but my perf
> > trace doesn't show ubd_fetch_and_submit() consumes too much CPU.
> 
> My trace:
> -    2.63%     1.71%  ubd     ubd    [.] ubdsrv_submit_fetch_commands
>    - 1.71% 0x495641000034d33d
>         __libc_start_main
>         main
>         cmd_dev_add
>         ubdsrv_start_dev
>         ubdsrv_start_io_daemon
>       - start_daemon
>          - 1.70% ubdsrv_io_handler
>               ubdsrv_submit_fetch_commands
>      0.92% ubdsrv_submit_fetch_commands

Looks like it is still not heavy enough, but we can use one per-queue/thread
bitmap to track the IOs which need to be queued to the ubd driver, and the
change should be easy.
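
A rough sketch of what such a bitmap could look like, assuming a queue depth
of 128; none of this is existing ubdsrv code, and the queue_one() callback
stands in for whatever prepares the per-tag sqe:

#include <limits.h>

#define UBD_QD		128	/* assumed queue depth */
#define LONG_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS	((UBD_QD + LONG_BITS - 1) / LONG_BITS)

/* Per-queue bitmap of tags that have a command ready to queue. */
struct ready_map {
	unsigned long words[MAP_WORDS];
};

static inline void ready_set(struct ready_map *m, unsigned int tag)
{
	m->words[tag / LONG_BITS] |= 1UL << (tag % LONG_BITS);
}

/* Visit only the ready tags instead of iterating the whole queue. */
static int submit_ready(struct ready_map *m,
			int (*queue_one)(unsigned int tag, void *arg),
			void *arg)
{
	int count = 0;

	for (unsigned int w = 0; w < MAP_WORDS; w++) {
		unsigned long bits = m->words[w];

		m->words[w] = 0;
		while (bits) {
			unsigned int tag = w * LONG_BITS + __builtin_ctzl(bits);

			bits &= bits - 1;	/* clear the lowest set bit */
			if (queue_one(tag, arg) == 0)
				count++;
		}
	}
	return count;
}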

> [...]
> 
> The app:
> 1) Another worker thread per queue (not the UBD daemon thread per
> queue) gets (by one API provided by UBDSRV, such as ubdsrv_get_new_req) one
> new UBD IO (from blk-mq) with io_size, io_off and tag.
> 
> 2) It handles the new IO (e.g. app_write_req_handle, app_read_req_handle) in
> one of the worker threads (not in the UBD daemon thread).
> 
> 3) One IO completes and the worker thread should notify the UBD daemon
> thread.
> 
> Currently, I find:
> 	if (cqe->user_data & (1ULL << 63)) {
> 		ubdsrv_handle_tgt_cqe(dev, q, cqe);
> 		return;
> 	}
> in ubdsrv.c:ubdsrv_handle_cqe. This assumes a target should use io_uring
> to handle IO requests, and the app above does not support io_uring.
> Maybe we should re-design the IO handling logic.

You can export one helper, ubdsrv_complete_io(), for the target code:

void ubdsrv_complete_io(struct ubdsrv_dev *dev,
        struct ubdsrv_queue *q, int tag, int res)
{
        struct ubd_io *io = &q->ios[tag];

        io->result = res;

        /* Mark this IO as free and ready for issuing to ubd driver */
        io->flags |= (UBDSRV_NEED_COMMIT_RQ_COMP | UBDSRV_IO_FREE);

        /* clear handling */
        io->flags &= ~UBDSRV_IO_HANDLING;
}

Then the target code calls ubdsrv_complete_io(), but you have to add a new
way to wake up the ubdsrv pthread daemon, which may be waiting in io_uring_enter().
That is why I still suggest trying to use io_uring for handling IO.

If you really have a high performance requirement, and if the cost of pinning
pages in the ubdsrv pthread daemon were saved, maybe you can run your
app_write_req_handle()/app_read_req_handle() directly in the ubdsrv pthread
context, provided they are implemented in AIO style.
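
One possible wakeup mechanism, sketched below under the assumption that the
daemon registers an eventfd in its io_uring; this is not existing ubdsrv code,
and the user_data tag is made up (the thread above already reserves bit 63 for
target cqes):

#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

#define WAKEUP_USER_DATA (1ULL << 62)	/* assumed tag for wakeup cqes */

/* Daemon side: keep one read sqe on the eventfd in flight, so a write
 * to the eventfd completes a cqe and wakes the daemon out of
 * io_uring_enter(); efd comes from eventfd(0, 0) at queue setup. */
static void arm_wakeup(struct io_uring *ring, int efd, uint64_t *buf)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_read(sqe, efd, buf, sizeof(*buf), 0);
	sqe->user_data = WAKEUP_USER_DATA;
}

/* Worker side: called after ubdsrv_complete_io() to kick the daemon. */
static void wake_daemon(int efd)
{
	uint64_t one = 1;

	write(efd, &one, sizeof(one));
}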


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-09  7:36         ` Ming Lei
@ 2022-05-09 11:53           ` Xiaoguang Wang
  2022-05-09 13:15             ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-05-09 11:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: ZiyangZhang, Gabriel Krisman Bertazi, joseph.qi, linux-block, Jens Axboe

hi,

>
>>>> [...]
>>> But each cqe has to be associated with one sqe, if I understand
>>> correctly.
>> Yeah, for current io_uring implementation, it is. But if io_uring offers below
>> helper:
>> void io_gen_cqe_direct(struct file *file, u64 user_data, s32 res, u32 cflags)
>> {
>>         struct io_ring_ctx *ctx;
>>         ctx = file->private_data;
>>
>>         spin_lock(&ctx->completion_lock);
>>         __io_fill_cqe(ctx, user_data, res, cflags);
>>         io_commit_cqring(ctx);
>>         spin_unlock(&ctx->completion_lock);
>>         io_cqring_ev_posted(ctx);
>> }
>>
>> Then in ubd driver:
>> 1) device setup stage
>> We attach io_uring files and user_data to every ubd hard queue.
>>
>> 2) when blk-mq->queue_rq() is called.
>> io_gen_cqe_direct() will be called in ubd's queue_rq, and we put ubd io request's
>> qid and tag info into cqe's res field, then we don't need to issue sqe to fetch io cmds.
> The above way is actually anti io_uring design, and I don't think it may
> improve much since submitting UBD_IO_COMMIT_AND_FETCH_REQ is pretty lightweight.
Actually I didn't come up with this idea mostly for performance reasons :) It just
tries to simplify the code a bit:
1) In the current implementation, ubdsrv needs to submit a queue depth's worth of
sqes first, and ubd_ctrl_start_dev() also needs to wait for all those sqes to be submitted.
2) It tries to make ubd_queue_rq() simpler; it may just call io_gen_cqe_direct()
(see the sketch below).
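
A hedged sketch of what 2) could look like in the ubd driver, building on the
io_gen_cqe_direct() helper quoted above; the ubd_queue struct fields
(ring_file, cmd_user_data, q_id) are hypothetical:

/* Hypothetical ubd_queue_rq() if io_gen_cqe_direct() existed. */
static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct ubd_queue *ubq = hctx->driver_data;
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* encode queue id and tag into cqe->res for ubdsrv to decode */
	io_gen_cqe_direct(ubq->ring_file, ubq->cmd_user_data,
			  (ubq->q_id << 16) | rq->tag, 0);
	return BLK_STS_OK;
}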

>
> Also without re-submitting UBD_IO_COMMIT_AND_FETCH_REQ command, how can you
> commit io handling result from ubd server and ask ubd driver to complete
> io request?
No, I don't mean to remove the COMMIT command; we still need the io_uring async
command feature to support the ubd COMMIT or GETDATA commands.

I have another concern: currently there are many flags in the ubd kernel driver
or in ubdsrv, such as:
#define UBDSRV_NEED_FETCH_RQ (1UL << 0)
#define UBDSRV_NEED_COMMIT_RQ_COMP (1UL << 1)
#define UBDSRV_IO_FREE (1UL << 2)
#define UBDSRV_IO_HANDLING (1UL << 3)
#define UBDSRV_NEED_GET_DATA (1UL << 4)

Some of their names look weird, for example UBDSRV_IO_FREE. I think
more flags may result in more state machine errors.

Regards,
Xiaoguang Wang


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-09 11:53           ` Xiaoguang Wang
@ 2022-05-09 13:15             ` Ming Lei
  2022-05-10  4:38               ` Xiaoguang Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-05-09 13:15 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: ZiyangZhang, Gabriel Krisman Bertazi, joseph.qi, linux-block, Jens Axboe

On Mon, May 09, 2022 at 07:53:05PM +0800, Xiaoguang Wang wrote:
> hi,
> 
> [...]
> > The above way is actually anti io_uring design, and I don't think it may
> > improve much since submitting UBD_IO_COMMIT_AND_FETCH_REQ is pretty lightweight.
> Actually I didn't come up with this idea mostly for performance reasons :) It just
> tries to simplify the code a bit:
> 1) In the current implementation, ubdsrv needs to submit a queue depth's worth of
> sqes first, and ubd_ctrl_start_dev() also needs to wait for all those sqes to be submitted.

Yes, because handling IO needs the associated io_uring command to have reached
the ubd driver first.

> 2) It tries to make ubd_queue_rq() simpler; it may just call io_gen_cqe_direct().

But it has to work, at least. Also, I don't see real simplification in your
suggestion.

> 
> >
> > Also without re-submitting UBD_IO_COMMIT_AND_FETCH_REQ command, how can you
> > commit io handling result from ubd server and ask ubd driver to complete
> > io request?
> No, I don't mean to remove the COMMIT command; we still need the io_uring async
> command feature to support the ubd COMMIT or GETDATA commands.

The GETDATA command has been removed, because task_work_add() is used to
complete the io_uring commands (UBD_IO_COMMIT_AND_FETCH_REQ or UBD_IO_FETCH_REQ),
so pinning pages and copying data are always done in ubdsrv daemon
context.

You may not get the whole idea:

1) UBD_IO_FETCH_REQ is only submitted to ubd driver before starting
device because at the beginning there isn't any IO handled, so no need
to send COMMIT.

2) after device is started, only UBD_IO_COMMIT_AND_FETCH_REQ is
submitted for both committing io handling result to driver and fetching new
io request, and UBD_IO_COMMIT_AND_FETCH_REQ can be thought as combined
command of COMMIT and UBD_IO_FETCH_REQ.

3) The COMMIT command is only submitted after the queue is aborted, since we
needn't fetch requests any more, and just need to commit the in-flight
requests' results to the ubd driver.

If you meant using COMMIT with io_gen_cqe_direct(), what benefit can we
get? Still one command is required for handling each IO, which is exactly what
UBD_IO_COMMIT_AND_FETCH_REQ does.
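
To make the lifecycle above concrete, here is hedged pseudo-C of the per-tag
submission decision in the ubdsrv queue loop; the flag names are the real ones
discussed below, while queue_sqe() and the struct layout are made up, and the
abort-time plain COMMIT case is omitted:

/* Pseudo-C: decide which command each free tag needs. */
static void queue_io_cmds(struct ubdsrv_queue *q)
{
	for (int tag = 0; tag < q->q_depth; tag++) {
		struct ubd_io *io = &q->ios[tag];

		if (!(io->flags & UBDSRV_IO_FREE))
			continue;	/* tag currently owned by driver/target */
		if (io->flags & UBDSRV_NEED_COMMIT_RQ_COMP)
			/* normal case: commit io->result and fetch the next IO */
			queue_sqe(q, tag, UBD_IO_COMMIT_AND_FETCH_REQ, io->result);
		else if (io->flags & UBDSRV_NEED_FETCH_RQ)
			/* only before the device starts: nothing to commit yet */
			queue_sqe(q, tag, UBD_IO_FETCH_REQ, 0);
		io->flags &= ~UBDSRV_IO_FREE;	/* now submitted to the driver */
	}
}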

> 
> I have another concern that currently there are may flags in ubd kernel or
> ubdsrv, such as:
> #define UBDSRV_NEED_FETCH_RQ (1UL << 0)

UBDSRV_NEED_FETCH_RQ means the to-be-queued io_uring command has to fetch a
new io request from the ubd driver.

> #define UBDSRV_NEED_COMMIT_RQ_COMP (1UL << 1)

UBDSRV_NEED_COMMIT_RQ_COMP means the to-be-queued io_uring command has to
commit an io handling result to the ubd driver.

> #define UBDSRV_IO_FREE (1UL << 2)

Only an io with this flag can be queued to the ubd driver. Once the flag is
cleared, it means the io command has been submitted to the ubd driver.

> #define UBDSRV_IO_HANDLING (1UL << 3)

UBDSRV_IO_HANDLING means the io command is being handled by target code.

> #define UBDSRV_NEED_GET_DATA (1UL << 4)

The above one has been removed.

> 
> Some of their names looks weird, for example UBDSRV_IO_FREE. I think
> more flags may result in more state machine error.

Figuring out perfect names is never easy, but I don't think these are
weird, since the short comments above explain them clearly.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-09 13:15             ` Ming Lei
@ 2022-05-10  4:38               ` Xiaoguang Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-05-10  4:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: ZiyangZhang, Gabriel Krisman Bertazi, joseph.qi, linux-block, Jens Axboe

hi,

> On Mon, May 09, 2022 at 07:53:05PM +0800, Xiaoguang Wang wrote:
>> hi,
>>
> [...]
>>> Also without re-submitting UBD_IO_COMMIT_AND_FETCH_REQ command, how can you
>>> commit io handling result from ubd server and ask ubd driver to complete
>>> io request?
>> No, I don't mean to remove the COMMIT command; we still need the io_uring async
>> command feature to support the ubd COMMIT or GETDATA commands.
> The GETDATA command has been removed, because task_work_add() is used to
> complete the io_uring commands (UBD_IO_COMMIT_AND_FETCH_REQ or UBD_IO_FETCH_REQ),
> so pinning pages and copying data are always done in ubdsrv daemon
> context.
OK, I'll read your latest code.

>
> You may not get the whole idea:
>
> 1) UBD_IO_FETCH_REQ is only submitted to ubd driver before starting
> device because at the beginning there isn't any IO handled, so no need
> to send COMMIT.
>
> 2) after device is started, only UBD_IO_COMMIT_AND_FETCH_REQ is
> submitted for both committing io handling result to driver and fetching new
> io request, and UBD_IO_COMMIT_AND_FETCH_REQ can be thought as combined
> command of COMMIT and UBD_IO_FETCH_REQ.
>
> 3) The COMMIT command is only submitted after the queue is aborted, since we
> needn't fetch requests any more, and just need to commit the in-flight
> requests' results to the ubd driver.
Thanks for the detailed clarification.
I think I understand the code better now.

>
> If you meant using COMMIT with io_gen_cqe_direct(), what benefit can we
> get? Still one command is required for handling each IO, which is exactly what
> UBD_IO_COMMIT_AND_FETCH_REQ does.
Agree now.
>
> [...]
OK, thanks for these clarifications again.

Regards,
Xiaoguang Wang



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-09 11:24         ` Ming Lei
@ 2022-05-10  4:46           ` Ziyang Zhang
  2022-05-11  1:52             ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Ziyang Zhang @ 2022-05-10  4:46 UTC (permalink / raw)
  To: Ming Lei; +Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On 2022/5/9 19:24, Ming Lei wrote:
> On Mon, May 09, 2022 at 04:11:47PM +0800, Ziyang Zhang wrote:
>> [...]
>>> It is pretty easy to pass application buffer to UBD_IO_FETCH_REQ or
>>> UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr which
>>> is sent to ubd driver via ubdsrv_io_cmd->addr.
>>
>> Maybe one app does not allocate a data buffer until it gets a UBD IO request
>> because it does not know the data size and pre-allocation is forbidden.
>> In this way, UBD_IO_FETCH_REQ or UBD_IO_COMMIT_AND_FETCH_REQ are not
>> helpful.
> 
> The userspace buffer from the app can be seen by the ubd driver when
> UBD_IO_COMMIT_AND_FETCH_REQ is sent to it.

Hi Ming,

I rethought this, and here is one case where UBD_IO_COMMIT_AND_FETCH_REQ is not helpful:

(1) One sqe with UBD_IO_COMMIT_AND_FETCH_REQ is sent to the ubd kernel driver, but
    no buf addr is assigned because the app allocates bufs only after it gets a new IO req.

(2) One write req (in one cqe) is sent back to UBDSRV and the app tries to handle it.

(3) The app allocates the buffer after learning the data size.

(4) We may now need a data copy in the kernel part (NEED_DATA).

UBDSRV (without NEED_DATA) does not allow apps to allocate bufs "after" getting a write IO req.
(Maybe this app is too strange...)
So the app has to ''pre''-allocate bufs so the addr can be sent
with a new sqe (UBD_IO_COMMIT_AND_FETCH_REQ) in advance.

Regards,
Zhang

> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-07 17:13     ` Ming Lei
  2022-05-09  4:05       ` Xiaoguang Wang
  2022-05-09  8:11       ` Ziyang Zhang
@ 2022-05-10  4:55       ` Ziyang Zhang
  2 siblings, 0 replies; 13+ messages in thread
From: Ziyang Zhang @ 2022-05-10  4:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On 2022/5/8 01:13, Ming Lei wrote:
> On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
>> [...]
>>
>> 2) UBDSRV(the userspace part) should not allocate data buffers itself.
>> When an application configs many queues with bigger iodepth, UBDSRV has
>> to preallocate more buffers(size = 256KiB) and results in heavy memory
>> overhead. I think data buffers should be allocated by applications
> 
> That is just virtual memory, and pages can be reclaimed after IO is
> done.

Hi Ming,

I am worried that the fixed-size (max io size, 256KiB) pre-allocated data buffers in UBDSRV
may consume too much memory. Do you mean these pages can be reclaimed by something like madvise()?
If swap is not enabled and madvise() is not called, these pages may not be reclaimed.
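
For reference, a minimal sketch of the kind of reclaim being discussed; this is
not ubdsrv code, and the buffer size macro is an assumption. After an IO
completes, the daemon could hint the kernel to drop the pages backing an idle
preallocated buffer, at the cost of refaulting zero pages on the next use:

#include <sys/mman.h>

#define UBD_MAX_IO_SZ (256 * 1024)	/* assumed max io size */

/* Drop the physical pages behind one tag's anonymous buffer; the
 * mapping stays valid and refaults as zero pages on next access. */
static int release_io_buf(void *buf)
{
	return madvise(buf, UBD_MAX_IO_SZ, MADV_DONTNEED);
}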

Regards,
Zhang

> [...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Follow up on UBD discussion
  2022-05-10  4:46           ` Ziyang Zhang
@ 2022-05-11  1:52             ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-05-11  1:52 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: Gabriel Krisman Bertazi, Xiaoguang Wang, joseph.qi, linux-block

On Tue, May 10, 2022 at 12:46:03PM +0800, Ziyang Zhang wrote:
> [...]
> 
> Hi Ming,
> 
> I rethought this, and here is one case where UBD_IO_COMMIT_AND_FETCH_REQ is not helpful:
> 
> (1) One sqe with UBD_IO_COMMIT_AND_FETCH_REQ is sent to the ubd kernel driver, but
>     no buf addr is assigned because the app allocates bufs only after it gets a new IO req.

No, UBD_IO_COMMIT_AND_FETCH_REQ is only sent to the kernel driver if the
previous IO request in this tag slot has been completed by ubdsrv. So once the
IO request is done, the ubdsrv target side should be IDLE for this IO
slot.

Storage is a traditional C/S model: here the client is the application doing IO
on /dev/ubdbN, and the server is the 'ubdsrv' daemon with the target
implementation.

So your app for handling the target is part of the server, and it should only
be active when there is pending IO from a client of /dev/ubdbN.

So UBD_IO_COMMIT_AND_FETCH_REQ without a buffer address attached is
simply invalid.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-05-11  1:53 UTC | newest]

Thread overview: 13+ messages
     [not found] <874k27rfwm.fsf@collabora.com>
2022-05-03  8:02 ` Follow up on UBD discussion Ming Lei
2022-05-07  4:20   ` ZiyangZhang
2022-05-07 17:13     ` Ming Lei
2022-05-09  4:05       ` Xiaoguang Wang
2022-05-09  7:36         ` Ming Lei
2022-05-09 11:53           ` Xiaoguang Wang
2022-05-09 13:15             ` Ming Lei
2022-05-10  4:38               ` Xiaoguang Wang
2022-05-09  8:11       ` Ziyang Zhang
2022-05-09 11:24         ` Ming Lei
2022-05-10  4:46           ` Ziyang Zhang
2022-05-11  1:52             ` Ming Lei
2022-05-10  4:55       ` Ziyang Zhang
