* [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
@ 2022-06-27  8:20 Ziyang Zhang
  2022-06-27 15:29 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ziyang Zhang @ 2022-06-27  8:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi, Ziyang Zhang

Hi Ming,

We have been studying your ubd code and are developing a library for ubd: libubd.
This mail explains why we need libubd and how we designed it.

Related threads:
(1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
(2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
(3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
(4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/


Userspace block driver(ubd)[1], based on io_uring passthrough,
allows users to define their own backend storage in userspace
and provides block devices such as /dev/ubdbX.
Ming Lei has provided kernel driver code: ubd_drv.c[2]
and userspace code: ubdsrv[3].

ubd_drv.c simply passes all blk-mq IO requests
to ubdsrv through io_uring sqes/cqes. We think the kernel code
is pretty well-designed.

ubdsrv is implemented as a single daemon,
and target (backend) IO handling (null_tgt and loop_tgt)
is embedded in the daemon.
While trying ubdsrv, we found it hard to use
with our backend.
First, a description of our backend:

(1) A distributed system sends/receives IO requests
    over the network.

(2) The system uses RPC calls among hundreds of
     storage servers, and each RPC call is associated with data buffers
     allocated from a memory pool.

(3) On each server, for each device (/dev/vdX), our backend runs
     many threads to handle IO requests and manage the device.

Second, the reasons why ubdsrv is hard for us to use:

(1) ubdsrv requires the target (backend) to issue IO requests
    to the io_uring provided by ubdsrv, but our backend
    uses something like RPC and does not support io_uring.

(2) ubdsrv forks a daemon and the daemon takes over everything.
    Users have to type "list/stop/del" control commands to interact with
    the daemon. This is inconvenient for our backend
    because it already has threads (from a C++ thread library) running inside.

(3) ubdsrv PRE-allocates internal data buffers for each ubd device.
    The data flow is:
    bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer (our RPC buffer).
    Since ubdsrv does not export its internal data buffers to the backend,
    the second copy is unavoidable.
    PRE-allocating data buffers may also waste memory
    if there are hundreds of ubd devices (/dev/ubdbX).

To make better use of ubd in more complicated scenarios, we have developed libubd.
It makes no assumptions about the backend implementation and can be embedded into it.
We followed the code structure of tcmu-runner[4],
which includes a library (libtcmu) so users can
embed tcmu-runner inside the backend's code.
libubd:

(1) Does not fork() or pthread_create(); it is embedded in the backend's threads.

(2) Provides APIs for the backend to add/delete ubd devices
    and to fetch/commit IO requests (a rough sketch follows below).

(3) Simply passes backend-provided data buffers to ubd_drv.c in the kernel,
    since the backend actually has no knowledge
    of the incoming data size until it gets an IO descriptor.
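
For illustration, here is a rough sketch of the API shape we have in mind.
All names, types and signatures below are tentative placeholders, not a
finished interface:

/* Illustrative only -- not a final API. Opaque handles keep libubd thin. */
struct ubdlib_ctrl_dev;                 /* one ubd device (/dev/ubdbX) */
struct ubdlib_queue;                    /* one queue of that device */

/* device management, callable from the backend's own admin threads */
struct ubdlib_ctrl_dev *ubdlib_dev_add(int dev_id, int nr_queues,
                                       int queue_depth,
                                       unsigned int rq_max_blocks);
int ubdlib_dev_del(struct ubdlib_ctrl_dev *dev);

/* per-queue IO path, driven from the backend's threads: fetch blk-mq
 * requests as IO descriptors, later commit each result together with a
 * backend-provided data buffer (e.g. our RPC buffer) */
int ubdlib_fetch_io_requests(struct ubdlib_queue *q);
int ubdlib_complete_io_request(struct ubdlib_queue *q, int tag,
                               int res, void *backend_buf);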

Note: 

(1) libubd is just a POC demo, does not yet stick to the principles of
    library design, and is still under development...

(2) The repo[5] includes some useful examples that use libubd.

(3) We modified the kernel part (ubd_drv.c), and our tree[6] is not based on
    Ming Lei's newest branch[2], because we forked our branch from his early
    branch (v5.17-ubd-dev).

Thanks,
Zhang

[1]https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
[2]https://github.com/ming1/linux/tree/my_for-5.19-ubd-devel_v3
[3]https://github.com/ming1/ubdsrv
[4]https://github.com/open-iscsi/tcmu-runner
[5]https://github.com/old-memories/libubd
[6]https://github.com/old-memories/linux/tree/v5.17-ubd-dev-mq-ubuf


* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-27  8:20 [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough) Ziyang Zhang
@ 2022-06-27 15:29 ` Ming Lei
  2022-06-29  3:22   ` Ziyang Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-06-27 15:29 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

Hi Ziyang,

On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> Hi Ming,
> 
> We are learning your ubd code and developing a library: libubd for ubd.
> This article explains why we need libubd and how we design it.
> 
> Related threads:
> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> 
> 
> Userspace block driver(ubd)[1], based on io_uring passthrough,
> allows users to define their own backend storage in userspace
> and provides block devices such as /dev/ubdbX.
> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> and userspace code: ubdsrv[3].
> 
> ubd_drv.c simply passes all blk-mq IO requests
> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> is pretty well-designed.
> 
> ubdsrv is implemented by a single daemon
> and target(backend) IO handling(null_tgt and loop_tgt) 
> is embedded in the daemon. 
> While trying ubdsrv, we find ubdsrv is hard to be used 
> by our backend.

ubd is supposed to provide one generic framework for userspace block
drivers, and it can be used for doing lots of fun/useful things.

If I understand correctly, that isn't quite your use case:

1) your userspace block driver isn't generic, and is dedicated to
Alibaba's use cases

2) your case has existed for a long time, and you want to switch from another
approach (maybe tcmu) to ubd, given that ubd has better performance.

> First is description of our backend:
> 
> (1) a distributing system sends/receives IO requests 
>     through network.
> 
> (2) The system use RPC calls among hundreds of
>      storage servers and RPC calls are associated with data buffers
>      allocated from a memory pool.
> 
> (3) On each server for each device(/dev/vdX), our backend runs
>      many threads to handle IO requests and manage the device. 
> 
> Second are reasons why ubdsrv is hard to use for us:
> 
> (1) ubdsrv requires the target(backend) issues IO requests
>     to the io_uring provided by ubdsrv but our backend 
>     uses something like RPC and does not support io_uring.

As one generic framework, the io command has to be io_uring
passthrough, but the io itself doesn't have to be handled by io_uring.

But IMO io_uring is much more efficient, so I'd try to make async IO
(io_uring) a first-class citizen in the framework, especially for a new
driver.

It can really support other ways too, such as using io_uring with eventfd:
another userspace context handles the io, then wakes up the io_uring context
via eventfd. You may not use io_uring for handling io, but you still
need to communicate with the context that handles the io_uring passthrough
commands, and some mechanism (such as eventfd) has to be there for that
communication.
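
For concreteness, a minimal sketch of one way to wire that up with liburing
plus an eventfd; this is illustrative only (not ubdsrv code), and the helper
names are made up:

#include <liburing.h>
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>
#include <stdint.h>

/* The io_uring context that handles the UBD io commands also polls an
 * eventfd; a non-io_uring worker context completes IO elsewhere and then
 * wakes the io_uring context by writing that eventfd. */
static int arm_worker_wakeup(struct io_uring *ring, int *efd_out)
{
        int efd = eventfd(0, EFD_CLOEXEC);
        struct io_uring_sqe *sqe;

        if (efd < 0)
                return -1;
        sqe = io_uring_get_sqe(ring);
        if (!sqe)
                return -1;

        /* multishot: every eventfd write from a worker produces one CQE,
         * so the queue thread sleeping in io_uring_enter() wakes up */
        io_uring_prep_poll_multishot(sqe, efd, POLLIN);
        sqe->user_data = 0;     /* reserved tag: "worker notification" */

        *efd_out = efd;
        return io_uring_submit(ring);
}

/* Called from the worker context once an IO is done and ready to commit. */
static void notify_io_done(int efd)
{
        uint64_t one = 1;

        (void)!write(efd, &one, sizeof(one));
}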

> 
> (2) ubdsrv forks a daemon and it takes over everything.
>     Users should type "list/stop/del" ctrl-commands to interact with
>     the daemon. It is inconvenient for our backend
>     because it has threads(from a C++ thread library) running inside.

No, list/stop/del won't interact with the daemon, and the per-queue
pthread only handles IO commands (io_uring passthrough) and IO requests.

> 
> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
>     The data flow is:
>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
>     Since ubdsrv does not export its internal data buffer to backend,
>     the second copy is unavoidable. 
>     PRE-allocating data buffer may not be a good idea for wasting memory
>     if there are hundreds of ubd devices(/dev/ubdbX).

The preallocation is just virtual memory, which is cheap and not pinned, but
ubdsrv does support buffers provided by the io command, see:

https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d

> 
> To better use ubd in more complicated scenarios, we have developed libubd.
> It does not assume implementation of backend and can be embedded into it.
> We refer to the code structure of tcmu-runner[4], 
> which includes a library(libtcmu) for users 
> to embed tcmu-runner inside backend's code. 
> It:
> 
> (1) Does not fork/pthread_create but embedded in backend's threads

That is because your backend may not use io_uring, I guess.

But it is pretty easy to move the decision of creating pthread to target
code, which can be done in the interface of .prepare_target().

> 
> (2) Provides libubd APIs for backend to add/delete ubd devices 
>     and fetch/commit IO requests

The above could be the main job of libubd.

> 
> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
>     since the backend actually has no knowledge 
>     on incoming data size until it gets an IO descriptor.

I can understand your requirement; I haven't looked at your code yet, but libubd
should be pretty thin from a functional viewpoint, and there are lots of common
things to abstract/share among all drivers. Please see the recent ubdsrv changes:

https://github.com/ming1/ubdsrv/commits/master

in which:
	- a coroutine is added for handling target io
	- the target interface (ubdsrv_tgt_type) has been cleaned up/improved for
	supporting complicated targets
	- C++ support is added

IMO, libubd isn't worth a fresh new project, and it could be integrated
into ubdsrv easily. The potential users could be existing userspace
block driver projects.

If you don't object, I am happy to work with you to add libubd
support in ubdsrv, so we can avoid reinventing the wheel.

> 
> Note: 
> 
> (1) libubd is just a POC demo and is not stick to the principles of
>     designing a library and we are still developing it now...
> 
> (2) The repo[5] including some useful examples using libubd. 
> 
> (3) We modify the kernel part: ubd_drv.c and 
>     it[6] is against Ming Lei's newest branch[2]
>     because we forked our branch from his early branch
>     (v5.17-ubd-dev).

Please look at the following tree for ubd driver:

https://github.com/ming1/linux/tree/my_for-5.19-ubd-devel_v3

in which most of your changes should already be there.

I will post v3 soon; please feel free to review it after it is out and
see if it works for you.


Thanks,
Ming



* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-27 15:29 ` Ming Lei
@ 2022-06-29  3:22   ` Ziyang Zhang
  2022-06-29 11:33     ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ziyang Zhang @ 2022-06-29  3:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

Hi Ming,

On 2022/6/27 23:29, Ming Lei wrote:
> Hi Ziyang,
> 
> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
>> Hi Ming,
>>
>> We are learning your ubd code and developing a library: libubd for ubd.
>> This article explains why we need libubd and how we design it.
>>
>> Related threads:
>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
>>
>>
>> Userspace block driver(ubd)[1], based on io_uring passthrough,
>> allows users to define their own backend storage in userspace
>> and provides block devices such as /dev/ubdbX.
>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
>> and userspace code: ubdsrv[3].
>>
>> ubd_drv.c simply passes all blk-mq IO requests
>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
>> is pretty well-designed.
>>
>> ubdsrv is implemented by a single daemon
>> and target(backend) IO handling(null_tgt and loop_tgt) 
>> is embedded in the daemon. 
>> While trying ubdsrv, we find ubdsrv is hard to be used 
>> by our backend.
> 
> ubd is supposed to provide one generic framework for user space block
> driver, and it can be used for doing lots of fun/useful thing.
> 
> If I understand correctly, this isn't same with your use case:
> 
> 1) your user space block driver isn't generic, and should be dedicated
> for Alibaba's uses
> 
> 2) your case has been there for long time, and you want to switch from other
> approach(maybe tcmu) to ubd given ubd has better performance.
> 

Yes, you are correct :)
The idea of designing libubd actually comes from libtcmu.

We do have userspace storage systems as the IO handling backend,
and we need ubd to provide block devices such as /dev/ubdbX for upper-layer client apps.


I think your motivation is to provide a complete userspace block driver
so that users DO NOT have to change any code.
With libubd, users DO change their code in order to embed libubd into the backend.


>> First is description of our backend:
>>
>> (1) a distributing system sends/receives IO requests 
>>     through network.
>>
>> (2) The system use RPC calls among hundreds of
>>      storage servers and RPC calls are associated with data buffers
>>      allocated from a memory pool.
>>
>> (3) On each server for each device(/dev/vdX), our backend runs
>>      many threads to handle IO requests and manage the device. 
>>
>> Second are reasons why ubdsrv is hard to use for us:
>>
>> (1) ubdsrv requires the target(backend) issues IO requests
>>     to the io_uring provided by ubdsrv but our backend 
>>     uses something like RPC and does not support io_uring.
> 
> As one generic framework, the io command has to be io_uring
> passthrough, and the io doesn't have to be handled by io_uring.

Yes, our backend defines its own communication method.

> 
> But IMO io_uring is much more efficient, so I'd try to make async io
> (io uring) as the 1st citizen in the framework, especially for new
> driver.
> 
> But it can support other way really, such as use io_uring with eventfd,
> the other userspace context can handle io, then wake up io_uring context
> via eventfd. You may not use io_uring for handling io, but you still
> need to communicate with the context for handling io_uring passthrough
> command, and one mechanism(such as eventfd) has to be there for the
> communication.

OK, eventfd may be helpful.
If you read my API, you will find ubdlib_complete_io_request().
I think the backend IO worker thread can call this function to tell the
ubd queue thread (with the io_uring context in it) to commit the IO.



> 
>>
>> (2) ubdsrv forks a daemon and it takes over everything.
>>     Users should type "list/stop/del" ctrl-commands to interact with
>>     the daemon. It is inconvenient for our backend
>>     because it has threads(from a C++ thread library) running inside.
> 
> No, list/stop/del won't interact with the daemon, and the per-queue
> pthread is only handling IO commands(io_uring passthrough) and IO request.
> 


Sorry, I made a mistake.

I mean that, from the user's point of view,
they have to type list/del/stop on the command line to control the daemon.
(I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon.)

This is a little weird if we are trying to make a ubd library.
So libubd actually provides APIs for users to do these list/del/stop operations.


>>
>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
>>     The data flow is:
>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
>>     Since ubdsrv does not export its internal data buffer to backend,
>>     the second copy is unavoidable. 
>>     PRE-allocating data buffer may not be a good idea for wasting memory
>>     if there are hundreds of ubd devices(/dev/ubdbX).
> 
> The preallocation is just virtual memory, which is cheap and not pinned, but
> ubdsrv does support buffer provided by io command, see:
> 
> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d

Actually I discussed the design of the pre-allocation in your RFC patch for ubd_drv,
but you did not reply :)

I paste it here:

"I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."

I observed that your ubdsrv uses posix_memalign() to pre-allocate data buffers,
and I already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX devices.

Another IMPORTANT problem: your commit
https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
may not help WRITE requests, if I understand correctly.

Consider this data flow:

1. ubdsrv commits an IO req (req1, a READ req).

2. ubdsrv issues an sqe (UBD_IO_COMMIT_AND_FETCH_REQ) and sets io->addr to addr1.
   addr1 is the address of the buffer the user passed.

3. ubd gets the sqe, commits req1, and sets io->addr to addr1.

4. ubd gets an IO req (req2, a WRITE req) from blk-mq (queue_rq) and commits a cqe.

5. ubd copies the data to be written from the bio vectors to addr1 in a task_work.

6. ubdsrv gets the cqe and tells the IO target to handle req2.

7. The IO target handles req2. It is a WRITE req, so the target issues an io_uring write
   cmd (with the buffer set to addr1).


The problem happens in 5). You cannot know the actual data_len of a blk-mq req
until you get it in queue_rq, so the length of addr1 may be less than data_len.
> 
>>
>> To better use ubd in more complicated scenarios, we have developed libubd.
>> It does not assume implementation of backend and can be embedded into it.
>> We refer to the code structure of tcmu-runner[4], 
>> which includes a library(libtcmu) for users 
>> to embed tcmu-runner inside backend's code. 
>> It:
>>
>> (1) Does not fork/pthread_create but embedded in backend's threads
> 
> That is because your backend may not use io_uring, I guess.
> 
> But it is pretty easy to move the decision of creating pthread to target
> code, which can be done in the interface of .prepare_target().

I think the library should not create any threads if we want a libubd.

> 
>>
>> (2) Provides libubd APIs for backend to add/delete ubd devices 
>>     and fetch/commit IO requests
> 
> The above could be the main job of libubd.

indeed.

> 
>>
>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
>>     since the backend actually has no knowledge 
>>     on incoming data size until it gets an IO descriptor.
> 
> I can understand your requirement, not look at your code yet, but libubd
> should be pretty thin from function viewpoint, and there are lots of common
> things to abstract/share among all drivers, please see recent ubdsrv change:
> 
> https://github.com/ming1/ubdsrv/commits/master
> 
> in which:
> 	- coroutine is added for handling target io
> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> 	supporting complicated target
> 	- c++ support

Yes, I have read your coroutine code, but I am not an expert in C++20. :(
I think that is actually target (backend) design, and ubd should not assume
how the backend handles IOs.

The work ubd has to do in userspace is (see the sketch below):

1) hand IO descriptors to the backend, e.g. ubd_get_io_requests()

2) get IO completions from the backend, e.g. ubd_complete_io_requests()
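
For example, a minimal sketch of how a backend thread might drive these two
calls; every name below is a hypothetical placeholder, not existing code:

/* Hypothetical sketch only. */
struct ubd_io_desc   { unsigned op; unsigned long long sector; unsigned len; int tag; };
struct ubd_io_result { int tag; int res; };
struct ubd_queue;

static void backend_poll_one_ubd_queue(struct ubd_queue *q)
{
        struct ubd_io_desc descs[32];
        struct ubd_io_result results[32];
        int n, done;

        /* 1) hand a batch of IO descriptors from ubd to the backend */
        n = ubd_get_io_requests(q, descs, 32);
        backend_submit(descs, n);            /* RPC, thread pool, ... */

        /* 2) collect whatever the backend has finished so far and
         *    commit those completions back to ubd */
        done = backend_reap(q, results, 32);
        ubd_complete_io_requests(q, results, done);
}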



> 
> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> into ubdsrv easily. The potential users could be existed usersapce
> block driver projects.

Yes, many userspace storage systems could use ubd!
You may look at tcmu-runner. It:

1) provides a library (libtcmu.c) for those who have an existing backend.

2) provides a runner (main.c in tcmu-runner), like your ubdsrv,
   for those who just want to run it.
   The runner is built on top of libtcmu.

> 
> If you don't object, I am happy to co-work with you to add the support
> for libubd in ubdsrv, then we can avoid to invent a wheel

+1 :)

> 
>>
>> Note: 
>>
>> (1) libubd is just a POC demo and is not stick to the principles of
>>     designing a library and we are still developing it now...
>>
>> (2) The repo[5] including some useful examples using libubd. 
>>
>> (3) We modify the kernel part: ubd_drv.c and 
>>     it[6] is against Ming Lei's newest branch[2]
>>     because we forked our branch from his early branch
>>     (v5.17-ubd-dev).
> 
> Please look at the following tree for ubd driver:
> 
> https://github.com/ming1/linux/tree/my_for-5.19-ubd-devel_v3
> 
> in which most of your change should have been there already.
> 
> I will post v3 soon, please feel free to review after it is out and
> see if it is fine for you.

Yes, I have read your newest branch.
You use some task_work-based functions in ubd_drv.c
for error handling, such as aborting IO.

But I find them too complicated to understand,
and it's hard to write libubd code against this branch.

So I chose your first (easiest to understand)
version: v5.17-ubd-dev.

Thanks,
Zhang.

> 
> 
> Thanks,
> Ming


* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-29  3:22   ` Ziyang Zhang
@ 2022-06-29 11:33     ` Ming Lei
  2022-06-30  7:16       ` Ziyang Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-06-29 11:33 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi, ming.lei

On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> Hi Ming,
> 
> On 2022/6/27 23:29, Ming Lei wrote:
> > Hi Ziyang,
> > 
> > On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >> Hi Ming,
> >>
> >> We are learning your ubd code and developing a library: libubd for ubd.
> >> This article explains why we need libubd and how we design it.
> >>
> >> Related threads:
> >> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>
> >>
> >> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >> allows users to define their own backend storage in userspace
> >> and provides block devices such as /dev/ubdbX.
> >> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >> and userspace code: ubdsrv[3].
> >>
> >> ubd_drv.c simply passes all blk-mq IO requests
> >> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >> is pretty well-designed.
> >>
> >> ubdsrv is implemented by a single daemon
> >> and target(backend) IO handling(null_tgt and loop_tgt) 
> >> is embedded in the daemon. 
> >> While trying ubdsrv, we find ubdsrv is hard to be used 
> >> by our backend.
> > 
> > ubd is supposed to provide one generic framework for user space block
> > driver, and it can be used for doing lots of fun/useful thing.
> > 
> > If I understand correctly, this isn't same with your use case:
> > 
> > 1) your user space block driver isn't generic, and should be dedicated
> > for Alibaba's uses
> > 
> > 2) your case has been there for long time, and you want to switch from other
> > approach(maybe tcmu) to ubd given ubd has better performance.
> > 
> 
> Yes, you are correct :)
> The idea of design libubd is actually from libtcmu.
> 
> We do have some userspace storage system as the IO handling backend, 
> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> 
> 
> I think your motivation is that provides a complete user block driver to users
> and they DO NOT change any code.
> Users DO change their code using libubd for embedding libubd into the backend.
> 
> 
> >> First is description of our backend:
> >>
> >> (1) a distributing system sends/receives IO requests 
> >>     through network.
> >>
> >> (2) The system use RPC calls among hundreds of
> >>      storage servers and RPC calls are associated with data buffers
> >>      allocated from a memory pool.
> >>
> >> (3) On each server for each device(/dev/vdX), our backend runs
> >>      many threads to handle IO requests and manage the device. 
> >>
> >> Second are reasons why ubdsrv is hard to use for us:
> >>
> >> (1) ubdsrv requires the target(backend) issues IO requests
> >>     to the io_uring provided by ubdsrv but our backend 
> >>     uses something like RPC and does not support io_uring.
> > 
> > As one generic framework, the io command has to be io_uring
> > passthrough, and the io doesn't have to be handled by io_uring.
> 
> Yes, our backend define its own communicating method.
> 
> > 
> > But IMO io_uring is much more efficient, so I'd try to make async io
> > (io uring) as the 1st citizen in the framework, especially for new
> > driver.
> > 
> > But it can support other way really, such as use io_uring with eventfd,
> > the other userspace context can handle io, then wake up io_uring context
> > via eventfd. You may not use io_uring for handling io, but you still
> > need to communicate with the context for handling io_uring passthrough
> > command, and one mechanism(such as eventfd) has to be there for the
> > communication.
> 
> Ok, eventfd may be helpful. 
> If you read my API, you may find ubdlib_complete_io_request().
> I think the backend io worker thread can call this function to tell the 
> ubd queue thread(the io_uring context in it) to commit the IO.

ubdlib_complete_io_request() has to be called in the same pthread
context, which does not look flexible. When you handle IO via non-io_uring in the same
context, the CPU utilization on the submission/completion side should be
higher than with io_uring, and this way should be worse than the usage in
ubd/loop. That is why I suggest using one io_uring for handling both
the io commands and the io requests, if possible.

> 
> 
> 
> > 
> >>
> >> (2) ubdsrv forks a daemon and it takes over everything.
> >>     Users should type "list/stop/del" ctrl-commands to interact with
> >>     the daemon. It is inconvenient for our backend
> >>     because it has threads(from a C++ thread library) running inside.
> > 
> > No, list/stop/del won't interact with the daemon, and the per-queue
> > pthread is only handling IO commands(io_uring passthrough) and IO request.
> > 
> 
> 
> Sorry I made a mistake.
> 
> I mean from user's view, 
> he has to type list/del/stop from cmdlind to control the daemon.
> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> 
> This is a little weird if we try to make a ubd library.
> So I actually provides APIs in libubd for users to do these list/del/stop works.

OK, it is fine to export APIs for admin purposes.

> 
> 
> >>
> >> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>     The data flow is:
> >>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>     Since ubdsrv does not export its internal data buffer to backend,
> >>     the second copy is unavoidable. 
> >>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>     if there are hundreds of ubd devices(/dev/ubdbX).
> > 
> > The preallocation is just virtual memory, which is cheap and not pinned, but
> > ubdsrv does support buffer provided by io command, see:
> > 
> > https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> 
> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> but you did not reply :)
> 
> I paste it here:
> 
> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> 
> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.

Usually posix_memalign() just allocates virtual memory, which is effectively unlimited
on 64-bit arches, and pages are not allocated until the buffer is actually read or written.
After the READ/WRITE is done, the kernel can still reclaim the pages in this
virtual memory.

In the future, we may still optimize the memory use via madvise(), such as
MADV_DONTNEED, after a slot has been idle for long enough.

> 
> Another IMPORTANT problem is your commit:
> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> may be not helpful for WRITE requests if I understand correctly.
> 
> Consider this data flow:
> 
> 1. ubdsrv commits an IO req(req1, a READ req).
> 
> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
>    addr1 is the addr of buffer user passed.
>    
> 
> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> 
> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> 
> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> 
> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> 
> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
>    cmd(with buffer set to addr1).
> 
> 
> 
> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> until you get one in queue_rq. So length of addr1 may be less than data_len.

So far, the actual length of the buffer has to be at least rq_max_blocks, since
we set that as the ubd queue's max hw sectors. Yeah, you may argue that this
wastes memory, but process virtual address space is effectively unlimited on
64-bit arches, and pages are not allocated until the actual read/write is started.

> > 
> >>
> >> To better use ubd in more complicated scenarios, we have developed libubd.
> >> It does not assume implementation of backend and can be embedded into it.
> >> We refer to the code structure of tcmu-runner[4], 
> >> which includes a library(libtcmu) for users 
> >> to embed tcmu-runner inside backend's code. 
> >> It:
> >>
> >> (1) Does not fork/pthread_create but embedded in backend's threads
> > 
> > That is because your backend may not use io_uring, I guess.
> > 
> > But it is pretty easy to move the decision of creating pthread to target
> > code, which can be done in the interface of .prepare_target().
> 
> I think the library should not create any thread if we want a libubd.

I agree.

> 
> > 
> >>
> >> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>     and fetch/commit IO requests
> > 
> > The above could be the main job of libubd.
> 
> indeed.
> 
> > 
> >>
> >> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>     since the backend actually has no knowledge 
> >>     on incoming data size until it gets an IO descriptor.
> > 
> > I can understand your requirement, not look at your code yet, but libubd
> > should be pretty thin from function viewpoint, and there are lots of common
> > things to abstract/share among all drivers, please see recent ubdsrv change:
> > 
> > https://github.com/ming1/ubdsrv/commits/master
> > 
> > in which:
> > 	- coroutine is added for handling target io
> > 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> > 	supporting complicated target
> > 	- c++ support
> 
> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> I think it is actually target(backend) design and ubd should not assume 
> how the backend handle IOs. 
> 
> The work ubd in userspace has to be done is:
> 
> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> 
> 2) get IO completion form backend, such as ubd_complete_io_requests()

Or the user provides/registers two callbacks: handle_io_async() and
io_complete(); the former is called when a request comes from the ubd
driver, the latter (optional) is called when an io is done.
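
Something like the following shape, just as an illustration;
handle_io_async()/io_complete() are the names suggested above, everything
else is made up:

/* Illustrative only: none of these structs exist today. */
struct ubd_io_desc;                     /* op, sector, length, tag, ... */
struct ubdsrv_queue;

struct ubd_tgt_ops {
        /* called in the queue's io_uring context for each request fetched
         * from the ubd driver; the target may complete it immediately or
         * queue it to its own workers and return */
        int  (*handle_io_async)(struct ubdsrv_queue *q,
                                const struct ubd_io_desc *io);

        /* optional: called after the io has been committed to the driver */
        void (*io_complete)(struct ubdsrv_queue *q, int tag, int res);
};

int ubdsrv_register_tgt_ops(struct ubdsrv_queue *q,
                            const struct ubd_tgt_ops *ops);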

Also, you didn't mention how you notify io_uring about io completion after
io_uring_enter() has gone to sleep, if your backend code doesn't use io_uring to
handle io.

I think one communication mechanism (such as eventfd) is needed for your
case.

> 
> 
> 
> > 
> > IMO, libubd isn't worth of one freshly new project, and it could be integrated
> > into ubdsrv easily. The potential users could be existed usersapce
> > block driver projects.
> 
> Yes, so many userspace storage systems can use ubd!
> You may look at tcmu-runner. It:
> 
> 1) provides a library(libtcmu.c) for those who have a existing backend.
> 
> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
>    for those who just want to run it. 
>    And the runner is build on top of libtcmu.
> 
> > 
> > If you don't object, I am happy to co-work with you to add the support
> > for libubd in ubdsrv, then we can avoid to invent a wheel
> 
> +1 :)

Thinking about it further, I'd suggest splitting ubdsrv into two parts:

1) libubdsrv
- provides APIs like what you did in libubd
- provides an API for notifying io_uring (which handles the io commands) that an io is
completed, and that API should support handling IO from another context
(not the same as the io_uring context handling the io commands); a sketch follows below

2) ubd target
- built on libubdsrv: the ubd command is built on libubdsrv, and
specific target implementations are built on the library too.

It shouldn't be hard to work towards this direction, and I guess this
way should make the current target implementation cleaner.
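
As a sketch only (nothing like this exists yet), the notification API could
look like:

/* Hypothetical libubdsrv API sketch. */
struct ubdsrv_queue;

/* Thread-safe: may be called from any backend context, not only from the
 * io_uring context handling the io commands.  It records the result for
 * the given tag and wakes the queue's io_uring context (for example by
 * writing an internal eventfd), which then submits
 * UBD_IO_COMMIT_AND_FETCH_REQ for that tag. */
int ubdsrv_queue_io_done(struct ubdsrv_queue *q, int tag, int res);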

> 
> > 
> >>
> >> Note: 
> >>
> >> (1) libubd is just a POC demo and is not stick to the principles of
> >>     designing a library and we are still developing it now...
> >>
> >> (2) The repo[5] including some useful examples using libubd. 
> >>
> >> (3) We modify the kernel part: ubd_drv.c and 
> >>     it[6] is against Ming Lei's newest branch[2]
> >>     because we forked our branch from his early branch
> >>     (v5.17-ubd-dev).
> > 
> > Please look at the following tree for ubd driver:
> > 
> > https://github.com/ming1/linux/tree/my_for-5.19-ubd-devel_v3
> > 
> > in which most of your change should have been there already.
> > 
> > I will post v3 soon, please feel free to review after it is out and
> > see if it is fine for you.
> 
> Yes, I have read your newest branch.
> You use some task_work() functions in ubd_drv.c 
> for error-handling such as aborting IO.

IO aborting has to be supported; otherwise, what if the io_uring
context (pthread) is killed, or what if the device is removed while
IO is being handled?

ubdsrv/tests/generic provides two tests that delete a ubd device and kill
the per-queue pthread while heavy IO is running.

> 
> But I find they are too complicated to understand 
> and it's hard to write libubd code in this branch.

Actually the latest ubdsrv code has become much cleaner, and it should be
easier to abstract APIs for an external/binary target.

> 
> So I choose your first(easiest to understand)
> version: v5.17-ubd-dev.

That code is outdated, and full of bugs, :-(


Thanks,
Ming



* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-29 11:33     ` Ming Lei
@ 2022-06-30  7:16       ` Ziyang Zhang
  2022-06-30  9:09         ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ziyang Zhang @ 2022-06-30  7:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

Hi, Ming

On 2022/6/29 19:33, Ming Lei wrote:
> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
>> Hi Ming,
>>
>> On 2022/6/27 23:29, Ming Lei wrote:
>>> Hi Ziyang,
>>>
>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
>>>> Hi Ming,
>>>>
>>>> We are learning your ubd code and developing a library: libubd for ubd.
>>>> This article explains why we need libubd and how we design it.
>>>>
>>>> Related threads:
>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
>>>>
>>>>
>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
>>>> allows users to define their own backend storage in userspace
>>>> and provides block devices such as /dev/ubdbX.
>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
>>>> and userspace code: ubdsrv[3].
>>>>
>>>> ubd_drv.c simply passes all blk-mq IO requests
>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
>>>> is pretty well-designed.
>>>>
>>>> ubdsrv is implemented by a single daemon
>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
>>>> is embedded in the daemon. 
>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
>>>> by our backend.
>>>
>>> ubd is supposed to provide one generic framework for user space block
>>> driver, and it can be used for doing lots of fun/useful thing.
>>>
>>> If I understand correctly, this isn't same with your use case:
>>>
>>> 1) your user space block driver isn't generic, and should be dedicated
>>> for Alibaba's uses
>>>
>>> 2) your case has been there for long time, and you want to switch from other
>>> approach(maybe tcmu) to ubd given ubd has better performance.
>>>
>>
>> Yes, you are correct :)
>> The idea of design libubd is actually from libtcmu.
>>
>> We do have some userspace storage system as the IO handling backend, 
>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
>>
>>
>> I think your motivation is that provides a complete user block driver to users
>> and they DO NOT change any code.
>> Users DO change their code using libubd for embedding libubd into the backend.
>>
>>
>>>> First is description of our backend:
>>>>
>>>> (1) a distributing system sends/receives IO requests 
>>>>     through network.
>>>>
>>>> (2) The system use RPC calls among hundreds of
>>>>      storage servers and RPC calls are associated with data buffers
>>>>      allocated from a memory pool.
>>>>
>>>> (3) On each server for each device(/dev/vdX), our backend runs
>>>>      many threads to handle IO requests and manage the device. 
>>>>
>>>> Second are reasons why ubdsrv is hard to use for us:
>>>>
>>>> (1) ubdsrv requires the target(backend) issues IO requests
>>>>     to the io_uring provided by ubdsrv but our backend 
>>>>     uses something like RPC and does not support io_uring.
>>>
>>> As one generic framework, the io command has to be io_uring
>>> passthrough, and the io doesn't have to be handled by io_uring.
>>
>> Yes, our backend define its own communicating method.
>>
>>>
>>> But IMO io_uring is much more efficient, so I'd try to make async io
>>> (io uring) as the 1st citizen in the framework, especially for new
>>> driver.
>>>
>>> But it can support other way really, such as use io_uring with eventfd,
>>> the other userspace context can handle io, then wake up io_uring context
>>> via eventfd. You may not use io_uring for handling io, but you still
>>> need to communicate with the context for handling io_uring passthrough
>>> command, and one mechanism(such as eventfd) has to be there for the
>>> communication.
>>
>> Ok, eventfd may be helpful. 
>> If you read my API, you may find ubdlib_complete_io_request().
>> I think the backend io worker thread can call this function to tell the 
>> ubd queue thread(the io_uring context in it) to commit the IO.
> 
> The ubdlib_complete_io_request() has to be called in the same pthread
> context, that looks not flexible. When you handle IO via non-io_uring in the same
> context, the cpu utilization in submission/completion side should be
> higher than io_uring. And this way should be worse than the usage in
> ubd/loop, that is why I suggest to use one io_uring for handling both
> io command and io request if possible.

ubdlib_complete_io_request() can be called in an io worker thread,
not in the ubdsrv queue thread (the one with the io_uring context for handling uring_cmd).

You can find ubd_runner.c in my libubd repo. There are many io worker
threads per ubdsrv queue to handle IO requests.

Actually this idea comes from tcmu-runner. The data flow is:

1) in the ubdsrv queue thread, io_uring_enter() returns (IO reqs received from blk-mq)

2) in the ubdsrv queue thread, ubdsrv_reap_requests() iterates over each cqe (each carrying an IO req):

   for READ/WRITE requests, ubd_aio_queue_io() enqueues the IO req into an io_queue
   (each ubdsrv queue has one io_queue); the IO req's status becomes IO_HANDLING_ASYNC.

   for other simple requests (which can be handled very quickly),
   handle them right away and call ubdlib_complete_io_request()

3) in the ubdsrv queue thread, ubdsrv_commit_and_fetch() iterates over all IO slots of the ubdsrv queue
   and sets up an sqe for each IO (IO completion) that is ready to commit.

   Here, some IO slots are still IO_HANDLING_ASYNC, so no sqe is generated for them.


4) in the ubdsrv queue thread, io_uring_enter() submits all sqes and waits for cqes
   (io_uring_enter() returns once at least one IO req is received from blk-mq)

5) while 3) or 4) happens, the IO worker threads of the ubdsrv queue run concurrently:
   each io worker thread tries to dequeue and handle one IO req from the queue's io_queue.

   After an IO worker handles an IO req (WRITE/READ), it calls ubdlib_complete_io_request(),
   which marks the IO req's status as ready to commit.

IO handling/completion and io_uring_enter() can happen at the same time.

Besides, io_uring_enter() can:

1) block and wait for cqes until at least
one blk-mq req comes from queue_rq()

2) submit sqes (with the last IO completion and the next fetch)

so I have to consider how to notify io_uring about io completion
after io_uring_enter() has gone to sleep (blocking and waiting for cqes).

In the current version of ubd_runner (an async libubd target), I try to use a "non-blocking"
io_uring_enter_timeout(), and the caller can set a timeout value for it,
so IO completions that happen after one io_uring_enter_timeout() call can be committed
by the next io_uring_enter_timeout() call...

But this is a very ugly implementation,
because I may waste CPU on useless loops in the ubdsrv queue thread if
blk-mq reqs do not come in frequently.

You mentioned that eventfd may be helpful, and I agree with you. :)
I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
in ubdlib_complete_io_request(); a small sketch of that follows below.

I will fix my code.
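
A minimal sketch of what I have in mind (hypothetical libubd internals, not
existing code):

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

enum ubd_io_state { IO_FREE, IO_HANDLING_ASYNC, IO_READY_TO_COMMIT };

struct ubd_io_slot {
        _Atomic int     state;
        int             res;    /* completion result */
        void            *buf;   /* backend-provided data buffer */
};

struct ubdlib_queue {
        struct ubd_io_slot *slots;  /* one slot per tag */
        int                 efd;    /* eventfd armed in the queue's io_uring */
};

/* Called from a backend io worker thread once the IO is done. */
void ubdlib_complete_io_request(struct ubdlib_queue *q, int tag,
                                int res, void *buf)
{
        uint64_t one = 1;

        q->slots[tag].res = res;
        q->slots[tag].buf = buf;
        atomic_store(&q->slots[tag].state, IO_READY_TO_COMMIT);

        /* wake the queue thread sleeping in io_uring_enter(): the eventfd is
         * armed there (e.g. via a poll sqe), so this write produces a cqe and
         * the queue thread can then submit UBD_IO_COMMIT_AND_FETCH_REQ */
        (void)!write(q->efd, &one, sizeof(one));
}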

> 
>>
>>
>>
>>>
>>>>
>>>> (2) ubdsrv forks a daemon and it takes over everything.
>>>>     Users should type "list/stop/del" ctrl-commands to interact with
>>>>     the daemon. It is inconvenient for our backend
>>>>     because it has threads(from a C++ thread library) running inside.
>>>
>>> No, list/stop/del won't interact with the daemon, and the per-queue
>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
>>>
>>
>>
>> Sorry I made a mistake.
>>
>> I mean from user's view, 
>> he has to type list/del/stop from cmdlind to control the daemon.
>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
>>
>> This is a little weird if we try to make a ubd library.
>> So I actually provides APIs in libubd for users to do these list/del/stop works.
> 
> OK, that is fine to export APIs for admin purpose.
> 
>>
>>
>>>>
>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
>>>>     The data flow is:
>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
>>>>     Since ubdsrv does not export its internal data buffer to backend,
>>>>     the second copy is unavoidable. 
>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
>>>
>>> The preallocation is just virtual memory, which is cheap and not pinned, but
>>> ubdsrv does support buffer provided by io command, see:
>>>
>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>
>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
>> but you did not reply :)
>>
>> I paste it here:
>>
>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
>>
>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> 
> Usually posix_memalign just allocates virtual memory which is unlimited
> in 64bit arch, and pages should be allocated until the buffer is read or write.
> After the READ/WRITE is done, kernel still can reclaim the pages in this
> virtual memory.
> 
> In future, we still may optimize the memory uses via madvise, such as
> MADV_DONTNEED, after the slot is idle for long enough.

OK, thanks for the explanation.

> 
>>
>> Another IMPORTANT problem is your commit:
>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>> may be not helpful for WRITE requests if I understand correctly.
>>
>> Consider this data flow:
>>
>> 1. ubdsrv commits an IO req(req1, a READ req).
>>
>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
>>    addr1 is the addr of buffer user passed.
>>    
>>
>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
>>
>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
>>
>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
>>
>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
>>
>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
>>    cmd(with buffer set to addr1).
>>
>>
>>
>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
>> until you get one in queue_rq. So length of addr1 may be less than data_len.
> 
> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> waste, but process virtual address is unlimited for 64bit arch, and
> pages are allocated until actual read/write is started.

OK, since I allow users to configure rq_max_blocks in libubd,
it's the users' responsibility to ensure that the length of the user buffers
is at least rq_max_blocks.

Now I agree with your commit:
https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d

Providing the WRITE buffer in advance (when sending COMMIT_AND_FETCH) seems OK :)

> 
>>>
>>>>
>>>> To better use ubd in more complicated scenarios, we have developed libubd.
>>>> It does not assume implementation of backend and can be embedded into it.
>>>> We refer to the code structure of tcmu-runner[4], 
>>>> which includes a library(libtcmu) for users 
>>>> to embed tcmu-runner inside backend's code. 
>>>> It:
>>>>
>>>> (1) Does not fork/pthread_create but embedded in backend's threads
>>>
>>> That is because your backend may not use io_uring, I guess.
>>>
>>> But it is pretty easy to move the decision of creating pthread to target
>>> code, which can be done in the interface of .prepare_target().
>>
>> I think the library should not create any thread if we want a libubd.
> 
> I Agree.
> 
>>
>>>
>>>>
>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
>>>>     and fetch/commit IO requests
>>>
>>> The above could be the main job of libubd.
>>
>> indeed.
>>
>>>
>>>>
>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
>>>>     since the backend actually has no knowledge 
>>>>     on incoming data size until it gets an IO descriptor.
>>>
>>> I can understand your requirement, not look at your code yet, but libubd
>>> should be pretty thin from function viewpoint, and there are lots of common
>>> things to abstract/share among all drivers, please see recent ubdsrv change:
>>>
>>> https://github.com/ming1/ubdsrv/commits/master
>>>
>>> in which:
>>> 	- coroutine is added for handling target io
>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
>>> 	supporting complicated target
>>> 	- c++ support
>>
>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
>> I think it is actually target(backend) design and ubd should not assume 
>> how the backend handle IOs. 
>>
>> The work ubd in userspace has to be done is:
>>
>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
>>
>> 2) get IO completion form backend, such as ubd_complete_io_requests()
> 
> Or the user provides/registers two callbacks: handle_io_async() and
> io_complete(), the former is called when one request comes from ubd
> driver, the latter(optional) is called when one io is done.
> 
> Also you didn't mention how you notify io_uring about io completion after
> io_uring_enter() is slept if your backend code doesn't use io_uring to
> handle io.
> 
> I think one communication mechanism(such as eventfd) is needed for your
> case.

Ok, I will try eventfd with io_uring.

> 
>>
>>
>>
>>>
>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
>>> into ubdsrv easily. The potential users could be existed usersapce
>>> block driver projects.
>>
>> Yes, so many userspace storage systems can use ubd!
>> You may look at tcmu-runner. It:
>>
>> 1) provides a library(libtcmu.c) for those who have a existing backend.
>>
>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
>>    for those who just want to run it. 
>>    And the runner is build on top of libtcmu.
>>
>>>
>>> If you don't object, I am happy to co-work with you to add the support
>>> for libubd in ubdsrv, then we can avoid to invent a wheel
>>
>> +1 :)
> 
> Thinking of further, I'd suggest to split ubdsrv into two parts:
> 
> 1) libubdsrv
> - provide APIs like what you did in libubd
> - provide API for notify io_uring(handling io command) that one io is
> completed, and the API should support handling IO from other context
> (not same with the io_uring context for handling io command).
> 
> 2) ubd target
> - built on libubdsrv, such as ubd command is built on libubdsrv, and
> specific target implementation is built on the library too.
> 
> It shouldn't be hard to work towards this direction, and I guess this
> way should make current target implementation more clean.
> 

Yes, this is like tcmu-runner's structure: a libtcmu plus some targets.
Thanks, Ming. Glad to work with you.

I will take your advice and improve libubd (the communication mechanism, maybe eventfd).

>>
>>>
>>>>
>>>> Note: 
>>>>
>>>> (1) libubd is just a POC demo and is not stick to the principles of
>>>>     designing a library and we are still developing it now...
>>>>
>>>> (2) The repo[5] including some useful examples using libubd. 
>>>>
>>>> (3) We modify the kernel part: ubd_drv.c and 
>>>>     it[6] is against Ming Lei's newest branch[2]
>>>>     because we forked our branch from his early branch
>>>>     (v5.17-ubd-dev).
>>>
>>> Please look at the following tree for ubd driver:
>>>
>>> https://github.com/ming1/linux/tree/my_for-5.19-ubd-devel_v3
>>>
>>> in which most of your change should have been there already.
>>>
>>> I will post v3 soon, please feel free to review after it is out and
>>> see if it is fine for you.
>>
>> Yes, I have read your newest branch.
>> You use some task_work() functions in ubd_drv.c 
>> for error-handling such as aborting IO.
> 
> The IO aborting has to be supported, otherwise what if the io_uring
> context(pthread) is killed, and what if the device is removed when
> handling IO.

OK, for now libubd just focuses on performance.
I will try to add these error-handling methods.

> 
> ubdsrv/tests/generic provides two tests for deleting ubd & killing
> per-queue pthread meantime with heavy IO.
> 
>>
>> But I find they are too complicated to understand 
>> and it's hard to write libubd code in this branch.
> 
> Actually the latest ubdsrv code becomes much clean, and should be
> easier to abstract APIs for external/binary target.
> 
>>
>> So I choose your first(easiest to understand)
>> version: v5.17-ubd-dev.
> 
> That code is outdated, and full of bugs, :-(

Ok, I will try to follow your newest branch.

> 
> 
> Thanks,
> Ming


* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-30  7:16       ` Ziyang Zhang
@ 2022-06-30  9:09         ` Ming Lei
  2022-06-30  9:29           ` Ziyang Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-06-30  9:09 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> Hi, Ming
> 
> On 2022/6/29 19:33, Ming Lei wrote:
> > On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >> Hi Ming,
> >>
> >> On 2022/6/27 23:29, Ming Lei wrote:
> >>> Hi Ziyang,
> >>>
> >>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>> Hi Ming,
> >>>>
> >>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>> This article explains why we need libubd and how we design it.
> >>>>
> >>>> Related threads:
> >>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>>>
> >>>>
> >>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>> allows users to define their own backend storage in userspace
> >>>> and provides block devices such as /dev/ubdbX.
> >>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>> and userspace code: ubdsrv[3].
> >>>>
> >>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>> is pretty well-designed.
> >>>>
> >>>> ubdsrv is implemented by a single daemon
> >>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>> is embedded in the daemon. 
> >>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>> by our backend.
> >>>
> >>> ubd is supposed to provide one generic framework for user space block
> >>> driver, and it can be used for doing lots of fun/useful thing.
> >>>
> >>> If I understand correctly, this isn't same with your use case:
> >>>
> >>> 1) your user space block driver isn't generic, and should be dedicated
> >>> for Alibaba's uses
> >>>
> >>> 2) your case has been there for long time, and you want to switch from other
> >>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>
> >>
> >> Yes, you are correct :)
> >> The idea of design libubd is actually from libtcmu.
> >>
> >> We do have some userspace storage system as the IO handling backend, 
> >> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>
> >>
> >> I think your motivation is that provides a complete user block driver to users
> >> and they DO NOT change any code.
> >> Users DO change their code using libubd for embedding libubd into the backend.
> >>
> >>
> >>>> First is description of our backend:
> >>>>
> >>>> (1) a distributing system sends/receives IO requests 
> >>>>     through network.
> >>>>
> >>>> (2) The system use RPC calls among hundreds of
> >>>>      storage servers and RPC calls are associated with data buffers
> >>>>      allocated from a memory pool.
> >>>>
> >>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>      many threads to handle IO requests and manage the device. 
> >>>>
> >>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>
> >>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>     uses something like RPC and does not support io_uring.
> >>>
> >>> As one generic framework, the io command has to be io_uring
> >>> passthrough, and the io doesn't have to be handled by io_uring.
> >>
> >> Yes, our backend define its own communicating method.
> >>
> >>>
> >>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>> (io uring) as the 1st citizen in the framework, especially for new
> >>> driver.
> >>>
> >>> But it can support other way really, such as use io_uring with eventfd,
> >>> the other userspace context can handle io, then wake up io_uring context
> >>> via eventfd. You may not use io_uring for handling io, but you still
> >>> need to communicate with the context for handling io_uring passthrough
> >>> command, and one mechanism(such as eventfd) has to be there for the
> >>> communication.
> >>
> >> Ok, eventfd may be helpful. 
> >> If you read my API, you may find ubdlib_complete_io_request().
> >> I think the backend io worker thread can call this function to tell the 
> >> ubd queue thread(the io_uring context in it) to commit the IO.
> > 
> > The ubdlib_complete_io_request() has to be called in the same pthread
> > context, that looks not flexible. When you handle IO via non-io_uring in the same
> > context, the cpu utilization in submission/completion side should be
> > higher than io_uring. And this way should be worse than the usage in
> > ubd/loop, that is why I suggest to use one io_uring for handling both
> > io command and io request if possible.
> 
> ubdlib_complete_io_request() can be called in the io worker thread,
> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> 
> You can find ubd_runner.c in my libubd repo. There are many io worker
> threads for each ubdsrv queue to handle IO requests.
> 
> Actually this idea comes from tcmu-runner. The data flow is:
> 
> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> 
> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
>    
>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
>     
>    for other simple(can be handled very quickly), 
>    handle it right now and call ubdlib_complete_io_request()
> 
> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
>    and setup sqe if one IO(IO completion) is ready to commit.
>    
>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> 
> 
> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
>    
> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
>    
>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
>    This function can mark this  IO req's status to ready to commit.
> 
> IO handling/completion and io_uring_enter() can happen at the same time.
> 
> Besides, io_uring_enter can:
> 
> 1) block and wait for cqes until at least
> one blk-mq req comes from queue_rq()
> 
> 2) submit sqes(with last IO completion and next fetch)
> 
> so I have to consider how to notify io_uring about io completion 
> after io_uring_enter() is slept(block and wait for cqes).

Yeah, that was exactly my question, :-)

> 
> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> io_uring_enter_timeout() and caller can set a timeout value for it.
> So IO completions happen after io_uring_enter_timeout() call can be committed
> by next io_uring_enter_timeout() call...
> 
> But this is a very ugly implementation 
> because I may waste CPU on useless loops in ubdsrv queue thread if
> blk-mq reqs do not income frequently.
> 
> You mentioned that eventfd may be helpful and I agree with you. :)
> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> in  ubdlib_complete_io_request().
> 
> I will fix my code.

FYI, here is one example of using eventfd to wake up io_uring, which
could be added to the library for your use case:

https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
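
The idea is basically to arm a read on an eventfd inside the same ring, so
that a plain eventfd_write() from any io worker thread produces a cqe and
wakes io_uring_enter(). Below is a minimal, self-contained sketch of just
the wakeup part (not ubd code; error handling dropped, and the ubd-specific
work is only hinted at in comments):

/* build: gcc -o efd_wake efd_wake.c -luring -lpthread */
#include <liburing.h>
#include <sys/eventfd.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int efd;

/* stands in for an io worker thread which finishes one io and then
 * calls something like ubdlib_complete_io_request() */
static void *io_worker(void *arg)
{
	sleep(1);			/* pretend to handle a READ/WRITE */
	eventfd_write(efd, 1);		/* wake the queue thread's ring */
	return NULL;
}

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	eventfd_t v;
	pthread_t t;

	efd = eventfd(0, 0);
	io_uring_queue_init(8, &ring, 0);

	/* arm a READ on the eventfd: its cqe is the wakeup signal,
	 * just like the FETCH_REQ cqes signal new blk-mq requests */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, efd, &v, sizeof(v), 0);
	io_uring_sqe_set_data(sqe, (void *)1);	/* tag: eventfd wakeup */
	io_uring_submit(&ring);

	pthread_create(&t, NULL, io_worker, NULL);

	/* queue thread: blocked in io_uring_enter() waiting for cqes */
	io_uring_wait_cqe(&ring, &cqe);
	if (io_uring_cqe_get_data(cqe) == (void *)1)
		printf("woken by io worker; commit ready ios now\n");
	io_uring_cqe_seen(&ring, cqe);

	pthread_join(t, NULL);
	io_uring_queue_exit(&ring);
	return 0;
}

In libubd terms the io worker would call eventfd_write() from
ubdlib_complete_io_request(), and the queue thread, on seeing the eventfd
cqe, would re-arm the read and set up COMMIT_AND_FETCH sqes for the slots
that are ready to commit.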

> 
> > 
> >>
> >>
> >>
> >>>
> >>>>
> >>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>     the daemon. It is inconvenient for our backend
> >>>>     because it has threads(from a C++ thread library) running inside.
> >>>
> >>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>
> >>
> >>
> >> Sorry I made a mistake.
> >>
> >> I mean from user's view, 
> >> he has to type list/del/stop from cmdlind to control the daemon.
> >> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>
> >> This is a little weird if we try to make a ubd library.
> >> So I actually provides APIs in libubd for users to do these list/del/stop works.
> > 
> > OK, that is fine to export APIs for admin purpose.
> > 
> >>
> >>
> >>>>
> >>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>     The data flow is:
> >>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>     the second copy is unavoidable. 
> >>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>
> >>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>> ubdsrv does support buffer provided by io command, see:
> >>>
> >>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>
> >> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >> but you did not reply :)
> >>
> >> I paste it here:
> >>
> >> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>
> >> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> > 
> > Usually posix_memalign just allocates virtual memory which is unlimited
> > in 64bit arch, and pages should be allocated until the buffer is read or write.
> > After the READ/WRITE is done, kernel still can reclaim the pages in this
> > virtual memory.
> > 
> > In future, we still may optimize the memory uses via madvise, such as
> > MADV_DONTNEED, after the slot is idle for long enough.
> 
> Ok, thanks for explanation. 
> 
> > 
> >>
> >> Another IMPORTANT problem is your commit:
> >> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >> may be not helpful for WRITE requests if I understand correctly.
> >>
> >> Consider this data flow:
> >>
> >> 1. ubdsrv commits an IO req(req1, a READ req).
> >>
> >> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>    addr1 is the addr of buffer user passed.
> >>    
> >>
> >> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>
> >> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>
> >> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>
> >> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>
> >> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>    cmd(with buffer set to addr1).
> >>
> >>
> >>
> >> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >> until you get one in queue_rq. So length of addr1 may be less than data_len.
> > 
> > So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> > we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> > waste, but process virtual address is unlimited for 64bit arch, and
> > pages are allocated until actual read/write is started.
> 
> Ok, since I allow users to config rq_max_blocks in libubd, 
> it's users' responsibility to ensure length of user buffers
> is at least rq_max_blocks.
> 
> Now I agree on your commit:
> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> 
> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> 
> > 
> >>>
> >>>>
> >>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>> It does not assume implementation of backend and can be embedded into it.
> >>>> We refer to the code structure of tcmu-runner[4], 
> >>>> which includes a library(libtcmu) for users 
> >>>> to embed tcmu-runner inside backend's code. 
> >>>> It:
> >>>>
> >>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>
> >>> That is because your backend may not use io_uring, I guess.
> >>>
> >>> But it is pretty easy to move the decision of creating pthread to target
> >>> code, which can be done in the interface of .prepare_target().
> >>
> >> I think the library should not create any thread if we want a libubd.
> > 
> > I Agree.
> > 
> >>
> >>>
> >>>>
> >>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>     and fetch/commit IO requests
> >>>
> >>> The above could be the main job of libubd.
> >>
> >> indeed.
> >>
> >>>
> >>>>
> >>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>     since the backend actually has no knowledge 
> >>>>     on incoming data size until it gets an IO descriptor.
> >>>
> >>> I can understand your requirement, not look at your code yet, but libubd
> >>> should be pretty thin from function viewpoint, and there are lots of common
> >>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>
> >>> https://github.com/ming1/ubdsrv/commits/master
> >>>
> >>> in which:
> >>> 	- coroutine is added for handling target io
> >>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>> 	supporting complicated target
> >>> 	- c++ support
> >>
> >> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >> I think it is actually target(backend) design and ubd should not assume 
> >> how the backend handle IOs. 
> >>
> >> The work ubd in userspace has to be done is:
> >>
> >> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>
> >> 2) get IO completion form backend, such as ubd_complete_io_requests()
> > 
> > Or the user provides/registers two callbacks: handle_io_async() and
> > io_complete(), the former is called when one request comes from ubd
> > driver, the latter(optional) is called when one io is done.
> > 
> > Also you didn't mention how you notify io_uring about io completion after
> > io_uring_enter() is slept if your backend code doesn't use io_uring to
> > handle io.
> > 
> > I think one communication mechanism(such as eventfd) is needed for your
> > case.
> 
> Ok, I will try eventfd with io_uring.
> 
> > 
> >>
> >>
> >>
> >>>
> >>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>> into ubdsrv easily. The potential users could be existed usersapce
> >>> block driver projects.
> >>
> >> Yes, so many userspace storage systems can use ubd!
> >> You may look at tcmu-runner. It:
> >>
> >> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>
> >> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>    for those who just want to run it. 
> >>    And the runner is build on top of libtcmu.
> >>
> >>>
> >>> If you don't object, I am happy to co-work with you to add the support
> >>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>
> >> +1 :)
> > 
> > Thinking of further, I'd suggest to split ubdsrv into two parts:
> > 
> > 1) libubdsrv
> > - provide APIs like what you did in libubd
> > - provide API for notify io_uring(handling io command) that one io is
> > completed, and the API should support handling IO from other context
> > (not same with the io_uring context for handling io command).
> > 
> > 2) ubd target
> > - built on libubdsrv, such as ubd command is built on libubdsrv, and
> > specific target implementation is built on the library too.
> > 
> > It shouldn't be hard to work towards this direction, and I guess this
> > way should make current target implementation more clean.
> > 
> 
> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> Thanks, Ming.  Glad to co-work with you.
> 
> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).

I have added a libublk branch for working towards this direction. If we
co-work on libublk, please write patches against this branch, then I can
apply your patches directly.

https://github.com/ming1/ubdsrv/tree/libublk

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-30  9:09         ` Ming Lei
@ 2022-06-30  9:29           ` Ziyang Zhang
  2022-06-30 11:45             ` Ming Lei
  2022-07-04  4:08             ` Ming Lei
  0 siblings, 2 replies; 12+ messages in thread
From: Ziyang Zhang @ 2022-06-30  9:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

On 2022/6/30 17:09, Ming Lei wrote:
> On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
>> Hi, Ming
>>
>> On 2022/6/29 19:33, Ming Lei wrote:
>>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
>>>> Hi Ming,
>>>>
>>>> On 2022/6/27 23:29, Ming Lei wrote:
>>>>> Hi Ziyang,
>>>>>
>>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
>>>>>> Hi Ming,
>>>>>>
>>>>>> We are learning your ubd code and developing a library: libubd for ubd.
>>>>>> This article explains why we need libubd and how we design it.
>>>>>>
>>>>>> Related threads:
>>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
>>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
>>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
>>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
>>>>>>
>>>>>>
>>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
>>>>>> allows users to define their own backend storage in userspace
>>>>>> and provides block devices such as /dev/ubdbX.
>>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
>>>>>> and userspace code: ubdsrv[3].
>>>>>>
>>>>>> ubd_drv.c simply passes all blk-mq IO requests
>>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
>>>>>> is pretty well-designed.
>>>>>>
>>>>>> ubdsrv is implemented by a single daemon
>>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
>>>>>> is embedded in the daemon. 
>>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
>>>>>> by our backend.
>>>>>
>>>>> ubd is supposed to provide one generic framework for user space block
>>>>> driver, and it can be used for doing lots of fun/useful thing.
>>>>>
>>>>> If I understand correctly, this isn't same with your use case:
>>>>>
>>>>> 1) your user space block driver isn't generic, and should be dedicated
>>>>> for Alibaba's uses
>>>>>
>>>>> 2) your case has been there for long time, and you want to switch from other
>>>>> approach(maybe tcmu) to ubd given ubd has better performance.
>>>>>
>>>>
>>>> Yes, you are correct :)
>>>> The idea of design libubd is actually from libtcmu.
>>>>
>>>> We do have some userspace storage system as the IO handling backend, 
>>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
>>>>
>>>>
>>>> I think your motivation is that provides a complete user block driver to users
>>>> and they DO NOT change any code.
>>>> Users DO change their code using libubd for embedding libubd into the backend.
>>>>
>>>>
>>>>>> First is description of our backend:
>>>>>>
>>>>>> (1) a distributing system sends/receives IO requests 
>>>>>>     through network.
>>>>>>
>>>>>> (2) The system use RPC calls among hundreds of
>>>>>>      storage servers and RPC calls are associated with data buffers
>>>>>>      allocated from a memory pool.
>>>>>>
>>>>>> (3) On each server for each device(/dev/vdX), our backend runs
>>>>>>      many threads to handle IO requests and manage the device. 
>>>>>>
>>>>>> Second are reasons why ubdsrv is hard to use for us:
>>>>>>
>>>>>> (1) ubdsrv requires the target(backend) issues IO requests
>>>>>>     to the io_uring provided by ubdsrv but our backend 
>>>>>>     uses something like RPC and does not support io_uring.
>>>>>
>>>>> As one generic framework, the io command has to be io_uring
>>>>> passthrough, and the io doesn't have to be handled by io_uring.
>>>>
>>>> Yes, our backend define its own communicating method.
>>>>
>>>>>
>>>>> But IMO io_uring is much more efficient, so I'd try to make async io
>>>>> (io uring) as the 1st citizen in the framework, especially for new
>>>>> driver.
>>>>>
>>>>> But it can support other way really, such as use io_uring with eventfd,
>>>>> the other userspace context can handle io, then wake up io_uring context
>>>>> via eventfd. You may not use io_uring for handling io, but you still
>>>>> need to communicate with the context for handling io_uring passthrough
>>>>> command, and one mechanism(such as eventfd) has to be there for the
>>>>> communication.
>>>>
>>>> Ok, eventfd may be helpful. 
>>>> If you read my API, you may find ubdlib_complete_io_request().
>>>> I think the backend io worker thread can call this function to tell the 
>>>> ubd queue thread(the io_uring context in it) to commit the IO.
>>>
>>> The ubdlib_complete_io_request() has to be called in the same pthread
>>> context, that looks not flexible. When you handle IO via non-io_uring in the same
>>> context, the cpu utilization in submission/completion side should be
>>> higher than io_uring. And this way should be worse than the usage in
>>> ubd/loop, that is why I suggest to use one io_uring for handling both
>>> io command and io request if possible.
>>
>> ubdlib_complete_io_request() can be called in the io worker thread,
>> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
>>
>> You can find ubd_runner.c in my libubd repo. There are many io worker
>> threads for each ubdsrv queue to handle IO requests.
>>
>> Actually this idea comes from tcmu-runner. The data flow is:
>>
>> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
>>
>> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
>>    
>>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
>>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
>>     
>>    for other simple(can be handled very quickly), 
>>    handle it right now and call ubdlib_complete_io_request()
>>
>> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
>>    and setup sqe if one IO(IO completion) is ready to commit.
>>    
>>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
>>
>>
>> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
>>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
>>    
>> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
>>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
>>    
>>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
>>    This function can mark this  IO req's status to ready to commit.
>>
>> IO handling/completion and io_uring_enter() can happen at the same time.
>>
>> Besides, io_uring_enter can:
>>
>> 1) block and wait for cqes until at least
>> one blk-mq req comes from queue_rq()
>>
>> 2) submit sqes(with last IO completion and next fetch)
>>
>> so I have to consider how to notify io_uring about io completion 
>> after io_uring_enter() is slept(block and wait for cqes).
> 
> Yeah, that was exactly my question, :-)
> 
>>
>> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
>> io_uring_enter_timeout() and caller can set a timeout value for it.
>> So IO completions happen after io_uring_enter_timeout() call can be committed
>> by next io_uring_enter_timeout() call...
>>
>> But this is a very ugly implementation 
>> because I may waste CPU on useless loops in ubdsrv queue thread if
>> blk-mq reqs do not income frequently.
>>
>> You mentioned that eventfd may be helpful and I agree with you. :)
>> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
>> in  ubdlib_complete_io_request().
>>
>> I will fix my code.
> 
> FYI, there is one example about using eventfd to wakeup io_uring, which
> can be added to the library for your usecase:
> 
> https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b

Thanks, I will take a look.

> 
>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> (2) ubdsrv forks a daemon and it takes over everything.
>>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
>>>>>>     the daemon. It is inconvenient for our backend
>>>>>>     because it has threads(from a C++ thread library) running inside.
>>>>>
>>>>> No, list/stop/del won't interact with the daemon, and the per-queue
>>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
>>>>>
>>>>
>>>>
>>>> Sorry I made a mistake.
>>>>
>>>> I mean from user's view, 
>>>> he has to type list/del/stop from cmdlind to control the daemon.
>>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
>>>>
>>>> This is a little weird if we try to make a ubd library.
>>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
>>>
>>> OK, that is fine to export APIs for admin purpose.
>>>
>>>>
>>>>
>>>>>>
>>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
>>>>>>     The data flow is:
>>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
>>>>>>     Since ubdsrv does not export its internal data buffer to backend,
>>>>>>     the second copy is unavoidable. 
>>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
>>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
>>>>>
>>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
>>>>> ubdsrv does support buffer provided by io command, see:
>>>>>
>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>>>
>>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
>>>> but you did not reply :)
>>>>
>>>> I paste it here:
>>>>
>>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
>>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
>>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
>>>>
>>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
>>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
>>>
>>> Usually posix_memalign just allocates virtual memory which is unlimited
>>> in 64bit arch, and pages should be allocated until the buffer is read or write.
>>> After the READ/WRITE is done, kernel still can reclaim the pages in this
>>> virtual memory.
>>>
>>> In future, we still may optimize the memory uses via madvise, such as
>>> MADV_DONTNEED, after the slot is idle for long enough.
>>
>> Ok, thanks for explanation. 
>>
>>>
>>>>
>>>> Another IMPORTANT problem is your commit:
>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>>> may be not helpful for WRITE requests if I understand correctly.
>>>>
>>>> Consider this data flow:
>>>>
>>>> 1. ubdsrv commits an IO req(req1, a READ req).
>>>>
>>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
>>>>    addr1 is the addr of buffer user passed.
>>>>    
>>>>
>>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
>>>>
>>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
>>>>
>>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
>>>>
>>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
>>>>
>>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
>>>>    cmd(with buffer set to addr1).
>>>>
>>>>
>>>>
>>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
>>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
>>>
>>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
>>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
>>> waste, but process virtual address is unlimited for 64bit arch, and
>>> pages are allocated until actual read/write is started.
>>
>> Ok, since I allow users to config rq_max_blocks in libubd, 
>> it's users' responsibility to ensure length of user buffers
>> is at least rq_max_blocks.
>>
>> Now I agree on your commit:
>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>
>> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
>>
>>>
>>>>>
>>>>>>
>>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
>>>>>> It does not assume implementation of backend and can be embedded into it.
>>>>>> We refer to the code structure of tcmu-runner[4], 
>>>>>> which includes a library(libtcmu) for users 
>>>>>> to embed tcmu-runner inside backend's code. 
>>>>>> It:
>>>>>>
>>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
>>>>>
>>>>> That is because your backend may not use io_uring, I guess.
>>>>>
>>>>> But it is pretty easy to move the decision of creating pthread to target
>>>>> code, which can be done in the interface of .prepare_target().
>>>>
>>>> I think the library should not create any thread if we want a libubd.
>>>
>>> I Agree.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
>>>>>>     and fetch/commit IO requests
>>>>>
>>>>> The above could be the main job of libubd.
>>>>
>>>> indeed.
>>>>
>>>>>
>>>>>>
>>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
>>>>>>     since the backend actually has no knowledge 
>>>>>>     on incoming data size until it gets an IO descriptor.
>>>>>
>>>>> I can understand your requirement, not look at your code yet, but libubd
>>>>> should be pretty thin from function viewpoint, and there are lots of common
>>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
>>>>>
>>>>> https://github.com/ming1/ubdsrv/commits/master
>>>>>
>>>>> in which:
>>>>> 	- coroutine is added for handling target io
>>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
>>>>> 	supporting complicated target
>>>>> 	- c++ support
>>>>
>>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
>>>> I think it is actually target(backend) design and ubd should not assume 
>>>> how the backend handle IOs. 
>>>>
>>>> The work ubd in userspace has to be done is:
>>>>
>>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
>>>>
>>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
>>>
>>> Or the user provides/registers two callbacks: handle_io_async() and
>>> io_complete(), the former is called when one request comes from ubd
>>> driver, the latter(optional) is called when one io is done.
>>>
>>> Also you didn't mention how you notify io_uring about io completion after
>>> io_uring_enter() is slept if your backend code doesn't use io_uring to
>>> handle io.
>>>
>>> I think one communication mechanism(such as eventfd) is needed for your
>>> case.
>>
>> Ok, I will try eventfd with io_uring.
>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
>>>>> into ubdsrv easily. The potential users could be existed usersapce
>>>>> block driver projects.
>>>>
>>>> Yes, so many userspace storage systems can use ubd!
>>>> You may look at tcmu-runner. It:
>>>>
>>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
>>>>
>>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
>>>>    for those who just want to run it. 
>>>>    And the runner is build on top of libtcmu.
>>>>
>>>>>
>>>>> If you don't object, I am happy to co-work with you to add the support
>>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
>>>>
>>>> +1 :)
>>>
>>> Thinking of further, I'd suggest to split ubdsrv into two parts:
>>>
>>> 1) libubdsrv
>>> - provide APIs like what you did in libubd
>>> - provide API for notify io_uring(handling io command) that one io is
>>> completed, and the API should support handling IO from other context
>>> (not same with the io_uring context for handling io command).
>>>
>>> 2) ubd target
>>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
>>> specific target implementation is built on the library too.
>>>
>>> It shouldn't be hard to work towards this direction, and I guess this
>>> way should make current target implementation more clean.
>>>
>>
>> Yes, this is like tcmu-runner's structure: a libtcmu and some target
>> Thanks, Ming.  Glad to co-work with you.
>>
>> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
> 
> I have added libublk branch for working towards this direction, if we
> cowork on libublk, please write patch against this branch, then I can
> apply your patch directly.
> 
> https://github.com/ming1/ubdsrv/tree/libublk

OK, but it concerns me that libubdsrv may change the current ubdsrv project's
structure a lot, because:

1) the target implementation will be built on top of libubdsrv, and the target
   should create the pthread (ubdsrv queue loop) itself.

2) the pthread/process (daemon) handling has to be moved out of the current
   ubdsrv to build libubdsrv. That would really be a hard job. :-(

If you agree to adjust the current ubdsrv project's structure, I will start
right now.
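
To make 1) and 2) more concrete, the interface split I have in mind looks
roughly like the header below. This is only a sketch for discussion: every
name is a placeholder (handle_io_async()/io_complete() are the callbacks you
suggested, ubdlib_complete_io_request() is the one from my current libubd,
with the signature simplified here), not a proposal for the final API.

/* sketch only: all names are placeholders */
struct ubdlib_dev;
struct ubdlib_queue;			/* one per ubd hw queue */
struct ubdsrv_io_desc;			/* io descriptor filled by the driver */
struct ubdlib_dev_info;

struct ubdlib_tgt_ops {
	/* called in the queue thread when one blk-mq request arrives */
	void (*handle_io_async)(struct ubdlib_queue *q, int tag,
				const struct ubdsrv_io_desc *iod);
	/* optional: called in the queue thread after one io is committed */
	void (*io_complete)(struct ubdlib_queue *q, int tag, int res);
};

/* admin side: add/del/list devices; no daemon inside the library */
int ubdlib_dev_add(const struct ubdlib_dev_info *info);
int ubdlib_dev_del(int dev_id);

/* data path: the target creates its own pthread per queue and runs this
 * loop (io_uring_enter() + reap cqes + invoke the callbacks above) */
int ubdlib_queue_loop(struct ubdlib_queue *q, const struct ubdlib_tgt_ops *ops);

/* may be called from any io worker thread; wakes the queue's io_uring,
 * e.g. via an eventfd read armed in the same ring */
void ubdlib_complete_io_request(struct ubdlib_queue *q, int tag, int res);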

> 
> Thanks, 
> Ming

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-30  9:29           ` Ziyang Zhang
@ 2022-06-30 11:45             ` Ming Lei
  2022-07-04  4:08             ` Ming Lei
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-06-30 11:45 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote:
> On 2022/6/30 17:09, Ming Lei wrote:
> > On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> >> Hi, Ming
> >>
> >> On 2022/6/29 19:33, Ming Lei wrote:
> >>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >>>> Hi Ming,
> >>>>
> >>>> On 2022/6/27 23:29, Ming Lei wrote:
> >>>>> Hi Ziyang,
> >>>>>
> >>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>>>> Hi Ming,
> >>>>>>
> >>>>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>>>> This article explains why we need libubd and how we design it.
> >>>>>>
> >>>>>> Related threads:
> >>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>>>>>
> >>>>>>
> >>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>>>> allows users to define their own backend storage in userspace
> >>>>>> and provides block devices such as /dev/ubdbX.
> >>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>>>> and userspace code: ubdsrv[3].
> >>>>>>
> >>>>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>>>> is pretty well-designed.
> >>>>>>
> >>>>>> ubdsrv is implemented by a single daemon
> >>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>>>> is embedded in the daemon. 
> >>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>>>> by our backend.
> >>>>>
> >>>>> ubd is supposed to provide one generic framework for user space block
> >>>>> driver, and it can be used for doing lots of fun/useful thing.
> >>>>>
> >>>>> If I understand correctly, this isn't same with your use case:
> >>>>>
> >>>>> 1) your user space block driver isn't generic, and should be dedicated
> >>>>> for Alibaba's uses
> >>>>>
> >>>>> 2) your case has been there for long time, and you want to switch from other
> >>>>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>>>
> >>>>
> >>>> Yes, you are correct :)
> >>>> The idea of design libubd is actually from libtcmu.
> >>>>
> >>>> We do have some userspace storage system as the IO handling backend, 
> >>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>>>
> >>>>
> >>>> I think your motivation is that provides a complete user block driver to users
> >>>> and they DO NOT change any code.
> >>>> Users DO change their code using libubd for embedding libubd into the backend.
> >>>>
> >>>>
> >>>>>> First is description of our backend:
> >>>>>>
> >>>>>> (1) a distributing system sends/receives IO requests 
> >>>>>>     through network.
> >>>>>>
> >>>>>> (2) The system use RPC calls among hundreds of
> >>>>>>      storage servers and RPC calls are associated with data buffers
> >>>>>>      allocated from a memory pool.
> >>>>>>
> >>>>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>>>      many threads to handle IO requests and manage the device. 
> >>>>>>
> >>>>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>>>
> >>>>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>>>     uses something like RPC and does not support io_uring.
> >>>>>
> >>>>> As one generic framework, the io command has to be io_uring
> >>>>> passthrough, and the io doesn't have to be handled by io_uring.
> >>>>
> >>>> Yes, our backend define its own communicating method.
> >>>>
> >>>>>
> >>>>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>>>> (io uring) as the 1st citizen in the framework, especially for new
> >>>>> driver.
> >>>>>
> >>>>> But it can support other way really, such as use io_uring with eventfd,
> >>>>> the other userspace context can handle io, then wake up io_uring context
> >>>>> via eventfd. You may not use io_uring for handling io, but you still
> >>>>> need to communicate with the context for handling io_uring passthrough
> >>>>> command, and one mechanism(such as eventfd) has to be there for the
> >>>>> communication.
> >>>>
> >>>> Ok, eventfd may be helpful. 
> >>>> If you read my API, you may find ubdlib_complete_io_request().
> >>>> I think the backend io worker thread can call this function to tell the 
> >>>> ubd queue thread(the io_uring context in it) to commit the IO.
> >>>
> >>> The ubdlib_complete_io_request() has to be called in the same pthread
> >>> context, that looks not flexible. When you handle IO via non-io_uring in the same
> >>> context, the cpu utilization in submission/completion side should be
> >>> higher than io_uring. And this way should be worse than the usage in
> >>> ubd/loop, that is why I suggest to use one io_uring for handling both
> >>> io command and io request if possible.
> >>
> >> ubdlib_complete_io_request() can be called in the io worker thread,
> >> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> >>
> >> You can find ubd_runner.c in my libubd repo. There are many io worker
> >> threads for each ubdsrv queue to handle IO requests.
> >>
> >> Actually this idea comes from tcmu-runner. The data flow is:
> >>
> >> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> >>
> >> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
> >>    
> >>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
> >>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
> >>     
> >>    for other simple(can be handled very quickly), 
> >>    handle it right now and call ubdlib_complete_io_request()
> >>
> >> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
> >>    and setup sqe if one IO(IO completion) is ready to commit.
> >>    
> >>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> >>
> >>
> >> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
> >>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
> >>    
> >> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
> >>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
> >>    
> >>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
> >>    This function can mark this  IO req's status to ready to commit.
> >>
> >> IO handling/completion and io_uring_enter() can happen at the same time.
> >>
> >> Besides, io_uring_enter can:
> >>
> >> 1) block and wait for cqes until at least
> >> one blk-mq req comes from queue_rq()
> >>
> >> 2) submit sqes(with last IO completion and next fetch)
> >>
> >> so I have to consider how to notify io_uring about io completion 
> >> after io_uring_enter() is slept(block and wait for cqes).
> > 
> > Yeah, that was exactly my question, :-)
> > 
> >>
> >> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> >> io_uring_enter_timeout() and caller can set a timeout value for it.
> >> So IO completions happen after io_uring_enter_timeout() call can be committed
> >> by next io_uring_enter_timeout() call...
> >>
> >> But this is a very ugly implementation 
> >> because I may waste CPU on useless loops in ubdsrv queue thread if
> >> blk-mq reqs do not income frequently.
> >>
> >> You mentioned that eventfd may be helpful and I agree with you. :)
> >> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> >> in  ubdlib_complete_io_request().
> >>
> >> I will fix my code.
> > 
> > FYI, there is one example about using eventfd to wakeup io_uring, which
> > can be added to the library for your usecase:
> > 
> > https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
> 
> Thanks, will take a view.
> 
> > 
> >>
> >>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>>>     the daemon. It is inconvenient for our backend
> >>>>>>     because it has threads(from a C++ thread library) running inside.
> >>>>>
> >>>>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>>>
> >>>>
> >>>>
> >>>> Sorry I made a mistake.
> >>>>
> >>>> I mean from user's view, 
> >>>> he has to type list/del/stop from cmdlind to control the daemon.
> >>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>>>
> >>>> This is a little weird if we try to make a ubd library.
> >>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
> >>>
> >>> OK, that is fine to export APIs for admin purpose.
> >>>
> >>>>
> >>>>
> >>>>>>
> >>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>>>     The data flow is:
> >>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>>>     the second copy is unavoidable. 
> >>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>>>
> >>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>>>> ubdsrv does support buffer provided by io command, see:
> >>>>>
> >>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>
> >>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >>>> but you did not reply :)
> >>>>
> >>>> I paste it here:
> >>>>
> >>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>>>
> >>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> >>>
> >>> Usually posix_memalign just allocates virtual memory which is unlimited
> >>> in 64bit arch, and pages should be allocated until the buffer is read or write.
> >>> After the READ/WRITE is done, kernel still can reclaim the pages in this
> >>> virtual memory.
> >>>
> >>> In future, we still may optimize the memory uses via madvise, such as
> >>> MADV_DONTNEED, after the slot is idle for long enough.
> >>
> >> Ok, thanks for explanation. 
> >>
> >>>
> >>>>
> >>>> Another IMPORTANT problem is your commit:
> >>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>> may be not helpful for WRITE requests if I understand correctly.
> >>>>
> >>>> Consider this data flow:
> >>>>
> >>>> 1. ubdsrv commits an IO req(req1, a READ req).
> >>>>
> >>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>>>    addr1 is the addr of buffer user passed.
> >>>>    
> >>>>
> >>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>>>
> >>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>>>
> >>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>>>
> >>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>>>
> >>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>>>    cmd(with buffer set to addr1).
> >>>>
> >>>>
> >>>>
> >>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
> >>>
> >>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> >>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> >>> waste, but process virtual address is unlimited for 64bit arch, and
> >>> pages are allocated until actual read/write is started.
> >>
> >> Ok, since I allow users to config rq_max_blocks in libubd, 
> >> it's users' responsibility to ensure length of user buffers
> >> is at least rq_max_blocks.
> >>
> >> Now I agree on your commit:
> >> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>
> >> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> >>
> >>>
> >>>>>
> >>>>>>
> >>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>>>> It does not assume implementation of backend and can be embedded into it.
> >>>>>> We refer to the code structure of tcmu-runner[4], 
> >>>>>> which includes a library(libtcmu) for users 
> >>>>>> to embed tcmu-runner inside backend's code. 
> >>>>>> It:
> >>>>>>
> >>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>>>
> >>>>> That is because your backend may not use io_uring, I guess.
> >>>>>
> >>>>> But it is pretty easy to move the decision of creating pthread to target
> >>>>> code, which can be done in the interface of .prepare_target().
> >>>>
> >>>> I think the library should not create any thread if we want a libubd.
> >>>
> >>> I Agree.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>>>     and fetch/commit IO requests
> >>>>>
> >>>>> The above could be the main job of libubd.
> >>>>
> >>>> indeed.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>>>     since the backend actually has no knowledge 
> >>>>>>     on incoming data size until it gets an IO descriptor.
> >>>>>
> >>>>> I can understand your requirement, not look at your code yet, but libubd
> >>>>> should be pretty thin from function viewpoint, and there are lots of common
> >>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>>>
> >>>>> https://github.com/ming1/ubdsrv/commits/master
> >>>>>
> >>>>> in which:
> >>>>> 	- coroutine is added for handling target io
> >>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>>>> 	supporting complicated target
> >>>>> 	- c++ support
> >>>>
> >>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >>>> I think it is actually target(backend) design and ubd should not assume 
> >>>> how the backend handle IOs. 
> >>>>
> >>>> The work ubd in userspace has to be done is:
> >>>>
> >>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>>>
> >>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
> >>>
> >>> Or the user provides/registers two callbacks: handle_io_async() and
> >>> io_complete(), the former is called when one request comes from ubd
> >>> driver, the latter(optional) is called when one io is done.
> >>>
> >>> Also you didn't mention how you notify io_uring about io completion after
> >>> io_uring_enter() is slept if your backend code doesn't use io_uring to
> >>> handle io.
> >>>
> >>> I think one communication mechanism(such as eventfd) is needed for your
> >>> case.
> >>
> >> Ok, I will try eventfd with io_uring.
> >>
> >>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>>>> into ubdsrv easily. The potential users could be existed usersapce
> >>>>> block driver projects.
> >>>>
> >>>> Yes, so many userspace storage systems can use ubd!
> >>>> You may look at tcmu-runner. It:
> >>>>
> >>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>>>
> >>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>>>    for those who just want to run it. 
> >>>>    And the runner is build on top of libtcmu.
> >>>>
> >>>>>
> >>>>> If you don't object, I am happy to co-work with you to add the support
> >>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>>>
> >>>> +1 :)
> >>>
> >>> Thinking of further, I'd suggest to split ubdsrv into two parts:
> >>>
> >>> 1) libubdsrv
> >>> - provide APIs like what you did in libubd
> >>> - provide API for notify io_uring(handling io command) that one io is
> >>> completed, and the API should support handling IO from other context
> >>> (not same with the io_uring context for handling io command).
> >>>
> >>> 2) ubd target
> >>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
> >>> specific target implementation is built on the library too.
> >>>
> >>> It shouldn't be hard to work towards this direction, and I guess this
> >>> way should make current target implementation more clean.
> >>>
> >>
> >> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> >> Thanks, Ming.  Glad to co-work with you.
> >>
> >> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
> > 
> > I have added libublk branch for working towards this direction, if we
> > cowork on libublk, please write patch against this branch, then I can
> > apply your patch directly.
> > 
> > https://github.com/ming1/ubdsrv/tree/libublk
> 
> Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot
> because:
> 1) target implementation will be built on top of libubdsrv and the target
>    should create pthread(ubdsrv loop) itself.

That isn't hard, and meanwhile the ubdsrv project can still provide a generic
target abstraction.

> 
> 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv.
>    It was really a hard job. :-(

No, the default ubd devices created in the ubdsrv project still need the
daemon implementation, but the pthread/process (daemon) handling can be moved
out of libubd. And that shouldn't affect the libubd requirements, should it?
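
In other words, only something like the part below stays in the ubd command /
default targets, built on top of the library. This is just a sketch using
placeholder ubdlib_* names (not a real API; extern stubs are added so the
fragment compiles on its own), not real ubdsrv code:

#include <pthread.h>
#include <unistd.h>

/* placeholder library API, stubs only */
struct ubdlib_dev;
struct ubdlib_queue;
struct ubdlib_tgt_ops;
extern struct ubdlib_queue *ubdlib_get_queue(struct ubdlib_dev *dev, int idx);
extern int ubdlib_queue_loop(struct ubdlib_queue *q,
			     const struct ubdlib_tgt_ops *ops);
extern const struct ubdlib_tgt_ops loop_tgt_ops;

/* per-queue pthread created by the ubd command, not by the library */
static void *queue_thread_fn(void *arg)
{
	ubdlib_queue_loop(arg, &loop_tgt_ops);
	return NULL;
}

/* daemonization also lives in the command: the library itself never forks
 * or creates threads, so it can still be embedded in an existing backend */
static int ubd_start_daemon(struct ubdlib_dev *dev, int nr_queues)
{
	pthread_t tid[nr_queues];
	int i;

	if (daemon(0, 0))
		return -1;

	for (i = 0; i < nr_queues; i++)
		pthread_create(&tid[i], NULL, queue_thread_fn,
			       ubdlib_get_queue(dev, i));
	for (i = 0; i < nr_queues; i++)
		pthread_join(tid[i], NULL);
	return 0;
}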


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-06-30  9:29           ` Ziyang Zhang
  2022-06-30 11:45             ` Ming Lei
@ 2022-07-04  4:08             ` Ming Lei
  2022-07-04  9:49               ` Ziyang Zhang
  1 sibling, 1 reply; 12+ messages in thread
From: Ming Lei @ 2022-07-04  4:08 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi, ming.lei

On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote:
> On 2022/6/30 17:09, Ming Lei wrote:
> > On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> >> Hi, Ming
> >>
> >> On 2022/6/29 19:33, Ming Lei wrote:
> >>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >>>> Hi Ming,
> >>>>
> >>>> On 2022/6/27 23:29, Ming Lei wrote:
> >>>>> Hi Ziyang,
> >>>>>
> >>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>>>> Hi Ming,
> >>>>>>
> >>>>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>>>> This article explains why we need libubd and how we design it.
> >>>>>>
> >>>>>> Related threads:
> >>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>>>>>
> >>>>>>
> >>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>>>> allows users to define their own backend storage in userspace
> >>>>>> and provides block devices such as /dev/ubdbX.
> >>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>>>> and userspace code: ubdsrv[3].
> >>>>>>
> >>>>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>>>> is pretty well-designed.
> >>>>>>
> >>>>>> ubdsrv is implemented by a single daemon
> >>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>>>> is embedded in the daemon. 
> >>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>>>> by our backend.
> >>>>>
> >>>>> ubd is supposed to provide one generic framework for user space block
> >>>>> driver, and it can be used for doing lots of fun/useful thing.
> >>>>>
> >>>>> If I understand correctly, this isn't same with your use case:
> >>>>>
> >>>>> 1) your user space block driver isn't generic, and should be dedicated
> >>>>> for Alibaba's uses
> >>>>>
> >>>>> 2) your case has been there for long time, and you want to switch from other
> >>>>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>>>
> >>>>
> >>>> Yes, you are correct :)
> >>>> The idea of design libubd is actually from libtcmu.
> >>>>
> >>>> We do have some userspace storage system as the IO handling backend, 
> >>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>>>
> >>>>
> >>>> I think your motivation is that provides a complete user block driver to users
> >>>> and they DO NOT change any code.
> >>>> Users DO change their code using libubd for embedding libubd into the backend.
> >>>>
> >>>>
> >>>>>> First is description of our backend:
> >>>>>>
> >>>>>> (1) a distributing system sends/receives IO requests 
> >>>>>>     through network.
> >>>>>>
> >>>>>> (2) The system use RPC calls among hundreds of
> >>>>>>      storage servers and RPC calls are associated with data buffers
> >>>>>>      allocated from a memory pool.
> >>>>>>
> >>>>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>>>      many threads to handle IO requests and manage the device. 
> >>>>>>
> >>>>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>>>
> >>>>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>>>     uses something like RPC and does not support io_uring.
> >>>>>
> >>>>> As one generic framework, the io command has to be io_uring
> >>>>> passthrough, and the io doesn't have to be handled by io_uring.
> >>>>
> >>>> Yes, our backend define its own communicating method.
> >>>>
> >>>>>
> >>>>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>>>> (io uring) as the 1st citizen in the framework, especially for new
> >>>>> driver.
> >>>>>
> >>>>> But it can support other way really, such as use io_uring with eventfd,
> >>>>> the other userspace context can handle io, then wake up io_uring context
> >>>>> via eventfd. You may not use io_uring for handling io, but you still
> >>>>> need to communicate with the context for handling io_uring passthrough
> >>>>> command, and one mechanism(such as eventfd) has to be there for the
> >>>>> communication.
> >>>>
> >>>> Ok, eventfd may be helpful. 
> >>>> If you read my API, you may find ubdlib_complete_io_request().
> >>>> I think the backend io worker thread can call this function to tell the 
> >>>> ubd queue thread(the io_uring context in it) to commit the IO.
> >>>
> >>> The ubdlib_complete_io_request() has to be called in the same pthread
> >>> context, that looks not flexible. When you handle IO via non-io_uring in the same
> >>> context, the cpu utilization in submission/completion side should be
> >>> higher than io_uring. And this way should be worse than the usage in
> >>> ubd/loop, that is why I suggest to use one io_uring for handling both
> >>> io command and io request if possible.
> >>
> >> ubdlib_complete_io_request() can be called in the io worker thread,
> >> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> >>
> >> You can find ubd_runner.c in my libubd repo. There are many io worker
> >> threads for each ubdsrv queue to handle IO requests.
> >>
> >> Actually this idea comes from tcmu-runner. The data flow is:
> >>
> >> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> >>
> >> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
> >>    
> >>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
> >>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
> >>     
> >>    for other simple(can be handled very quickly), 
> >>    handle it right now and call ubdlib_complete_io_request()
> >>
> >> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
> >>    and setup sqe if one IO(IO completion) is ready to commit.
> >>    
> >>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> >>
> >>
> >> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
> >>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
> >>    
> >> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
> >>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
> >>    
> >>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
> >>    This function can mark this  IO req's status to ready to commit.
> >>
> >> IO handling/completion and io_uring_enter() can happen at the same time.
> >>
> >> Besides, io_uring_enter can:
> >>
> >> 1) block and wait for cqes until at least
> >> one blk-mq req comes from queue_rq()
> >>
> >> 2) submit sqes(with last IO completion and next fetch)
> >>
> >> so I have to consider how to notify io_uring about io completion 
> >> after io_uring_enter() is slept(block and wait for cqes).
> > 
> > Yeah, that was exactly my question, :-)
> > 
> >>
> >> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> >> io_uring_enter_timeout() and caller can set a timeout value for it.
> >> So IO completions happen after io_uring_enter_timeout() call can be committed
> >> by next io_uring_enter_timeout() call...
> >>
> >> But this is a very ugly implementation 
> >> because I may waste CPU on useless loops in ubdsrv queue thread if
> >> blk-mq reqs do not income frequently.
> >>
> >> You mentioned that eventfd may be helpful and I agree with you. :)
> >> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> >> in  ubdlib_complete_io_request().
> >>
> >> I will fix my code.
> > 
> > FYI, there is one example about using eventfd to wakeup io_uring, which
> > can be added to the library for your usecase:
> > 
> > https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
> 
> Thanks, will take a view.
> 
> > 
> >>
> >>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>>>     the daemon. It is inconvenient for our backend
> >>>>>>     because it has threads(from a C++ thread library) running inside.
> >>>>>
> >>>>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>>>
> >>>>
> >>>>
> >>>> Sorry I made a mistake.
> >>>>
> >>>> I mean from user's view, 
> >>>> he has to type list/del/stop from cmdlind to control the daemon.
> >>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>>>
> >>>> This is a little weird if we try to make a ubd library.
> >>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
> >>>
> >>> OK, that is fine to export APIs for admin purpose.
> >>>
> >>>>
> >>>>
> >>>>>>
> >>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>>>     The data flow is:
> >>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>>>     the second copy is unavoidable. 
> >>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>>>
> >>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>>>> ubdsrv does support buffer provided by io command, see:
> >>>>>
> >>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>
> >>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >>>> but you did not reply :)
> >>>>
> >>>> I paste it here:
> >>>>
> >>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>>>
> >>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> >>>
> >>> Usually posix_memalign just allocates virtual memory which is unlimited
> >>> in 64bit arch, and pages should be allocated until the buffer is read or write.
> >>> After the READ/WRITE is done, kernel still can reclaim the pages in this
> >>> virtual memory.
> >>>
> >>> In future, we still may optimize the memory uses via madvise, such as
> >>> MADV_DONTNEED, after the slot is idle for long enough.
> >>
> >> Ok, thanks for explanation. 
> >>
> >>>
> >>>>
> >>>> Another IMPORTANT problem is your commit:
> >>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>> may be not helpful for WRITE requests if I understand correctly.
> >>>>
> >>>> Consider this data flow:
> >>>>
> >>>> 1. ubdsrv commits an IO req(req1, a READ req).
> >>>>
> >>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>>>    addr1 is the addr of buffer user passed.
> >>>>    
> >>>>
> >>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>>>
> >>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>>>
> >>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>>>
> >>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>>>
> >>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>>>    cmd(with buffer set to addr1).
> >>>>
> >>>>
> >>>>
> >>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
> >>>
> >>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> >>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> >>> waste, but process virtual address is unlimited for 64bit arch, and
> >>> pages are allocated until actual read/write is started.
> >>
> >> Ok, since I allow users to config rq_max_blocks in libubd, 
> >> it's users' responsibility to ensure length of user buffers
> >> is at least rq_max_blocks.
> >>
> >> Now I agree on your commit:
> >> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>
> >> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> >>
> >>>
> >>>>>
> >>>>>>
> >>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>>>> It does not assume implementation of backend and can be embedded into it.
> >>>>>> We refer to the code structure of tcmu-runner[4], 
> >>>>>> which includes a library(libtcmu) for users 
> >>>>>> to embed tcmu-runner inside backend's code. 
> >>>>>> It:
> >>>>>>
> >>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>>>
> >>>>> That is because your backend may not use io_uring, I guess.
> >>>>>
> >>>>> But it is pretty easy to move the decision of creating pthread to target
> >>>>> code, which can be done in the interface of .prepare_target().
> >>>>
> >>>> I think the library should not create any thread if we want a libubd.
> >>>
> >>> I Agree.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>>>     and fetch/commit IO requests
> >>>>>
> >>>>> The above could be the main job of libubd.
> >>>>
> >>>> indeed.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>>>     since the backend actually has no knowledge 
> >>>>>>     on incoming data size until it gets an IO descriptor.
> >>>>>
> >>>>> I can understand your requirement, not look at your code yet, but libubd
> >>>>> should be pretty thin from function viewpoint, and there are lots of common
> >>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>>>
> >>>>> https://github.com/ming1/ubdsrv/commits/master
> >>>>>
> >>>>> in which:
> >>>>> 	- coroutine is added for handling target io
> >>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>>>> 	supporting complicated target
> >>>>> 	- c++ support
> >>>>
> >>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >>>> I think it is actually target(backend) design and ubd should not assume 
> >>>> how the backend handle IOs. 
> >>>>
> >>>> The work ubd in userspace has to be done is:
> >>>>
> >>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>>>
> >>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
> >>>
> >>> Or the user provides/registers two callbacks: handle_io_async() and
> >>> io_complete(), the former is called when one request comes from ubd
> >>> driver, the latter(optional) is called when one io is done.
> >>>
> >>> Also you didn't mention how you notify io_uring about io completion after
> >>> io_uring_enter() is slept if your backend code doesn't use io_uring to
> >>> handle io.
> >>>
> >>> I think one communication mechanism(such as eventfd) is needed for your
> >>> case.
> >>
> >> Ok, I will try eventfd with io_uring.
> >>
> >>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>>>> into ubdsrv easily. The potential users could be existed usersapce
> >>>>> block driver projects.
> >>>>
> >>>> Yes, so many userspace storage systems can use ubd!
> >>>> You may look at tcmu-runner. It:
> >>>>
> >>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>>>
> >>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>>>    for those who just want to run it. 
> >>>>    And the runner is build on top of libtcmu.
> >>>>
> >>>>>
> >>>>> If you don't object, I am happy to co-work with you to add the support
> >>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>>>
> >>>> +1 :)
> >>>
> >>> Thinking of further, I'd suggest to split ubdsrv into two parts:
> >>>
> >>> 1) libubdsrv
> >>> - provide APIs like what you did in libubd
> >>> - provide API for notify io_uring(handling io command) that one io is
> >>> completed, and the API should support handling IO from other context
> >>> (not same with the io_uring context for handling io command).
> >>>
> >>> 2) ubd target
> >>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
> >>> specific target implementation is built on the library too.
> >>>
> >>> It shouldn't be hard to work towards this direction, and I guess this
> >>> way should make current target implementation more clean.
> >>>
> >>
> >> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> >> Thanks, Ming.  Glad to co-work with you.
> >>
> >> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
> > 
> > I have added libublk branch for working towards this direction, if we
> > cowork on libublk, please write patch against this branch, then I can
> > apply your patch directly.
> > 
> > https://github.com/ming1/ubdsrv/tree/libublk
> 
> Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot
> because:
> 1) target implementation will be built on top of libubdsrv and the target
>    should create pthread(ubdsrv loop) itself.
> 
> 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv.
>    It was really a hard job. :-(

Neither of the two is hard to do, and it turns out that making libubdsrv is
actually one big cleanup.

All of this work[1] is basically done:

1) libublksrv
- the built .so and .a are under lib/
- the exported header file is include/ublksrv.h
- so any other application can create a ublk device against this library
- eventfd notification is added too, so io handling doesn't have to
be done via io_uring: one ->handle_event() callback and two APIs are
added for this support (a small sketch of the eventfd pattern follows
the examples below)

2) ublk/ubd utility
- built against libublksrv; it also uses the library's private header,
which is fine since the two live in the same project

3) two examples
- demo_null.c: a standalone example (< 200 LOC) showing how to make
a ubd/null block device against libublksrv

- demo_event.c: a simple standalone example (~300 LOC) that makes one
ublk disk by handling io in another pthread (not via io_uring) against
libublksrv
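
In case it helps, the eventfd path boils down to the pattern in the
plain-C sketch below (liburing + eventfd(2) only; this is just an
illustration and not the libublksrv API itself, and worker() stands in
for whatever context handles the io):

/* Minimal sketch of the eventfd wakeup pattern (illustration only, not
 * libublksrv code): the queue thread keeps a poll on the eventfd in
 * flight next to its io command sqes, so a worker thread can wake it
 * out of io_uring_enter().
 */
#include <liburing.h>
#include <sys/eventfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>

static struct io_uring ring;
static int efd;

static void *worker(void *arg)
{
	/* ...handle one io request here (RPC, pread/pwrite, ...)... */
	eventfd_write(efd, 1);		/* wake the queue thread */
	return NULL;
}

int main(void)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	eventfd_t ev;
	pthread_t tid;

	io_uring_queue_init(8, &ring, 0);
	efd = eventfd(0, 0);

	/* queue thread side: arm a poll on the eventfd, then wait as usual */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, efd, POLLIN);
	io_uring_submit(&ring);

	pthread_create(&tid, NULL, worker, NULL);

	/* returns either for an io command cqe or for the eventfd poll */
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);
	eventfd_read(efd, &ev);		/* clear the event, then commit ready ios */
	printf("woken up, %llu completion event(s)\n", (unsigned long long)ev);

	pthread_join(tid, NULL);
	io_uring_queue_exit(&ring);
	return 0;
}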

Any comments/feedback/tests are welcome.


[1] libublk
https://github.com/ming1/ubdsrv/tree/libublk


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-07-04  4:08             ` Ming Lei
@ 2022-07-04  9:49               ` Ziyang Zhang
  2022-07-04 11:03                 ` Ming Lei
  2022-07-04 13:50                 ` Ming Lei
  0 siblings, 2 replies; 12+ messages in thread
From: Ziyang Zhang @ 2022-07-04  9:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi

On 2022/7/4 12:08, Ming Lei wrote:
> On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote:
>> On 2022/6/30 17:09, Ming Lei wrote:
>>> On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
>>>> Hi, Ming
>>>>
>>>> On 2022/6/29 19:33, Ming Lei wrote:
>>>>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
>>>>>> Hi Ming,
>>>>>>
>>>>>> On 2022/6/27 23:29, Ming Lei wrote:
>>>>>>> Hi Ziyang,
>>>>>>>
>>>>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
>>>>>>>> Hi Ming,
>>>>>>>>
>>>>>>>> We are learning your ubd code and developing a library: libubd for ubd.
>>>>>>>> This article explains why we need libubd and how we design it.
>>>>>>>>
>>>>>>>> Related threads:
>>>>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
>>>>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
>>>>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
>>>>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
>>>>>>>>
>>>>>>>>
>>>>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
>>>>>>>> allows users to define their own backend storage in userspace
>>>>>>>> and provides block devices such as /dev/ubdbX.
>>>>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
>>>>>>>> and userspace code: ubdsrv[3].
>>>>>>>>
>>>>>>>> ubd_drv.c simply passes all blk-mq IO requests
>>>>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
>>>>>>>> is pretty well-designed.
>>>>>>>>
>>>>>>>> ubdsrv is implemented by a single daemon
>>>>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
>>>>>>>> is embedded in the daemon. 
>>>>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
>>>>>>>> by our backend.
>>>>>>>
>>>>>>> ubd is supposed to provide one generic framework for user space block
>>>>>>> driver, and it can be used for doing lots of fun/useful thing.
>>>>>>>
>>>>>>> If I understand correctly, this isn't same with your use case:
>>>>>>>
>>>>>>> 1) your user space block driver isn't generic, and should be dedicated
>>>>>>> for Alibaba's uses
>>>>>>>
>>>>>>> 2) your case has been there for long time, and you want to switch from other
>>>>>>> approach(maybe tcmu) to ubd given ubd has better performance.
>>>>>>>
>>>>>>
>>>>>> Yes, you are correct :)
>>>>>> The idea of design libubd is actually from libtcmu.
>>>>>>
>>>>>> We do have some userspace storage system as the IO handling backend, 
>>>>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
>>>>>>
>>>>>>
>>>>>> I think your motivation is that provides a complete user block driver to users
>>>>>> and they DO NOT change any code.
>>>>>> Users DO change their code using libubd for embedding libubd into the backend.
>>>>>>
>>>>>>
>>>>>>>> First is description of our backend:
>>>>>>>>
>>>>>>>> (1) a distributing system sends/receives IO requests 
>>>>>>>>     through network.
>>>>>>>>
>>>>>>>> (2) The system use RPC calls among hundreds of
>>>>>>>>      storage servers and RPC calls are associated with data buffers
>>>>>>>>      allocated from a memory pool.
>>>>>>>>
>>>>>>>> (3) On each server for each device(/dev/vdX), our backend runs
>>>>>>>>      many threads to handle IO requests and manage the device. 
>>>>>>>>
>>>>>>>> Second are reasons why ubdsrv is hard to use for us:
>>>>>>>>
>>>>>>>> (1) ubdsrv requires the target(backend) issues IO requests
>>>>>>>>     to the io_uring provided by ubdsrv but our backend 
>>>>>>>>     uses something like RPC and does not support io_uring.
>>>>>>>
>>>>>>> As one generic framework, the io command has to be io_uring
>>>>>>> passthrough, and the io doesn't have to be handled by io_uring.
>>>>>>
>>>>>> Yes, our backend define its own communicating method.
>>>>>>
>>>>>>>
>>>>>>> But IMO io_uring is much more efficient, so I'd try to make async io
>>>>>>> (io uring) as the 1st citizen in the framework, especially for new
>>>>>>> driver.
>>>>>>>
>>>>>>> But it can support other way really, such as use io_uring with eventfd,
>>>>>>> the other userspace context can handle io, then wake up io_uring context
>>>>>>> via eventfd. You may not use io_uring for handling io, but you still
>>>>>>> need to communicate with the context for handling io_uring passthrough
>>>>>>> command, and one mechanism(such as eventfd) has to be there for the
>>>>>>> communication.
>>>>>>
>>>>>> Ok, eventfd may be helpful. 
>>>>>> If you read my API, you may find ubdlib_complete_io_request().
>>>>>> I think the backend io worker thread can call this function to tell the 
>>>>>> ubd queue thread(the io_uring context in it) to commit the IO.
>>>>>
>>>>> The ubdlib_complete_io_request() has to be called in the same pthread
>>>>> context, that looks not flexible. When you handle IO via non-io_uring in the same
>>>>> context, the cpu utilization in submission/completion side should be
>>>>> higher than io_uring. And this way should be worse than the usage in
>>>>> ubd/loop, that is why I suggest to use one io_uring for handling both
>>>>> io command and io request if possible.
>>>>
>>>> ubdlib_complete_io_request() can be called in the io worker thread,
>>>> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
>>>>
>>>> You can find ubd_runner.c in my libubd repo. There are many io worker
>>>> threads for each ubdsrv queue to handle IO requests.
>>>>
>>>> Actually this idea comes from tcmu-runner. The data flow is:
>>>>
>>>> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
>>>>
>>>> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
>>>>    
>>>>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
>>>>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
>>>>     
>>>>    for other simple(can be handled very quickly), 
>>>>    handle it right now and call ubdlib_complete_io_request()
>>>>
>>>> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
>>>>    and setup sqe if one IO(IO completion) is ready to commit.
>>>>    
>>>>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
>>>>
>>>>
>>>> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
>>>>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
>>>>    
>>>> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
>>>>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
>>>>    
>>>>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
>>>>    This function can mark this  IO req's status to ready to commit.
>>>>
>>>> IO handling/completion and io_uring_enter() can happen at the same time.
>>>>
>>>> Besides, io_uring_enter can:
>>>>
>>>> 1) block and wait for cqes until at least
>>>> one blk-mq req comes from queue_rq()
>>>>
>>>> 2) submit sqes(with last IO completion and next fetch)
>>>>
>>>> so I have to consider how to notify io_uring about io completion 
>>>> after io_uring_enter() is slept(block and wait for cqes).
>>>
>>> Yeah, that was exactly my question, :-)
>>>
>>>>
>>>> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
>>>> io_uring_enter_timeout() and caller can set a timeout value for it.
>>>> So IO completions happen after io_uring_enter_timeout() call can be committed
>>>> by next io_uring_enter_timeout() call...
>>>>
>>>> But this is a very ugly implementation 
>>>> because I may waste CPU on useless loops in ubdsrv queue thread if
>>>> blk-mq reqs do not income frequently.
>>>>
>>>> You mentioned that eventfd may be helpful and I agree with you. :)
>>>> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
>>>> in  ubdlib_complete_io_request().
>>>>
>>>> I will fix my code.
>>>
>>> FYI, there is one example about using eventfd to wakeup io_uring, which
>>> can be added to the library for your usecase:
>>>
>>> https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
>>
>> Thanks, will take a view.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> (2) ubdsrv forks a daemon and it takes over everything.
>>>>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
>>>>>>>>     the daemon. It is inconvenient for our backend
>>>>>>>>     because it has threads(from a C++ thread library) running inside.
>>>>>>>
>>>>>>> No, list/stop/del won't interact with the daemon, and the per-queue
>>>>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry I made a mistake.
>>>>>>
>>>>>> I mean from user's view, 
>>>>>> he has to type list/del/stop from cmdlind to control the daemon.
>>>>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
>>>>>>
>>>>>> This is a little weird if we try to make a ubd library.
>>>>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
>>>>>
>>>>> OK, that is fine to export APIs for admin purpose.
>>>>>
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
>>>>>>>>     The data flow is:
>>>>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
>>>>>>>>     Since ubdsrv does not export its internal data buffer to backend,
>>>>>>>>     the second copy is unavoidable. 
>>>>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
>>>>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
>>>>>>>
>>>>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
>>>>>>> ubdsrv does support buffer provided by io command, see:
>>>>>>>
>>>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>>>>>
>>>>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
>>>>>> but you did not reply :)
>>>>>>
>>>>>> I paste it here:
>>>>>>
>>>>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
>>>>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
>>>>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
>>>>>>
>>>>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
>>>>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
>>>>>
>>>>> Usually posix_memalign just allocates virtual memory which is unlimited
>>>>> in 64bit arch, and pages should be allocated until the buffer is read or write.
>>>>> After the READ/WRITE is done, kernel still can reclaim the pages in this
>>>>> virtual memory.
>>>>>
>>>>> In future, we still may optimize the memory uses via madvise, such as
>>>>> MADV_DONTNEED, after the slot is idle for long enough.
>>>>
>>>> Ok, thanks for explanation. 
>>>>
>>>>>
>>>>>>
>>>>>> Another IMPORTANT problem is your commit:
>>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>>>>> may be not helpful for WRITE requests if I understand correctly.
>>>>>>
>>>>>> Consider this data flow:
>>>>>>
>>>>>> 1. ubdsrv commits an IO req(req1, a READ req).
>>>>>>
>>>>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
>>>>>>    addr1 is the addr of buffer user passed.
>>>>>>    
>>>>>>
>>>>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
>>>>>>
>>>>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
>>>>>>
>>>>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
>>>>>>
>>>>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
>>>>>>
>>>>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
>>>>>>    cmd(with buffer set to addr1).
>>>>>>
>>>>>>
>>>>>>
>>>>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
>>>>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
>>>>>
>>>>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
>>>>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
>>>>> waste, but process virtual address is unlimited for 64bit arch, and
>>>>> pages are allocated until actual read/write is started.
>>>>
>>>> Ok, since I allow users to config rq_max_blocks in libubd, 
>>>> it's users' responsibility to ensure length of user buffers
>>>> is at least rq_max_blocks.
>>>>
>>>> Now I agree on your commit:
>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
>>>>
>>>> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
>>>>>>>> It does not assume implementation of backend and can be embedded into it.
>>>>>>>> We refer to the code structure of tcmu-runner[4], 
>>>>>>>> which includes a library(libtcmu) for users 
>>>>>>>> to embed tcmu-runner inside backend's code. 
>>>>>>>> It:
>>>>>>>>
>>>>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
>>>>>>>
>>>>>>> That is because your backend may not use io_uring, I guess.
>>>>>>>
>>>>>>> But it is pretty easy to move the decision of creating pthread to target
>>>>>>> code, which can be done in the interface of .prepare_target().
>>>>>>
>>>>>> I think the library should not create any thread if we want a libubd.
>>>>>
>>>>> I Agree.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
>>>>>>>>     and fetch/commit IO requests
>>>>>>>
>>>>>>> The above could be the main job of libubd.
>>>>>>
>>>>>> indeed.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
>>>>>>>>     since the backend actually has no knowledge 
>>>>>>>>     on incoming data size until it gets an IO descriptor.
>>>>>>>
>>>>>>> I can understand your requirement, not look at your code yet, but libubd
>>>>>>> should be pretty thin from function viewpoint, and there are lots of common
>>>>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
>>>>>>>
>>>>>>> https://github.com/ming1/ubdsrv/commits/master
>>>>>>>
>>>>>>> in which:
>>>>>>> 	- coroutine is added for handling target io
>>>>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
>>>>>>> 	supporting complicated target
>>>>>>> 	- c++ support
>>>>>>
>>>>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
>>>>>> I think it is actually target(backend) design and ubd should not assume 
>>>>>> how the backend handle IOs. 
>>>>>>
>>>>>> The work ubd in userspace has to be done is:
>>>>>>
>>>>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
>>>>>>
>>>>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
>>>>>
>>>>> Or the user provides/registers two callbacks: handle_io_async() and
>>>>> io_complete(), the former is called when one request comes from ubd
>>>>> driver, the latter(optional) is called when one io is done.
>>>>>
>>>>> Also you didn't mention how you notify io_uring about io completion after
>>>>> io_uring_enter() is slept if your backend code doesn't use io_uring to
>>>>> handle io.
>>>>>
>>>>> I think one communication mechanism(such as eventfd) is needed for your
>>>>> case.
>>>>
>>>> Ok, I will try eventfd with io_uring.
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
>>>>>>> into ubdsrv easily. The potential users could be existed usersapce
>>>>>>> block driver projects.
>>>>>>
>>>>>> Yes, so many userspace storage systems can use ubd!
>>>>>> You may look at tcmu-runner. It:
>>>>>>
>>>>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
>>>>>>
>>>>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
>>>>>>    for those who just want to run it. 
>>>>>>    And the runner is build on top of libtcmu.
>>>>>>
>>>>>>>
>>>>>>> If you don't object, I am happy to co-work with you to add the support
>>>>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
>>>>>>
>>>>>> +1 :)
>>>>>
>>>>> Thinking of further, I'd suggest to split ubdsrv into two parts:
>>>>>
>>>>> 1) libubdsrv
>>>>> - provide APIs like what you did in libubd
>>>>> - provide API for notify io_uring(handling io command) that one io is
>>>>> completed, and the API should support handling IO from other context
>>>>> (not same with the io_uring context for handling io command).
>>>>>
>>>>> 2) ubd target
>>>>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
>>>>> specific target implementation is built on the library too.
>>>>>
>>>>> It shouldn't be hard to work towards this direction, and I guess this
>>>>> way should make current target implementation more clean.
>>>>>
>>>>
>>>> Yes, this is like tcmu-runner's structure: a libtcmu and some target
>>>> Thanks, Ming.  Glad to co-work with you.
>>>>
>>>> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
>>>
>>> I have added libublk branch for working towards this direction, if we
>>> cowork on libublk, please write patch against this branch, then I can
>>> apply your patch directly.
>>>
>>> https://github.com/ming1/ubdsrv/tree/libublk
>>
>> Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot
>> because:
>> 1) target implementation will be built on top of libubdsrv and the target
>>    should create pthread(ubdsrv loop) itself.
>>
>> 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv.
>>    It was really a hard job. :-(
> 
> Both the two are not hard to do, and turns out that making libubdsrv is actually
> one big cleanup.
> 
> All these works[1] are basically done:
> 
> 1) libublksrv
> - built .so and .a are under lib/
> - exported header file is include/ublksrv.h
> - so any other application can make ublk device against this library
> - eventfd notification is added too, so io handling doesn't have to
> be done via io_uring, one callback of ->handle_event(), and two APIs
> are added for this support
> 
> 2) ublk/ubd utility
> - built against libublksrv, meantime it uses the private header of the
> library too, which is fine, since the two are in same project
> 
> 3) two examples
> - demo_null.c: one < 200 LOC standalone example to show how to make
> a ubd/null block device against libublksrv
> 
> - demo_event.c: one simple standalone example(~300LOC) to make one ublk
> disk by handling io via another pthread(not by io_uring) against
> libublksrv
> 
> Any comments/feedback/tests are welcome.
> 
> 
> [1] libublk
> https://github.com/ming1/ubdsrv/tree/libublk
> 

Hi Ming,

Thanks for your libublk code.
You have almost done everything we discussed before :)

One small question:
Could you please write another example, demo_event_loop.c:

1) based on demo_event.c (handling io via another pthread),
   but the loop backend is an fd (a file); the user calls pread/pwrite on that fd.
   
2) user-provided buffer: the data buffer is allocated by the backend
   (not by posix_memalign() in ublksrv.c)
   and is passed to ublksrv (and finally to ubd_drv in the kernel) via libublksrv's API
   (maybe as an argument to ublksrv_complete_io()?)

3) I think the pread/pwrite calls should live in demo_event_real_io_handler_fn(),
   covering all IOs on each pending list. Am I correct?

4) Shall we consider multiple pthreads calling demo_event_real_io_handler_fn()
   (they are actually io worker threads)
   so we can handle IO concurrently?


I think working through more examples will help us pin down libublk's API
more clearly; a rough sketch of the worker-side handling I have in mind
follows below.
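
The sketch (the io descriptor layout and complete_io() below are
placeholders I made up; the real types/APIs would come from libublksrv):

/* Sketch only: struct io_desc and complete_io() are placeholders,
 * not the real libublksrv types/APIs.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct io_desc {
	unsigned int op;		/* 0 == READ, 1 == WRITE */
	unsigned long long off;		/* byte offset into the backing file */
	unsigned int len;		/* transfer length in bytes */
	int tag;
};

/* placeholder for an ublksrv_complete_io()-style call: hand the result
 * (and the backend-owned buffer) back to the queue thread for the commit sqe
 */
static void complete_io(int tag, int res, void *buf)
{
	printf("io tag %d done, res %d, buf %p\n", tag, res, buf);
}

/* runs in an io worker pthread, i.e. on the demo_event_real_io_handler_fn() side */
static void handle_one_io(int backing_fd, const struct io_desc *iod, void *buf)
{
	ssize_t ret;

	/* buf is the backend buffer already bound to this io slot: for WRITE
	 * the driver has copied the bio data into it in advance, for READ we
	 * fill it here and the driver copies it out when the io is committed
	 */
	if (iod->op == 0)
		ret = pread(backing_fd, buf, iod->len, iod->off);
	else
		ret = pwrite(backing_fd, buf, iod->len, iod->off);

	complete_io(iod->tag, ret < 0 ? -errno : (int)ret, buf);
}

int main(void)
{
	struct io_desc iod = { .op = 0, .off = 0, .len = 4096, .tag = 0 };
	static char buf[4096];
	int fd = open("loop.img", O_RDONLY);	/* hypothetical backing file */

	if (fd >= 0)
		handle_one_io(fd, &iod, buf);
	return 0;
}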

> 
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-07-04  9:49               ` Ziyang Zhang
@ 2022-07-04 11:03                 ` Ming Lei
  2022-07-04 13:50                 ` Ming Lei
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-07-04 11:03 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi, ming.lei

On Mon, Jul 04, 2022 at 05:49:34PM +0800, Ziyang Zhang wrote:
> On 2022/7/4 12:08, Ming Lei wrote:
> > On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote:
> >> On 2022/6/30 17:09, Ming Lei wrote:
> >>> On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> >>>> Hi, Ming
> >>>>
> >>>> On 2022/6/29 19:33, Ming Lei wrote:
> >>>>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >>>>>> Hi Ming,
> >>>>>>
> >>>>>> On 2022/6/27 23:29, Ming Lei wrote:
> >>>>>>> Hi Ziyang,
> >>>>>>>
> >>>>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>>>>>> Hi Ming,
> >>>>>>>>
> >>>>>>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>>>>>> This article explains why we need libubd and how we design it.
> >>>>>>>>
> >>>>>>>> Related threads:
> >>>>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >>>>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>>>>>> allows users to define their own backend storage in userspace
> >>>>>>>> and provides block devices such as /dev/ubdbX.
> >>>>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>>>>>> and userspace code: ubdsrv[3].
> >>>>>>>>
> >>>>>>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>>>>>> is pretty well-designed.
> >>>>>>>>
> >>>>>>>> ubdsrv is implemented by a single daemon
> >>>>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>>>>>> is embedded in the daemon. 
> >>>>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>>>>>> by our backend.
> >>>>>>>
> >>>>>>> ubd is supposed to provide one generic framework for user space block
> >>>>>>> driver, and it can be used for doing lots of fun/useful thing.
> >>>>>>>
> >>>>>>> If I understand correctly, this isn't same with your use case:
> >>>>>>>
> >>>>>>> 1) your user space block driver isn't generic, and should be dedicated
> >>>>>>> for Alibaba's uses
> >>>>>>>
> >>>>>>> 2) your case has been there for long time, and you want to switch from other
> >>>>>>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, you are correct :)
> >>>>>> The idea of design libubd is actually from libtcmu.
> >>>>>>
> >>>>>> We do have some userspace storage system as the IO handling backend, 
> >>>>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>>>>>
> >>>>>>
> >>>>>> I think your motivation is that provides a complete user block driver to users
> >>>>>> and they DO NOT change any code.
> >>>>>> Users DO change their code using libubd for embedding libubd into the backend.
> >>>>>>
> >>>>>>
> >>>>>>>> First is description of our backend:
> >>>>>>>>
> >>>>>>>> (1) a distributing system sends/receives IO requests 
> >>>>>>>>     through network.
> >>>>>>>>
> >>>>>>>> (2) The system use RPC calls among hundreds of
> >>>>>>>>      storage servers and RPC calls are associated with data buffers
> >>>>>>>>      allocated from a memory pool.
> >>>>>>>>
> >>>>>>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>>>>>      many threads to handle IO requests and manage the device. 
> >>>>>>>>
> >>>>>>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>>>>>
> >>>>>>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>>>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>>>>>     uses something like RPC and does not support io_uring.
> >>>>>>>
> >>>>>>> As one generic framework, the io command has to be io_uring
> >>>>>>> passthrough, and the io doesn't have to be handled by io_uring.
> >>>>>>
> >>>>>> Yes, our backend define its own communicating method.
> >>>>>>
> >>>>>>>
> >>>>>>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>>>>>> (io uring) as the 1st citizen in the framework, especially for new
> >>>>>>> driver.
> >>>>>>>
> >>>>>>> But it can support other way really, such as use io_uring with eventfd,
> >>>>>>> the other userspace context can handle io, then wake up io_uring context
> >>>>>>> via eventfd. You may not use io_uring for handling io, but you still
> >>>>>>> need to communicate with the context for handling io_uring passthrough
> >>>>>>> command, and one mechanism(such as eventfd) has to be there for the
> >>>>>>> communication.
> >>>>>>
> >>>>>> Ok, eventfd may be helpful. 
> >>>>>> If you read my API, you may find ubdlib_complete_io_request().
> >>>>>> I think the backend io worker thread can call this function to tell the 
> >>>>>> ubd queue thread(the io_uring context in it) to commit the IO.
> >>>>>
> >>>>> The ubdlib_complete_io_request() has to be called in the same pthread
> >>>>> context, that looks not flexible. When you handle IO via non-io_uring in the same
> >>>>> context, the cpu utilization in submission/completion side should be
> >>>>> higher than io_uring. And this way should be worse than the usage in
> >>>>> ubd/loop, that is why I suggest to use one io_uring for handling both
> >>>>> io command and io request if possible.
> >>>>
> >>>> ubdlib_complete_io_request() can be called in the io worker thread,
> >>>> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> >>>>
> >>>> You can find ubd_runner.c in my libubd repo. There are many io worker
> >>>> threads for each ubdsrv queue to handle IO requests.
> >>>>
> >>>> Actually this idea comes from tcmu-runner. The data flow is:
> >>>>
> >>>> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> >>>>
> >>>> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
> >>>>    
> >>>>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
> >>>>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
> >>>>     
> >>>>    for other simple(can be handled very quickly), 
> >>>>    handle it right now and call ubdlib_complete_io_request()
> >>>>
> >>>> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
> >>>>    and setup sqe if one IO(IO completion) is ready to commit.
> >>>>    
> >>>>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> >>>>
> >>>>
> >>>> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
> >>>>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
> >>>>    
> >>>> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
> >>>>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
> >>>>    
> >>>>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
> >>>>    This function can mark this  IO req's status to ready to commit.
> >>>>
> >>>> IO handling/completion and io_uring_enter() can happen at the same time.
> >>>>
> >>>> Besides, io_uring_enter can:
> >>>>
> >>>> 1) block and wait for cqes until at least
> >>>> one blk-mq req comes from queue_rq()
> >>>>
> >>>> 2) submit sqes(with last IO completion and next fetch)
> >>>>
> >>>> so I have to consider how to notify io_uring about io completion 
> >>>> after io_uring_enter() is slept(block and wait for cqes).
> >>>
> >>> Yeah, that was exactly my question, :-)
> >>>
> >>>>
> >>>> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> >>>> io_uring_enter_timeout() and caller can set a timeout value for it.
> >>>> So IO completions happen after io_uring_enter_timeout() call can be committed
> >>>> by next io_uring_enter_timeout() call...
> >>>>
> >>>> But this is a very ugly implementation 
> >>>> because I may waste CPU on useless loops in ubdsrv queue thread if
> >>>> blk-mq reqs do not income frequently.
> >>>>
> >>>> You mentioned that eventfd may be helpful and I agree with you. :)
> >>>> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> >>>> in  ubdlib_complete_io_request().
> >>>>
> >>>> I will fix my code.
> >>>
> >>> FYI, there is one example about using eventfd to wakeup io_uring, which
> >>> can be added to the library for your usecase:
> >>>
> >>> https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
> >>
> >> Thanks, will take a view.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>>>>>     the daemon. It is inconvenient for our backend
> >>>>>>>>     because it has threads(from a C++ thread library) running inside.
> >>>>>>>
> >>>>>>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>>>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Sorry I made a mistake.
> >>>>>>
> >>>>>> I mean from user's view, 
> >>>>>> he has to type list/del/stop from cmdlind to control the daemon.
> >>>>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>>>>>
> >>>>>> This is a little weird if we try to make a ubd library.
> >>>>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
> >>>>>
> >>>>> OK, that is fine to export APIs for admin purpose.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>>>>>     The data flow is:
> >>>>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>>>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>>>>>     the second copy is unavoidable. 
> >>>>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>>>>>
> >>>>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>>>>>> ubdsrv does support buffer provided by io command, see:
> >>>>>>>
> >>>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>>>
> >>>>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >>>>>> but you did not reply :)
> >>>>>>
> >>>>>> I paste it here:
> >>>>>>
> >>>>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >>>>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >>>>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>>>>>
> >>>>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >>>>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> >>>>>
> >>>>> Usually posix_memalign just allocates virtual memory which is unlimited
> >>>>> in 64bit arch, and pages should be allocated until the buffer is read or write.
> >>>>> After the READ/WRITE is done, kernel still can reclaim the pages in this
> >>>>> virtual memory.
> >>>>>
> >>>>> In future, we still may optimize the memory uses via madvise, such as
> >>>>> MADV_DONTNEED, after the slot is idle for long enough.
> >>>>
> >>>> Ok, thanks for explanation. 
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Another IMPORTANT problem is your commit:
> >>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>>> may be not helpful for WRITE requests if I understand correctly.
> >>>>>>
> >>>>>> Consider this data flow:
> >>>>>>
> >>>>>> 1. ubdsrv commits an IO req(req1, a READ req).
> >>>>>>
> >>>>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>>>>>    addr1 is the addr of buffer user passed.
> >>>>>>    
> >>>>>>
> >>>>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>>>>>
> >>>>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>>>>>
> >>>>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>>>>>
> >>>>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>>>>>
> >>>>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>>>>>    cmd(with buffer set to addr1).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >>>>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
> >>>>>
> >>>>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> >>>>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> >>>>> waste, but process virtual address is unlimited for 64bit arch, and
> >>>>> pages are allocated until actual read/write is started.
> >>>>
> >>>> Ok, since I allow users to config rq_max_blocks in libubd, 
> >>>> it's users' responsibility to ensure length of user buffers
> >>>> is at least rq_max_blocks.
> >>>>
> >>>> Now I agree on your commit:
> >>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>
> >>>> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>>>>>> It does not assume implementation of backend and can be embedded into it.
> >>>>>>>> We refer to the code structure of tcmu-runner[4], 
> >>>>>>>> which includes a library(libtcmu) for users 
> >>>>>>>> to embed tcmu-runner inside backend's code. 
> >>>>>>>> It:
> >>>>>>>>
> >>>>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>>>>>
> >>>>>>> That is because your backend may not use io_uring, I guess.
> >>>>>>>
> >>>>>>> But it is pretty easy to move the decision of creating pthread to target
> >>>>>>> code, which can be done in the interface of .prepare_target().
> >>>>>>
> >>>>>> I think the library should not create any thread if we want a libubd.
> >>>>>
> >>>>> I Agree.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>>>>>     and fetch/commit IO requests
> >>>>>>>
> >>>>>>> The above could be the main job of libubd.
> >>>>>>
> >>>>>> indeed.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>>>>>     since the backend actually has no knowledge 
> >>>>>>>>     on incoming data size until it gets an IO descriptor.
> >>>>>>>
> >>>>>>> I can understand your requirement, not look at your code yet, but libubd
> >>>>>>> should be pretty thin from function viewpoint, and there are lots of common
> >>>>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>>>>>
> >>>>>>> https://github.com/ming1/ubdsrv/commits/master
> >>>>>>>
> >>>>>>> in which:
> >>>>>>> 	- coroutine is added for handling target io
> >>>>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>>>>>> 	supporting complicated target
> >>>>>>> 	- c++ support
> >>>>>>
> >>>>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >>>>>> I think it is actually target(backend) design and ubd should not assume 
> >>>>>> how the backend handle IOs. 
> >>>>>>
> >>>>>> The work ubd in userspace has to be done is:
> >>>>>>
> >>>>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>>>>>
> >>>>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
> >>>>>
> >>>>> Or the user provides/registers two callbacks: handle_io_async() and
> >>>>> io_complete(), the former is called when one request comes from ubd
> >>>>> driver, the latter(optional) is called when one io is done.
> >>>>>
> >>>>> Also you didn't mention how you notify io_uring about io completion after
> >>>>> io_uring_enter() is slept if your backend code doesn't use io_uring to
> >>>>> handle io.
> >>>>>
> >>>>> I think one communication mechanism(such as eventfd) is needed for your
> >>>>> case.
> >>>>
> >>>> Ok, I will try eventfd with io_uring.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>>>>>> into ubdsrv easily. The potential users could be existed usersapce
> >>>>>>> block driver projects.
> >>>>>>
> >>>>>> Yes, so many userspace storage systems can use ubd!
> >>>>>> You may look at tcmu-runner. It:
> >>>>>>
> >>>>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>>>>>
> >>>>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>>>>>    for those who just want to run it. 
> >>>>>>    And the runner is build on top of libtcmu.
> >>>>>>
> >>>>>>>
> >>>>>>> If you don't object, I am happy to co-work with you to add the support
> >>>>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>>>>>
> >>>>>> +1 :)
> >>>>>
> >>>>> Thinking of further, I'd suggest to split ubdsrv into two parts:
> >>>>>
> >>>>> 1) libubdsrv
> >>>>> - provide APIs like what you did in libubd
> >>>>> - provide API for notify io_uring(handling io command) that one io is
> >>>>> completed, and the API should support handling IO from other context
> >>>>> (not same with the io_uring context for handling io command).
> >>>>>
> >>>>> 2) ubd target
> >>>>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
> >>>>> specific target implementation is built on the library too.
> >>>>>
> >>>>> It shouldn't be hard to work towards this direction, and I guess this
> >>>>> way should make current target implementation more clean.
> >>>>>
> >>>>
> >>>> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> >>>> Thanks, Ming.  Glad to co-work with you.
> >>>>
> >>>> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
> >>>
> >>> I have added libublk branch for working towards this direction, if we
> >>> cowork on libublk, please write patch against this branch, then I can
> >>> apply your patch directly.
> >>>
> >>> https://github.com/ming1/ubdsrv/tree/libublk
> >>
> >> Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot
> >> because:
> >> 1) target implementation will be built on top of libubdsrv and the target
> >>    should create pthread(ubdsrv loop) itself.
> >>
> >> 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv.
> >>    It was really a hard job. :-(
> > 
> > Both the two are not hard to do, and turns out that making libubdsrv is actually
> > one big cleanup.
> > 
> > All these works[1] are basically done:
> > 
> > 1) libublksrv
> > - built .so and .a are under lib/
> > - exported header file is include/ublksrv.h
> > - so any other application can make ublk device against this library
> > - eventfd notification is added too, so io handling doesn't have to
> > be done via io_uring, one callback of ->handle_event(), and two APIs
> > are added for this support
> > 
> > 2) ublk/ubd utility
> > - built against libublksrv, meantime it uses the private header of the
> > library too, which is fine, since the two are in same project
> > 
> > 3) two examples
> > - demo_null.c: one < 200 LOC standalone example to show how to make
> > a ubd/null block device against libublksrv
> > 
> > - demo_event.c: one simple standalone example(~300LOC) to make one ublk
> > disk by handling io via another pthread(not by io_uring) against
> > libublksrv
> > 
> > Any comments/feedback/tests are welcome.
> > 
> > 
> > [1] libublk
> > https://github.com/ming1/ubdsrv/tree/libublk
> > 
> 
> Hi Ming,
> 
> Thanks for your libublk code.
> You almost have done everything we discussed before :)
> 
> One small question:
> Could you please write another demo_event_loop.c:

That is actually easy, but I'd suggest you do it, and I can apply
the patch to the ublksrv tree once it is ready. It is also a good chance
for you to understand the recent changes in libublksrv.

I have written a loop target based on io_uring, and it performs very well.

The eventfd interface is actually for existing projects that can't use
io_uring, or where io_uring may not implement the needed function; that
should be your use case.

> 
> 1) based on demo_event.c(handling io via another pthread)
>    but the loop backend is a fd(file). User calls pwrite/pread with the fd.
>    
> 2) User-provided buffer: data buffer is allocated by backend
>    (not posix_memalign() in ublksrv.c)
>    and it is passed to ublksrv(finally to ubd_drv in kernel) by libublksrv's API
>    ( maybe passed as an arg to ublksrv_complete_io()? )
> 
> 3) I think pread/pwrite should be in demo_event_real_io_handler_fn() 
>    for all IOs of each pending list. Am I correct?

Exactly, the pthread running demo_event_real_io_handler_fn() is for handling
io from /dev/ublkbN.

> 
> 4) Shall we consider multiple pthreads calling demo_event_real_io_handler_fn()
>    (they are io worker threads actually)
>    so we can handle IO concurrently?

That can be done, but synchronization among these pthreads has to be
handled carefully; a minimal sketch of one way to do it is below.
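
For instance, nothing fancier than a mutex/condvar protected pending
list per queue should do; purely as an illustration (plain pthreads,
not the libublksrv code):

/* Sketch: one pending list per ubd queue, shared by several io worker
 * pthreads. The queue thread pushes tags it received from the driver,
 * workers pop them; full-queue/overflow handling is omitted for brevity.
 */
#include <pthread.h>

#define MAX_PENDING 128

static int pending[MAX_PENDING];
static unsigned int head, tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;

/* called from the queue thread when new io requests are fetched */
static void enqueue_tag(int tag)
{
	pthread_mutex_lock(&lock);
	pending[tail++ % MAX_PENDING] = tag;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
}

/* called from each io worker pthread; blocks until one tag is available */
static int dequeue_tag(void)
{
	int tag;

	pthread_mutex_lock(&lock);
	while (head == tail)
		pthread_cond_wait(&more, &lock);
	tag = pending[head++ % MAX_PENDING];
	pthread_mutex_unlock(&lock);
	return tag;
}

static void *worker(void *arg)
{
	int tag = dequeue_tag();

	/* handle the io for this tag, then complete it and wake the queue
	 * thread (e.g. via the eventfd) */
	(void)tag;
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	enqueue_tag(0);
	pthread_join(tid, NULL);
	return 0;
}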

> 
> 
> I think designing more examples helps us design libublk's API more clearly.

The demo_event.c example should be enough as a basis for implementing the
loop/event case.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
  2022-07-04  9:49               ` Ziyang Zhang
  2022-07-04 11:03                 ` Ming Lei
@ 2022-07-04 13:50                 ` Ming Lei
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2022-07-04 13:50 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: linux-block, io-uring, linux-kernel, Gabriel Krisman Bertazi,
	Xiaoguang Wang, joseph.qi, ming.lei

On Mon, Jul 04, 2022 at 05:49:34PM +0800, Ziyang Zhang wrote:
> On 2022/7/4 12:08, Ming Lei wrote:
> > On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote:
> >> On 2022/6/30 17:09, Ming Lei wrote:
> >>> On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> >>>> Hi, Ming
> >>>>
> >>>> On 2022/6/29 19:33, Ming Lei wrote:
> >>>>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >>>>>> Hi Ming,
> >>>>>>
> >>>>>> On 2022/6/27 23:29, Ming Lei wrote:
> >>>>>>> Hi Ziyang,
> >>>>>>>
> >>>>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>>>>>> Hi Ming,
> >>>>>>>>
> >>>>>>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>>>>>> This article explains why we need libubd and how we design it.
> >>>>>>>>
> >>>>>>>> Related threads:
> >>>>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@redhat.com/
> >>>>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@redhat.com/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>>>>>> allows users to define their own backend storage in userspace
> >>>>>>>> and provides block devices such as /dev/ubdbX.
> >>>>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>>>>>> and userspace code: ubdsrv[3].
> >>>>>>>>
> >>>>>>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>>>>>> is pretty well-designed.
> >>>>>>>>
> >>>>>>>> ubdsrv is implemented by a single daemon
> >>>>>>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>>>>>> is embedded in the daemon. 
> >>>>>>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>>>>>> by our backend.
> >>>>>>>
> >>>>>>> ubd is supposed to provide one generic framework for user space block
> >>>>>>> driver, and it can be used for doing lots of fun/useful thing.
> >>>>>>>
> >>>>>>> If I understand correctly, this isn't same with your use case:
> >>>>>>>
> >>>>>>> 1) your user space block driver isn't generic, and should be dedicated
> >>>>>>> for Alibaba's uses
> >>>>>>>
> >>>>>>> 2) your case has been there for long time, and you want to switch from other
> >>>>>>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, you are correct :)
> >>>>>> The idea of design libubd is actually from libtcmu.
> >>>>>>
> >>>>>> We do have some userspace storage system as the IO handling backend, 
> >>>>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>>>>>
> >>>>>>
> >>>>>> I think your motivation is that provides a complete user block driver to users
> >>>>>> and they DO NOT change any code.
> >>>>>> Users DO change their code using libubd for embedding libubd into the backend.
> >>>>>>
> >>>>>>
> >>>>>>>> First is description of our backend:
> >>>>>>>>
> >>>>>>>> (1) a distributing system sends/receives IO requests 
> >>>>>>>>     through network.
> >>>>>>>>
> >>>>>>>> (2) The system use RPC calls among hundreds of
> >>>>>>>>      storage servers and RPC calls are associated with data buffers
> >>>>>>>>      allocated from a memory pool.
> >>>>>>>>
> >>>>>>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>>>>>      many threads to handle IO requests and manage the device. 
> >>>>>>>>
> >>>>>>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>>>>>
> >>>>>>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>>>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>>>>>     uses something like RPC and does not support io_uring.
> >>>>>>>
> >>>>>>> As one generic framework, the io command has to be io_uring
> >>>>>>> passthrough, and the io doesn't have to be handled by io_uring.
> >>>>>>
> >>>>>> Yes, our backend define its own communicating method.
> >>>>>>
> >>>>>>>
> >>>>>>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>>>>>> (io uring) as the 1st citizen in the framework, especially for new
> >>>>>>> driver.
> >>>>>>>
> >>>>>>> But it can support other way really, such as use io_uring with eventfd,
> >>>>>>> the other userspace context can handle io, then wake up io_uring context
> >>>>>>> via eventfd. You may not use io_uring for handling io, but you still
> >>>>>>> need to communicate with the context for handling io_uring passthrough
> >>>>>>> command, and one mechanism(such as eventfd) has to be there for the
> >>>>>>> communication.
> >>>>>>
> >>>>>> Ok, eventfd may be helpful. 
> >>>>>> If you read my API, you may find ubdlib_complete_io_request().
> >>>>>> I think the backend io worker thread can call this function to tell the 
> >>>>>> ubd queue thread(the io_uring context in it) to commit the IO.
> >>>>>
> >>>>> The ubdlib_complete_io_request() has to be called in the same pthread
> >>>>> context, that looks not flexible. When you handle IO via non-io_uring in the same
> >>>>> context, the cpu utilization in submission/completion side should be
> >>>>> higher than io_uring. And this way should be worse than the usage in
> >>>>> ubd/loop, that is why I suggest to use one io_uring for handling both
> >>>>> io command and io request if possible.
> >>>>
> >>>> ubdlib_complete_io_request() can be called in the io worker thread,
> >>>> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> >>>>
> >>>> You can find ubd_runner.c in my libubd repo. There are many io worker
> >>>> threads for each ubdsrv queue to handle IO requests.
> >>>>
> >>>> Actually this idea comes from tcmu-runner. The data flow is:
> >>>>
> >>>> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> >>>>
> >>>> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
> >>>>    
> >>>>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
> >>>>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
> >>>>     
> >>>>    for other simple(can be handled very quickly), 
> >>>>    handle it right now and call ubdlib_complete_io_request()
> >>>>
> >>>> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
> >>>>    and setup sqe if one IO(IO completion) is ready to commit.
> >>>>    
> >>>>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> >>>>
> >>>>
> >>>> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
> >>>>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
> >>>>    
> >>>> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
> >>>>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
> >>>>    
> >>>>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
> >>>>    This function can mark this  IO req's status to ready to commit.
> >>>>
> >>>> IO handling/completion and io_uring_enter() can happen at the same time.
> >>>>
> >>>> Besides, io_uring_enter can:
> >>>>
> >>>> 1) block and wait for cqes until at least
> >>>> one blk-mq req comes from queue_rq()
> >>>>
> >>>> 2) submit sqes(with last IO completion and next fetch)
> >>>>
> >>>> so I have to consider how to notify io_uring about io completion 
> >>>> after io_uring_enter() is slept(block and wait for cqes).
> >>>
> >>> Yeah, that was exactly my question, :-)
> >>>
> >>>>
> >>>> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> >>>> io_uring_enter_timeout() and caller can set a timeout value for it.
> >>>> So IO completions happen after io_uring_enter_timeout() call can be committed
> >>>> by next io_uring_enter_timeout() call...
> >>>>
> >>>> But this is a very ugly implementation 
> >>>> because I may waste CPU on useless loops in ubdsrv queue thread if
> >>>> blk-mq reqs do not income frequently.
> >>>>
> >>>> You mentioned that eventfd may be helpful and I agree with you. :)
> >>>> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> >>>> in  ubdlib_complete_io_request().
> >>>>
> >>>> I will fix my code.
> >>>
> >>> FYI, there is one example about using eventfd to wakeup io_uring, which
> >>> can be added to the library for your usecase:
> >>>
> >>> https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b
> >>
> >> Thanks, will take a view.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>>>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>>>>>     the daemon. It is inconvenient for our backend
> >>>>>>>>     because it has threads(from a C++ thread library) running inside.
> >>>>>>>
> >>>>>>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>>>>>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Sorry I made a mistake.
> >>>>>>
> >>>>>> I mean from user's view, 
> >>>>>> he has to type list/del/stop from cmdlind to control the daemon.
> >>>>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>>>>>
> >>>>>> This is a little weird if we try to make a ubd library.
> >>>>>> So I actually provides APIs in libubd for users to do these list/del/stop works.
> >>>>>
> >>>>> OK, that is fine to export APIs for admin purpose.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>>>>>     The data flow is:
> >>>>>>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>>>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>>>>>     the second copy is unavoidable. 
> >>>>>>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>>>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>>>>>
> >>>>>>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>>>>>> ubdsrv does support buffer provided by io command, see:
> >>>>>>>
> >>>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>>>
> >>>>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >>>>>> but you did not reply :)
> >>>>>>
> >>>>>> I paste it here:
> >>>>>>
> >>>>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >>>>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >>>>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>>>>>
> >>>>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >>>>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> >>>>>
> >>>>> Usually posix_memalign just allocates virtual memory which is unlimited
> >>>>> in 64bit arch, and pages should be allocated until the buffer is read or write.
> >>>>> After the READ/WRITE is done, kernel still can reclaim the pages in this
> >>>>> virtual memory.
> >>>>>
> >>>>> In future, we still may optimize the memory uses via madvise, such as
> >>>>> MADV_DONTNEED, after the slot is idle for long enough.
> >>>>
> >>>> Ok, thanks for explanation. 
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Another IMPORTANT problem is your commit:
> >>>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>>> may be not helpful for WRITE requests if I understand correctly.
> >>>>>>
> >>>>>> Consider this data flow:
> >>>>>>
> >>>>>> 1. ubdsrv commits an IO req(req1, a READ req).
> >>>>>>
> >>>>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>>>>>    addr1 is the addr of buffer user passed.
> >>>>>>    
> >>>>>>
> >>>>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>>>>>
> >>>>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>>>>>
> >>>>>> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>>>>>
> >>>>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>>>>>
> >>>>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>>>>>    cmd(with buffer set to addr1).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >>>>>> until you get one in queue_rq. So length of addr1 may be less than data_len.
> >>>>>
> >>>>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> >>>>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> >>>>> waste, but process virtual address is unlimited for 64bit arch, and
> >>>>> pages are allocated until actual read/write is started.
> >>>>
> >>>> Ok, since I allow users to config rq_max_blocks in libubd, 
> >>>> it's users' responsibility to ensure length of user buffers
> >>>> is at least rq_max_blocks.
> >>>>
> >>>> Now I agree on your commit:
> >>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>>>
> >>>> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>>>>>> It does not assume implementation of backend and can be embedded into it.
> >>>>>>>> We refer to the code structure of tcmu-runner[4], 
> >>>>>>>> which includes a library(libtcmu) for users 
> >>>>>>>> to embed tcmu-runner inside backend's code. 
> >>>>>>>> It:
> >>>>>>>>
> >>>>>>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>>>>>
> >>>>>>> That is because your backend may not use io_uring, I guess.
> >>>>>>>
> >>>>>>> But it is pretty easy to move the decision of creating pthread to target
> >>>>>>> code, which can be done in the interface of .prepare_target().
> >>>>>>
> >>>>>> I think the library should not create any thread if we want a libubd.
> >>>>>
> >>>>> I Agree.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>>>>>     and fetch/commit IO requests
> >>>>>>>
> >>>>>>> The above could be the main job of libubd.
> >>>>>>
> >>>>>> indeed.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>>>>>     since the backend actually has no knowledge 
> >>>>>>>>     on incoming data size until it gets an IO descriptor.
> >>>>>>>
> >>>>>>> I can understand your requirement, not look at your code yet, but libubd
> >>>>>>> should be pretty thin from function viewpoint, and there are lots of common
> >>>>>>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>>>>>
> >>>>>>> https://github.com/ming1/ubdsrv/commits/master
> >>>>>>>
> >>>>>>> in which:
> >>>>>>> 	- coroutine is added for handling target io
> >>>>>>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>>>>>> 	supporting complicated target
> >>>>>>> 	- c++ support
> >>>>>>
> >>>>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >>>>>> I think it is actually target(backend) design and ubd should not assume 
> >>>>>> how the backend handle IOs. 
> >>>>>>
> >>>>>> The work ubd in userspace has to be done is:
> >>>>>>
> >>>>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>>>>>
> >>>>>> 2) get IO completion form backend, such as ubd_complete_io_requests()
> >>>>>
> >>>>> Or the user provides/registers two callbacks: handle_io_async() and
> >>>>> io_complete(), the former is called when one request comes from ubd
> >>>>> driver, the latter(optional) is called when one io is done.
> >>>>>
> >>>>> Also you didn't mention how you notify io_uring about io completion after
> >>>>> io_uring_enter() is slept if your backend code doesn't use io_uring to
> >>>>> handle io.
> >>>>>
> >>>>> I think one communication mechanism(such as eventfd) is needed for your
> >>>>> case.
> >>>>
> >>>> Ok, I will try eventfd with io_uring.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>>>>>> into ubdsrv easily. The potential users could be existed usersapce
> >>>>>>> block driver projects.
> >>>>>>
> >>>>>> Yes, so many userspace storage systems can use ubd!
> >>>>>> You may look at tcmu-runner. It:
> >>>>>>
> >>>>>> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>>>>>
> >>>>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>>>>>    for those who just want to run it. 
> >>>>>>    And the runner is build on top of libtcmu.
> >>>>>>
> >>>>>>>
> >>>>>>> If you don't object, I am happy to co-work with you to add the support
> >>>>>>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>>>>>
> >>>>>> +1 :)
> >>>>>
> >>>>> Thinking of further, I'd suggest to split ubdsrv into two parts:
> >>>>>
> >>>>> 1) libubdsrv
> >>>>> - provide APIs like what you did in libubd
> >>>>> - provide API for notify io_uring(handling io command) that one io is
> >>>>> completed, and the API should support handling IO from other context
> >>>>> (not same with the io_uring context for handling io command).
> >>>>>
> >>>>> 2) ubd target
> >>>>> - built on libubdsrv, such as ubd command is built on libubdsrv, and
> >>>>> specific target implementation is built on the library too.
> >>>>>
> >>>>> It shouldn't be hard to work towards this direction, and I guess this
> >>>>> way should make current target implementation more clean.
> >>>>>
> >>>>
> >>>> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> >>>> Thanks, Ming.  Glad to co-work with you.
> >>>>
> >>>> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).
> >>>
> >>> I have added libublk branch for working towards this direction, if we
> >>> cowork on libublk, please write patch against this branch, then I can
> >>> apply your patch directly.
> >>>
> >>> https://github.com/ming1/ubdsrv/tree/libublk
> >>
> >> Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot
> >> because:
> >> 1) target implementation will be built on top of libubdsrv and the target
> >>    should create pthread(ubdsrv loop) itself.
> >>
> >> 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv.
> >>    It was really a hard job. :-(
> > 
> > Both the two are not hard to do, and turns out that making libubdsrv is actually
> > one big cleanup.
> > 
> > All these works[1] are basically done:
> > 
> > 1) libublksrv
> > - built .so and .a are under lib/
> > - exported header file is include/ublksrv.h
> > - so any other application can make ublk device against this library
> > - eventfd notification is added too, so io handling doesn't have to
> > be done via io_uring, one callback of ->handle_event(), and two APIs
> > are added for this support
> > 
> > 2) ublk/ubd utility
> > - built against libublksrv, meantime it uses the private header of the
> > library too, which is fine, since the two are in same project
> > 
> > 3) two examples
> > - demo_null.c: one < 200 LOC standalone example to show how to make
> > a ubd/null block device against libublksrv
> > 
> > - demo_event.c: one simple standalone example(~300LOC) to make one ublk
> > disk by handling io via another pthread(not by io_uring) against
> > libublksrv
> > 
> > Any comments/feedback/tests are welcome.
> > 
> > 
> > [1] libublk
> > https://github.com/ming1/ubdsrv/tree/libublk
> > 
> 
> Hi Ming,
> 
> Thanks for your libublk code.
> You almost have done everything we discussed before :)
> 
> One small question:
> Could you please write another demo_event_loop.c:
> 
> 1) based on demo_event.c(handling io via another pthread)
>    but the loop backend is a fd(file). User calls pwrite/pread with the fd.
>    
> 2) User-provided buffer: data buffer is allocated by backend
>    (not posix_memalign() in ublksrv.c)
>    and it is passed to ublksrv(finally to ubd_drv in kernel) by libublksrv's API
>    ( maybe passed as an arg to ublksrv_complete_io()? )

The current design requires the buffer to be pre-allocated, because
issuing each io command to the ublk driver needs the buffer address, see
ublksrv_queue_io_cmd():
	...
	cmd->addr       = (__u64)io->buf_addr;
	...

We don't know when the io command will return from the ublk driver, and
the pages mapped to the anonymous buffer can still be dropped without
being swapped out.
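
Just to spell out that last point (an illustration of the idea only, not
necessarily what ublksrv does today): since the pre-allocated buffer is
plain anonymous memory, the physical pages behind an idle slot can be
handed back with madvise(MADV_DONTNEED) while the virtual range stays
valid, so nothing has to be swapped out:

#include <sys/mman.h>
#include <stdlib.h>

#define DEMO_MAX_IO_SIZE (256 << 10)	/* 256KiB max io size discussed above */

/* pre-allocate one page-aligned per-tag buffer; this is only virtual
 * memory, pages are not populated until the buffer is read or written */
static void *demo_alloc_io_buf(void)
{
	void *buf = NULL;

	if (posix_memalign(&buf, 4096, DEMO_MAX_IO_SIZE))
		return NULL;
	return buf;
}

/* once a slot has been idle long enough, drop its physical pages; the
 * virtual range stays valid and is faulted in again on next use */
static void demo_shrink_idle_buf(void *buf)
{
	madvise(buf, DEMO_MAX_IO_SIZE, MADV_DONTNEED);
}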

It should be easy to provide an API for target code to allocate io buffers,
but as I said, the buffer has to be pre-allocated. One easy way is to add
callbacks of ->allocate[free]_io_buffer(struct ublksrv_queue *q, int tag)
for target code; they would be called from ublksrv_queue_init(), and the
behavior could still be controlled by one flag, such as UBLK_F_NEED_TGT_IO_BUF.
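
For illustration, such callbacks could roughly take the shape below; the
struct name and the exact hook points are assumptions for the sketch, not
the current libublksrv API:

struct ublksrv_queue;

/* hypothetical shape of the proposal above */
struct demo_tgt_buf_ops {
	/* called from ublksrv_queue_init() for each tag when
	 * UBLK_F_NEED_TGT_IO_BUF is set; the returned address is what
	 * ends up in cmd->addr for that tag's io commands */
	void *(*allocate_io_buffer)(struct ublksrv_queue *q, int tag);

	/* called when the queue is deinitialized */
	void (*free_io_buffer)(struct ublksrv_queue *q, int tag, void *buf);
};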

Please feel free to let me know if your requirement can be met by adding
the ->allocate[free]_io_buffer(struct ublksrv_queue *q, int tag)
callbacks.


Thanks,
Ming



end of thread, other threads:[~2022-07-04 13:50 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-27  8:20 [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough) Ziyang Zhang
2022-06-27 15:29 ` Ming Lei
2022-06-29  3:22   ` Ziyang Zhang
2022-06-29 11:33     ` Ming Lei
2022-06-30  7:16       ` Ziyang Zhang
2022-06-30  9:09         ` Ming Lei
2022-06-30  9:29           ` Ziyang Zhang
2022-06-30 11:45             ` Ming Lei
2022-07-04  4:08             ` Ming Lei
2022-07-04  9:49               ` Ziyang Zhang
2022-07-04 11:03                 ` Ming Lei
2022-07-04 13:50                 ` Ming Lei
