All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] block drivers in user space
@ 2022-02-21 19:59 Gabriel Krisman Bertazi
  2022-02-21 23:16 ` Damien Le Moal
                   ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-02-21 19:59 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-block

I'd like to discuss an interface to implement user space block devices,
while avoiding local network NBD solutions.  There has been reiterated
interest in the topic, both from researchers [1] and from the community,
including a proposed session in LSFMM2018 [2] (though I don't think it
happened).

I've been working on top of the Google iblock implementation to find
something upstreamable and would like to present my design and gather
feedback on some points, in particular zero-copy and overall user space
interface.

The design I'm pending towards uses special fds opened by the driver to
transfer data to/from the block driver, preferably through direct
splicing as much as possible, to keep data only in kernel space.  This
is because, in my use case, the driver usually only manipulates
metadata, while data is forwarded directly through the network, or
similar. It would be neat if we can leverage the existing
splice/copy_file_range syscalls such that we don't ever need to bring
disk data to user space, if we can avoid it.  I've also experimented
with regular pipes, But I found no way around keeping a lot of pipes
opened, one for each possible command 'slot'.

[1] https://dl.acm.org/doi/10.1145/3456727.3463768
[2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
@ 2022-02-21 23:16 ` Damien Le Moal
  2022-02-21 23:30   ` Gabriel Krisman Bertazi
  2022-02-22  6:57 ` Hannes Reinecke
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Damien Le Moal @ 2022-02-21 23:16 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 2/22/22 04:59, Gabriel Krisman Bertazi wrote:
> I'd like to discuss an interface to implement user space block devices,
> while avoiding local network NBD solutions.  There has been reiterated
> interest in the topic, both from researchers [1] and from the community,
> including a proposed session in LSFMM2018 [2] (though I don't think it
> happened).
> 
> I've been working on top of the Google iblock implementation to find
> something upstreamable and would like to present my design and gather
> feedback on some points, in particular zero-copy and overall user space
> interface.
> 
> The design I'm pending towards uses special fds opened by the driver to
> transfer data to/from the block driver, preferably through direct
> splicing as much as possible, to keep data only in kernel space.  This
> is because, in my use case, the driver usually only manipulates
> metadata, while data is forwarded directly through the network, or
> similar. It would be neat if we can leverage the existing
> splice/copy_file_range syscalls such that we don't ever need to bring
> disk data to user space, if we can avoid it.  I've also experimented
> with regular pipes, But I found no way around keeping a lot of pipes
> opened, one for each possible command 'slot'.
> 
> [1] https://dl.acm.org/doi/10.1145/3456727.3463768

This is $15 for non ACM members... Any public download available ?

> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 23:16 ` Damien Le Moal
@ 2022-02-21 23:30   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-02-21 23:30 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: lsf-pc, linux-block

Damien Le Moal <damien.lemoal@opensource.wdc.com> writes:

> On 2/22/22 04:59, Gabriel Krisman Bertazi wrote:
>> I'd like to discuss an interface to implement user space block devices,
>> while avoiding local network NBD solutions.  There has been reiterated
>> interest in the topic, both from researchers [1] and from the community,
>> including a proposed session in LSFMM2018 [2] (though I don't think it
>> happened).
>> 
>> I've been working on top of the Google iblock implementation to find
>> something upstreamable and would like to present my design and gather
>> feedback on some points, in particular zero-copy and overall user space
>> interface.
>> 
>> The design I'm pending towards uses special fds opened by the driver to
>> transfer data to/from the block driver, preferably through direct
>> splicing as much as possible, to keep data only in kernel space.  This
>> is because, in my use case, the driver usually only manipulates
>> metadata, while data is forwarded directly through the network, or
>> similar. It would be neat if we can leverage the existing
>> splice/copy_file_range syscalls such that we don't ever need to bring
>> disk data to user space, if we can avoid it.  I've also experimented
>> with regular pipes, But I found no way around keeping a lot of pipes
>> opened, one for each possible command 'slot'.
>> 
>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>
> This is $15 for non ACM members... Any public download available ?


Hi Damien,

Yes. Sorry for the ACM link.  yuck.  One of the authors published the
paper here:

https://rgmacedo.github.io/files/2021/systor21-bdus/bdus-paper.pdf

>
>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>> 

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
  2022-02-21 23:16 ` Damien Le Moal
@ 2022-02-22  6:57 ` Hannes Reinecke
  2022-02-22 14:46   ` Sagi Grimberg
                     ` (3 more replies)
  2022-02-23  5:57 ` Gao Xiang
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 54+ messages in thread
From: Hannes Reinecke @ 2022-02-22  6:57 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
> I'd like to discuss an interface to implement user space block devices,
> while avoiding local network NBD solutions.  There has been reiterated
> interest in the topic, both from researchers [1] and from the community,
> including a proposed session in LSFMM2018 [2] (though I don't think it
> happened).
> 
> I've been working on top of the Google iblock implementation to find
> something upstreamable and would like to present my design and gather
> feedback on some points, in particular zero-copy and overall user space
> interface.
> 
> The design I'm pending towards uses special fds opened by the driver to
> transfer data to/from the block driver, preferably through direct
> splicing as much as possible, to keep data only in kernel space.  This
> is because, in my use case, the driver usually only manipulates
> metadata, while data is forwarded directly through the network, or
> similar. It would be neat if we can leverage the existing
> splice/copy_file_range syscalls such that we don't ever need to bring
> disk data to user space, if we can avoid it.  I've also experimented
> with regular pipes, But I found no way around keeping a lot of pipes
> opened, one for each possible command 'slot'.
> 
> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> 
Actually, I'd rather have something like an 'inverse io_uring', where an 
application creates a memory region separated into several 'ring' for 
submission and completion.
Then the kernel could write/map the incoming data onto the rings, and 
application can read from there.
Maybe it'll be worthwhile to look at virtio here.

But in either case, using fds or pipes for commands doesn't really 
scale, as the number of fds is inherently limited. And using fds 
restricts you to serial processing (as you can read only sequentially 
from a fd); with mmap() you'll get a greater flexibility and the option 
of parallel processing.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22  6:57 ` Hannes Reinecke
@ 2022-02-22 14:46   ` Sagi Grimberg
  2022-02-22 17:46     ` Hannes Reinecke
  2022-02-22 18:05     ` Gabriel Krisman Bertazi
  2022-02-22 18:05   ` Bart Van Assche
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-02-22 14:46 UTC (permalink / raw)
  To: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block


> Actually, I'd rather have something like an 'inverse io_uring', where an 
> application creates a memory region separated into several 'ring' for 
> submission and completion.
> Then the kernel could write/map the incoming data onto the rings, and 
> application can read from there.
> Maybe it'll be worthwhile to look at virtio here.

There is lio loopback backed by tcmu... I'm assuming that nvmet can
hook into the same/similar interface. nvmet is pretty lean, and we
can probably help tcmu/equivalent scale better if that is a concern...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22 14:46   ` Sagi Grimberg
@ 2022-02-22 17:46     ` Hannes Reinecke
  2022-02-22 18:05     ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2022-02-22 17:46 UTC (permalink / raw)
  To: Sagi Grimberg, Gabriel Krisman Bertazi, lsf-pc, Mike Christie; +Cc: linux-block

On 2/22/22 15:46, Sagi Grimberg wrote:
> 
>> Actually, I'd rather have something like an 'inverse io_uring', where
>> an application creates a memory region separated into several 'ring'
>> for submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> 
> There is lio loopback backed by tcmu... I'm assuming that nvmet can
> hook into the same/similar interface. nvmet is pretty lean, and we
> can probably help tcmu/equivalent scale better if that is a concern...

Yeah; maybe. I've had a look at tcmu, but it would need to be updated to
handle multiple rings.
Mike? What'd you say?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22  6:57 ` Hannes Reinecke
  2022-02-22 14:46   ` Sagi Grimberg
@ 2022-02-22 18:05   ` Bart Van Assche
  2022-03-02 23:04   ` Gabriel Krisman Bertazi
  2022-03-27 16:35   ` Ming Lei
  3 siblings, 0 replies; 54+ messages in thread
From: Bart Van Assche @ 2022-02-22 18:05 UTC (permalink / raw)
  To: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 2/21/22 22:57, Hannes Reinecke wrote:
> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>> I'd like to discuss an interface to implement user space block devices,
>> while avoiding local network NBD solutions.  There has been reiterated
>> interest in the topic, both from researchers [1] and from the community,
>> including a proposed session in LSFMM2018 [2] (though I don't think it
>> happened).
>>
>> I've been working on top of the Google iblock implementation to find
>> something upstreamable and would like to present my design and gather
>> feedback on some points, in particular zero-copy and overall user space
>> interface.
>>
>> The design I'm pending towards uses special fds opened by the driver to
>> transfer data to/from the block driver, preferably through direct
>> splicing as much as possible, to keep data only in kernel space.  This
>> is because, in my use case, the driver usually only manipulates
>> metadata, while data is forwarded directly through the network, or
>> similar. It would be neat if we can leverage the existing
>> splice/copy_file_range syscalls such that we don't ever need to bring
>> disk data to user space, if we can avoid it.  I've also experimented
>> with regular pipes, But I found no way around keeping a lot of pipes
>> opened, one for each possible command 'slot'.
>>
>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html

There have been more discussions about this topic, e.g. a conversation 
about the Android block-device-in-user-space implementation. See also 
https://lore.kernel.org/all/20201203215859.2719888-1-palmer@dabbelt.com/

> Actually, I'd rather have something like an 'inverse io_uring', where an 
> application creates a memory region separated into several 'ring' for 
> submission and completion.
> Then the kernel could write/map the incoming data onto the rings, and 
> application can read from there.

+1 for using command rings to communicate with user space.

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22 14:46   ` Sagi Grimberg
  2022-02-22 17:46     ` Hannes Reinecke
@ 2022-02-22 18:05     ` Gabriel Krisman Bertazi
  2022-02-24  9:37       ` Xiaoguang Wang
  2022-02-24 10:12       ` Sagi Grimberg
  1 sibling, 2 replies; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-02-22 18:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Hannes Reinecke, lsf-pc, linux-block

Sagi Grimberg <sagi@grimberg.me> writes:

>> Actually, I'd rather have something like an 'inverse io_uring', where
>> an application creates a memory region separated into several 'ring'
>> for submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
>
> There is lio loopback backed by tcmu... I'm assuming that nvmet can
> hook into the same/similar interface. nvmet is pretty lean, and we
> can probably help tcmu/equivalent scale better if that is a concern...

Sagi,

I looked at tcmu prior to starting this work.  Other than the tcmu
overhead, one concern was the complexity of a scsi device interface
versus sending block requests to userspace.

What would be the advantage of doing it as a nvme target over delivering
directly to userspace as a block driver?

Also, when considering the case where userspace wants to just look at the IO
descriptor, without actually sending data to userspace, I'm not sure
that would be doable with tcmu?

Another attempt to do the same thing here, now with device-mapper:

https://patchwork.kernel.org/project/dm-devel/patch/20201203215859.2719888-4-palmer@dabbelt.com/

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
  2022-02-21 23:16 ` Damien Le Moal
  2022-02-22  6:57 ` Hannes Reinecke
@ 2022-02-23  5:57 ` Gao Xiang
  2022-02-23  7:46   ` Damien Le Moal
  2022-03-02 16:52 ` Mike Christie
  2022-03-05  7:29 ` Dongsheng Yang
  4 siblings, 1 reply; 54+ messages in thread
From: Gao Xiang @ 2022-02-23  5:57 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> I'd like to discuss an interface to implement user space block devices,
> while avoiding local network NBD solutions.  There has been reiterated
> interest in the topic, both from researchers [1] and from the community,
> including a proposed session in LSFMM2018 [2] (though I don't think it
> happened).
> 
> I've been working on top of the Google iblock implementation to find
> something upstreamable and would like to present my design and gather
> feedback on some points, in particular zero-copy and overall user space
> interface.
> 
> The design I'm pending towards uses special fds opened by the driver to
> transfer data to/from the block driver, preferably through direct
> splicing as much as possible, to keep data only in kernel space.  This
> is because, in my use case, the driver usually only manipulates
> metadata, while data is forwarded directly through the network, or
> similar. It would be neat if we can leverage the existing
> splice/copy_file_range syscalls such that we don't ever need to bring
> disk data to user space, if we can avoid it.  I've also experimented
> with regular pipes, But I found no way around keeping a lot of pipes
> opened, one for each possible command 'slot'.
> 
> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html

I'm interested in this general topic too. One of our use cases is
that we need to process network data in some degree since many
protocols are application layer protocols so it seems more reasonable
to process such protocols in userspace. And another difference is that
we may have thousands of devices in a machine since we'd better to run
containers as many as possible so the block device solution seems
suboptimal to us. Yet I'm still interested in this topic to get more
ideas.

Btw, As for general userspace block device solutions, IMHO, there could
be some deadlock issues out of direct reclaim, writeback, and userspace
implementation due to writeback user requests can be tripped back to
the kernel side (even the dependency crosses threads). I think they are
somewhat hard to fix with user block device solutions. For example,
https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com

Thanks,
Gao Xiang

> 
> -- 
> Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-23  5:57 ` Gao Xiang
@ 2022-02-23  7:46   ` Damien Le Moal
  2022-02-23  8:11     ` Gao Xiang
  0 siblings, 1 reply; 54+ messages in thread
From: Damien Le Moal @ 2022-02-23  7:46 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc, linux-block

On 2/23/22 14:57, Gao Xiang wrote:
> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
>> I'd like to discuss an interface to implement user space block devices,
>> while avoiding local network NBD solutions.  There has been reiterated
>> interest in the topic, both from researchers [1] and from the community,
>> including a proposed session in LSFMM2018 [2] (though I don't think it
>> happened).
>>
>> I've been working on top of the Google iblock implementation to find
>> something upstreamable and would like to present my design and gather
>> feedback on some points, in particular zero-copy and overall user space
>> interface.
>>
>> The design I'm pending towards uses special fds opened by the driver to
>> transfer data to/from the block driver, preferably through direct
>> splicing as much as possible, to keep data only in kernel space.  This
>> is because, in my use case, the driver usually only manipulates
>> metadata, while data is forwarded directly through the network, or
>> similar. It would be neat if we can leverage the existing
>> splice/copy_file_range syscalls such that we don't ever need to bring
>> disk data to user space, if we can avoid it.  I've also experimented
>> with regular pipes, But I found no way around keeping a lot of pipes
>> opened, one for each possible command 'slot'.
>>
>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> 
> I'm interested in this general topic too. One of our use cases is
> that we need to process network data in some degree since many
> protocols are application layer protocols so it seems more reasonable
> to process such protocols in userspace. And another difference is that
> we may have thousands of devices in a machine since we'd better to run
> containers as many as possible so the block device solution seems
> suboptimal to us. Yet I'm still interested in this topic to get more
> ideas.
> 
> Btw, As for general userspace block device solutions, IMHO, there could
> be some deadlock issues out of direct reclaim, writeback, and userspace
> implementation due to writeback user requests can be tripped back to
> the kernel side (even the dependency crosses threads). I think they are
> somewhat hard to fix with user block device solutions. For example,
> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com

This is already fixed with prctl() support. See:

https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-23  7:46   ` Damien Le Moal
@ 2022-02-23  8:11     ` Gao Xiang
  2022-02-23 22:40       ` Damien Le Moal
  0 siblings, 1 reply; 54+ messages in thread
From: Gao Xiang @ 2022-02-23  8:11 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block

On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> On 2/23/22 14:57, Gao Xiang wrote:
> > On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> >> I'd like to discuss an interface to implement user space block devices,
> >> while avoiding local network NBD solutions.  There has been reiterated
> >> interest in the topic, both from researchers [1] and from the community,
> >> including a proposed session in LSFMM2018 [2] (though I don't think it
> >> happened).
> >>
> >> I've been working on top of the Google iblock implementation to find
> >> something upstreamable and would like to present my design and gather
> >> feedback on some points, in particular zero-copy and overall user space
> >> interface.
> >>
> >> The design I'm pending towards uses special fds opened by the driver to
> >> transfer data to/from the block driver, preferably through direct
> >> splicing as much as possible, to keep data only in kernel space.  This
> >> is because, in my use case, the driver usually only manipulates
> >> metadata, while data is forwarded directly through the network, or
> >> similar. It would be neat if we can leverage the existing
> >> splice/copy_file_range syscalls such that we don't ever need to bring
> >> disk data to user space, if we can avoid it.  I've also experimented
> >> with regular pipes, But I found no way around keeping a lot of pipes
> >> opened, one for each possible command 'slot'.
> >>
> >> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> >> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > 
> > I'm interested in this general topic too. One of our use cases is
> > that we need to process network data in some degree since many
> > protocols are application layer protocols so it seems more reasonable
> > to process such protocols in userspace. And another difference is that
> > we may have thousands of devices in a machine since we'd better to run
> > containers as many as possible so the block device solution seems
> > suboptimal to us. Yet I'm still interested in this topic to get more
> > ideas.
> > 
> > Btw, As for general userspace block device solutions, IMHO, there could
> > be some deadlock issues out of direct reclaim, writeback, and userspace
> > implementation due to writeback user requests can be tripped back to
> > the kernel side (even the dependency crosses threads). I think they are
> > somewhat hard to fix with user block device solutions. For example,
> > https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> 
> This is already fixed with prctl() support. See:
> 
> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/

As I mentioned above, IMHO, we could add some per-task state to avoid
the majority of such deadlock cases (also what I mentioned above), but
there may still some potential dependency could happen between threads,
such as using another kernel workqueue and waiting on it (in principle
at least) since userspace program can call any syscall in principle (
which doesn't like in-kernel drivers). So I think it can cause some
risk due to generic userspace block device restriction, please kindly
correct me if I'm wrong.

Thanks,
Gao Xiang

> 
> 
> -- 
> Damien Le Moal
> Western Digital Research

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-23  8:11     ` Gao Xiang
@ 2022-02-23 22:40       ` Damien Le Moal
  2022-02-24  0:58         ` Gao Xiang
  0 siblings, 1 reply; 54+ messages in thread
From: Damien Le Moal @ 2022-02-23 22:40 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc, linux-block

On 2/23/22 17:11, Gao Xiang wrote:
> On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
>> On 2/23/22 14:57, Gao Xiang wrote:
>>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
>>>> I'd like to discuss an interface to implement user space block devices,
>>>> while avoiding local network NBD solutions.  There has been reiterated
>>>> interest in the topic, both from researchers [1] and from the community,
>>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>>> happened).
>>>>
>>>> I've been working on top of the Google iblock implementation to find
>>>> something upstreamable and would like to present my design and gather
>>>> feedback on some points, in particular zero-copy and overall user space
>>>> interface.
>>>>
>>>> The design I'm pending towards uses special fds opened by the driver to
>>>> transfer data to/from the block driver, preferably through direct
>>>> splicing as much as possible, to keep data only in kernel space.  This
>>>> is because, in my use case, the driver usually only manipulates
>>>> metadata, while data is forwarded directly through the network, or
>>>> similar. It would be neat if we can leverage the existing
>>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>>> disk data to user space, if we can avoid it.  I've also experimented
>>>> with regular pipes, But I found no way around keeping a lot of pipes
>>>> opened, one for each possible command 'slot'.
>>>>
>>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>>> I'm interested in this general topic too. One of our use cases is
>>> that we need to process network data in some degree since many
>>> protocols are application layer protocols so it seems more reasonable
>>> to process such protocols in userspace. And another difference is that
>>> we may have thousands of devices in a machine since we'd better to run
>>> containers as many as possible so the block device solution seems
>>> suboptimal to us. Yet I'm still interested in this topic to get more
>>> ideas.
>>>
>>> Btw, As for general userspace block device solutions, IMHO, there could
>>> be some deadlock issues out of direct reclaim, writeback, and userspace
>>> implementation due to writeback user requests can be tripped back to
>>> the kernel side (even the dependency crosses threads). I think they are
>>> somewhat hard to fix with user block device solutions. For example,
>>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
>>
>> This is already fixed with prctl() support. See:
>>
>> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> 
> As I mentioned above, IMHO, we could add some per-task state to avoid
> the majority of such deadlock cases (also what I mentioned above), but
> there may still some potential dependency could happen between threads,
> such as using another kernel workqueue and waiting on it (in principle
> at least) since userspace program can call any syscall in principle (
> which doesn't like in-kernel drivers). So I think it can cause some
> risk due to generic userspace block device restriction, please kindly
> correct me if I'm wrong.

Not sure what you mean with all this. prctl() works per process/thread
and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
set. So for the case of a user block device driver, setting this means
that it cannot reenter itself during a memory allocation, regardless of
the system call it executes (FS etc): all memory allocations in any
syscall executed by the context will have GFP_NOIO.

If the kernel-side driver for the user block device driver does any
allocation that does not have GFP_NOIO, or cause any such allocation
(e.g. within a workqueue it is waiting for), then that is a kernel bug.
Block device drivers are not supposed to ever do a memory allocation in
the IO hot path without GFP_NOIO.

> 
> Thanks,
> Gao Xiang
> 
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-23 22:40       ` Damien Le Moal
@ 2022-02-24  0:58         ` Gao Xiang
  2022-06-09  2:01           ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Gao Xiang @ 2022-02-24  0:58 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block

On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> On 2/23/22 17:11, Gao Xiang wrote:
> > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> >> On 2/23/22 14:57, Gao Xiang wrote:
> >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> >>>> I'd like to discuss an interface to implement user space block devices,
> >>>> while avoiding local network NBD solutions.  There has been reiterated
> >>>> interest in the topic, both from researchers [1] and from the community,
> >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> >>>> happened).
> >>>>
> >>>> I've been working on top of the Google iblock implementation to find
> >>>> something upstreamable and would like to present my design and gather
> >>>> feedback on some points, in particular zero-copy and overall user space
> >>>> interface.
> >>>>
> >>>> The design I'm pending towards uses special fds opened by the driver to
> >>>> transfer data to/from the block driver, preferably through direct
> >>>> splicing as much as possible, to keep data only in kernel space.  This
> >>>> is because, in my use case, the driver usually only manipulates
> >>>> metadata, while data is forwarded directly through the network, or
> >>>> similar. It would be neat if we can leverage the existing
> >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> >>>> disk data to user space, if we can avoid it.  I've also experimented
> >>>> with regular pipes, But I found no way around keeping a lot of pipes
> >>>> opened, one for each possible command 'slot'.
> >>>>
> >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> >>>
> >>> I'm interested in this general topic too. One of our use cases is
> >>> that we need to process network data in some degree since many
> >>> protocols are application layer protocols so it seems more reasonable
> >>> to process such protocols in userspace. And another difference is that
> >>> we may have thousands of devices in a machine since we'd better to run
> >>> containers as many as possible so the block device solution seems
> >>> suboptimal to us. Yet I'm still interested in this topic to get more
> >>> ideas.
> >>>
> >>> Btw, As for general userspace block device solutions, IMHO, there could
> >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> >>> implementation due to writeback user requests can be tripped back to
> >>> the kernel side (even the dependency crosses threads). I think they are
> >>> somewhat hard to fix with user block device solutions. For example,
> >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> >>
> >> This is already fixed with prctl() support. See:
> >>
> >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > 
> > As I mentioned above, IMHO, we could add some per-task state to avoid
> > the majority of such deadlock cases (also what I mentioned above), but
> > there may still some potential dependency could happen between threads,
> > such as using another kernel workqueue and waiting on it (in principle
> > at least) since userspace program can call any syscall in principle (
> > which doesn't like in-kernel drivers). So I think it can cause some
> > risk due to generic userspace block device restriction, please kindly
> > correct me if I'm wrong.
> 
> Not sure what you mean with all this. prctl() works per process/thread
> and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> set. So for the case of a user block device driver, setting this means
> that it cannot reenter itself during a memory allocation, regardless of
> the system call it executes (FS etc): all memory allocations in any
> syscall executed by the context will have GFP_NOIO.

I mean,

assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
but since it can call any valid system call, therefore, after it
received data due to direct reclaim and writeback, it is still
allowed to call some system call which may do something as follows:

   Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)

   (call some syscall which)

   submit something to Thread B
                                  
                                  ... (do something)

                                  memory allocation with GFP_KERNEL (it
                                  may trigger direct memory reclaim
                                  again and reenter the original fs.)

                                  wake up Thread A

   wait Thread B to complete

Normally such system call won't cause any problem since userspace
programs cannot be in a context out of writeback and direct reclaim.
Yet I'm not sure if it works under userspace block driver
writeback/direct reclaim cases.

> 
> If the kernel-side driver for the user block device driver does any
> allocation that does not have GFP_NOIO, or cause any such allocation
> (e.g. within a workqueue it is waiting for), then that is a kernel bug.
> Block device drivers are not supposed to ever do a memory allocation in
> the IO hot path without GFP_NOIO.

Yes, all in-kernel driver implementations needs to be audited with
proper memory allocation with GFP_NOIO, but userspace programs are
allowed to call any system call. And such system call can rely on
another process context with can do __GFP_FS allocation again. 

Thanks,
Gao Xiang

> 
> > 
> > Thanks,
> > Gao Xiang
> > 
> >>
> >>
> >> -- 
> >> Damien Le Moal
> >> Western Digital Research
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22 18:05     ` Gabriel Krisman Bertazi
@ 2022-02-24  9:37       ` Xiaoguang Wang
  2022-02-24 10:12       ` Sagi Grimberg
  1 sibling, 0 replies; 54+ messages in thread
From: Xiaoguang Wang @ 2022-02-24  9:37 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Sagi Grimberg
  Cc: Hannes Reinecke, lsf-pc, linux-block

hi,

> Sagi Grimberg <sagi@grimberg.me> writes:
>
>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>> an application creates a memory region separated into several 'ring'
>>> for submission and completion.
>>> Then the kernel could write/map the incoming data onto the rings, and
>>> application can read from there.
>>> Maybe it'll be worthwhile to look at virtio here.
>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>> hook into the same/similar interface. nvmet is pretty lean, and we
>> can probably help tcmu/equivalent scale better if that is a concern...
> Sagi,
>
> I looked at tcmu prior to starting this work.  Other than the tcmu
> overhead, one concern was the complexity of a scsi device interface
> versus sending block requests to userspace.
Yeah, Some of our costumers have tried to use tcmu and found obvious 
overhead, which
impact io throughput tremendously, especially it lacks zero-copy and 
multi-queue support.
Previously I have sent a report to tcmu community:
     https://www.spinics.net/lists/target-devel/msg21121.html

And currently I have implemented a zero-copy prototype for tcmu(not sent 
out yet), which
increases user's io throughput from 3.6GB to 11.5GB/s, fio 4 jobs, 8 
iodepth, io size 256kb.
This prototype uses remap_pfn_range() to map io requests' sg pages to 
user space, but
remap_pfn_range() have obvious overhead while intel pat is enabled. I 
also sent a mail to
mm community:
https://lore.kernel.org/linux-mm/c5526629-5ce4-1f99-e9af-36da2876b258@linux.alibaba.com/T/#u
About how to map sg pages to use space correctly, but there's no 
response yet.
If anybody is familiar with my question, may kindly give help, thanks.

Regards,
Xiaoguang Wang
>
> What would be the advantage of doing it as a nvme target over delivering
> directly to userspace as a block driver?
>
> Also, when considering the case where userspace wants to just look at the IO
> descriptor, without actually sending data to userspace, I'm not sure
> that would be doable with tcmu?
>
> Another attempt to do the same thing here, now with device-mapper:
>
> https://patchwork.kernel.org/project/dm-devel/patch/20201203215859.2719888-4-palmer@dabbelt.com/
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22 18:05     ` Gabriel Krisman Bertazi
  2022-02-24  9:37       ` Xiaoguang Wang
@ 2022-02-24 10:12       ` Sagi Grimberg
  2022-03-01 23:24         ` Khazhy Kumykov
  2022-03-02 16:16         ` Mike Christie
  1 sibling, 2 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-02-24 10:12 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: Hannes Reinecke, lsf-pc, linux-block


>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>> an application creates a memory region separated into several 'ring'
>>> for submission and completion.
>>> Then the kernel could write/map the incoming data onto the rings, and
>>> application can read from there.
>>> Maybe it'll be worthwhile to look at virtio here.
>>
>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>> hook into the same/similar interface. nvmet is pretty lean, and we
>> can probably help tcmu/equivalent scale better if that is a concern...
> 
> Sagi,
> 
> I looked at tcmu prior to starting this work.  Other than the tcmu
> overhead, one concern was the complexity of a scsi device interface
> versus sending block requests to userspace.

The complexity is understandable, though it can be viewed as a
capability as well. Note I do not have any desire to promote tcmu here,
just trying to understand if we need a brand new interface rather than
making the existing one better.

> What would be the advantage of doing it as a nvme target over delivering
> directly to userspace as a block driver?

Well, for starters you gain the features and tools that are extensively
used with nvme. Plus you get the ecosystem support (development,
features, capabilities and testing). There are clear advantages of
plugging into an established ecosystem.

> Also, when considering the case where userspace wants to just look at the IO
> descriptor, without actually sending data to userspace, I'm not sure
> that would be doable with tcmu?

Again, if tcmu is not a good starting point (never ran it myself) we can
think of starting with a clean slate.

> Another attempt to do the same thing here, now with device-mapper:
> 
> https://patchwork.kernel.org/project/dm-devel/patch/20201203215859.2719888-4-palmer@dabbelt.com/

I largely agree with the feedback given on this attempt.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-24 10:12       ` Sagi Grimberg
@ 2022-03-01 23:24         ` Khazhy Kumykov
  2022-03-02 16:16         ` Mike Christie
  1 sibling, 0 replies; 54+ messages in thread
From: Khazhy Kumykov @ 2022-03-01 23:24 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Gabriel Krisman Bertazi, Hannes Reinecke, lsf-pc, linux-block

[-- Attachment #1: Type: text/plain, Size: 2809 bytes --]

On Thu, Feb 24, 2022 at 2:12 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >>> Actually, I'd rather have something like an 'inverse io_uring', where
> >>> an application creates a memory region separated into several 'ring'
> >>> for submission and completion.
> >>> Then the kernel could write/map the incoming data onto the rings, and
> >>> application can read from there.
> >>> Maybe it'll be worthwhile to look at virtio here.

Another advantage that comes to mind, especially the userspace target
needs to operate on the data anyways, is if we're forwarding to
io_uring based networking, or user based networking, reading a direct
mapping may be quicker than opening a file & reading it.

(I think an idea for parallel/out-of-order processing was
fd-per-request, if this is too much overhead / too limited due to fd
count, perhaps mapping is just the way to go)

> >>
> >> There is lio loopback backed by tcmu... I'm assuming that nvmet can
> >> hook into the same/similar interface. nvmet is pretty lean, and we
> >> can probably help tcmu/equivalent scale better if that is a concern...
> >
> > Sagi,
> >
> > I looked at tcmu prior to starting this work.  Other than the tcmu
> > overhead, one concern was the complexity of a scsi device interface
> > versus sending block requests to userspace.
>
> The complexity is understandable, though it can be viewed as a
> capability as well. Note I do not have any desire to promote tcmu here,
> just trying to understand if we need a brand new interface rather than
> making the existing one better.
>
> > What would be the advantage of doing it as a nvme target over delivering
> > directly to userspace as a block driver?
>
> Well, for starters you gain the features and tools that are extensively
> used with nvme. Plus you get the ecosystem support (development,
> features, capabilities and testing). There are clear advantages of
> plugging into an established ecosystem.

I recall when discussing an nvme style approach, another advantage was
the nvme target impl could be re-used if exposing the same interface
via this user space block device interface, or e.g. presenting as nvme
device to a VM, etc.

That said, for a device that just needs to support read/write &
forward data to some userspace networked storage, the overhead in
implementation and interface should be considered. If there's a rich
set of tooling here already to create a custom nvme target, perhaps
that could be leveraged?

Maybe there's a middle ground? If we do a "inverse io_uring" -
forwarding the block interface into userspace, and allowing those who
choose to implement passthrough commands (to get the extra
"capability")? Providing an efficient mechanism to forward block
requests to userspace, then allowing the target to implement their
favorite flavor.

Khazhy

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3996 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-24 10:12       ` Sagi Grimberg
  2022-03-01 23:24         ` Khazhy Kumykov
@ 2022-03-02 16:16         ` Mike Christie
  2022-03-13 21:15           ` Sagi Grimberg
  1 sibling, 1 reply; 54+ messages in thread
From: Mike Christie @ 2022-03-02 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block

On 2/24/22 4:12 AM, Sagi Grimberg wrote:
> 
>>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>>> an application creates a memory region separated into several 'ring'
>>>> for submission and completion.
>>>> Then the kernel could write/map the incoming data onto the rings, and
>>>> application can read from there.
>>>> Maybe it'll be worthwhile to look at virtio here.
>>>
>>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>>> hook into the same/similar interface. nvmet is pretty lean, and we
>>> can probably help tcmu/equivalent scale better if that is a concern...
>>
>> Sagi,
>>
>> I looked at tcmu prior to starting this work.  Other than the tcmu
>> overhead, one concern was the complexity of a scsi device interface
>> versus sending block requests to userspace.
> 
> The complexity is understandable, though it can be viewed as a
> capability as well. Note I do not have any desire to promote tcmu here,
> just trying to understand if we need a brand new interface rather than
> making the existing one better.

Ccing tcmu maintainer Bodo.

We don't want to re-use tcmu's interface.

Bodo has been looking into on a new interface to avoid issues tcmu has
and to improve performance. If it's allowed to add a tcmu like backend to
nvmet then it would be great because lio was not really made with mq and
perf in mind so it already starts with issues. I just started doing the
basics like removing locks from the main lio IO path but it seems like
there is just so much work.

> 
>> What would be the advantage of doing it as a nvme target over delivering
>> directly to userspace as a block driver?
> 
> Well, for starters you gain the features and tools that are extensively
> used with nvme. Plus you get the ecosystem support (development,
> features, capabilities and testing). There are clear advantages of
> plugging into an established ecosystem.

Yeah, tcmu has been really nice to export storage that people for whatever
reason didn't want in the kernel. We got the benefits you described where
distros have the required tools and their support teams have experience, etc.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2022-02-23  5:57 ` Gao Xiang
@ 2022-03-02 16:52 ` Mike Christie
  2022-03-03  7:09   ` Hannes Reinecke
  2022-03-05  7:29 ` Dongsheng Yang
  4 siblings, 1 reply; 54+ messages in thread
From: Mike Christie @ 2022-03-02 16:52 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 2/21/22 1:59 PM, Gabriel Krisman Bertazi wrote:
> I'd like to discuss an interface to implement user space block devices,
> while avoiding local network NBD solutions.  There has been reiterated

Besides the tcmu approach, I've also worked on the local nbd based
solution like here:

https://github.com/gluster/nbd-runner

Have you looked into a modern take that uses io_uring's socket features
with the zero copy work that's being worked on for it? If so, what are
the issues you have hit with that? Was it mostly issues with the zero
copy part of it?


> interest in the topic, both from researchers [1] and from the community,
> including a proposed session in LSFMM2018 [2] (though I don't think it
> happened).
> 
> I've been working on top of the Google iblock implementation to find
> something upstreamable and would like to present my design and gather
> feedback on some points, in particular zero-copy and overall user space
> interface.
> 
> The design I'm pending towards uses special fds opened by the driver to
> transfer data to/from the block driver, preferably through direct
> splicing as much as possible, to keep data only in kernel space.  This
> is because, in my use case, the driver usually only manipulates
> metadata, while data is forwarded directly through the network, or
> similar. It would be neat if we can leverage the existing
> splice/copy_file_range syscalls such that we don't ever need to bring
> disk data to user space, if we can avoid it.  I've also experimented
> with regular pipes, But I found no way around keeping a lot of pipes
> opened, one for each possible command 'slot'.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22  6:57 ` Hannes Reinecke
  2022-02-22 14:46   ` Sagi Grimberg
  2022-02-22 18:05   ` Bart Van Assche
@ 2022-03-02 23:04   ` Gabriel Krisman Bertazi
  2022-03-03  7:17     ` Hannes Reinecke
  2022-03-27 16:35   ` Ming Lei
  3 siblings, 1 reply; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-02 23:04 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: lsf-pc, linux-block

Hannes Reinecke <hare@suse.de> writes:

> Actually, I'd rather have something like an 'inverse io_uring', where an
> application creates a memory region separated into several 'ring' for 
> submission and completion.
> Then the kernel could write/map the incoming data onto the rings, and
> application can read from there.
> Maybe it'll be worthwhile to look at virtio here.

>
> But in either case, using fds or pipes for commands doesn't really
> scale, as the number of fds is inherently limited. And using fds 
> restricts you to serial processing (as you can read only sequentially
> from a fd); with mmap() you'll get a greater flexibility and the option 
> of parallel processing.

Hannes,

I'm not trying to push an fd implementation, and seems clear that
io_uring is the right way to go.  But isn't fd virtually unlimited,
as they can be extended up to procfs's file-max for a specific user?

Also, I was proposing one fd per IO operation, so each request data is
independent.  But, even within each IO, it doesn't necessarily need to
be sequential.  Isn't pread/pwrite parallel on the same fd?

A FD-based implementation could also use existing io_uring operations,
IORING_OP_READV/IORING_OP_WRITEV, against the file descriptor.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-02 16:52 ` Mike Christie
@ 2022-03-03  7:09   ` Hannes Reinecke
  2022-03-14 17:04     ` Mike Christie
  0 siblings, 1 reply; 54+ messages in thread
From: Hannes Reinecke @ 2022-03-03  7:09 UTC (permalink / raw)
  To: Mike Christie, Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 3/2/22 17:52, Mike Christie wrote:
> On 2/21/22 1:59 PM, Gabriel Krisman Bertazi wrote:
>> I'd like to discuss an interface to implement user space block devices,
>> while avoiding local network NBD solutions.  There has been reiterated
> 
> Besides the tcmu approach, I've also worked on the local nbd based
> solution like here:
> 
> https://github.com/gluster/nbd-runner
> 
> Have you looked into a modern take that uses io_uring's socket features
> with the zero copy work that's being worked on for it? If so, what are
> the issues you have hit with that? Was it mostly issues with the zero
> copy part of it?
> 
> 
Problem is that we'd need an _inverse_ io_uring interface.
The current io_uring interface writes submission queue elements,
and waits for completion queue elements.

For this use-case we'd need to convert it to wait for submission queue 
elements, and write completion queue elements.

IE completely invert the operation.

Not sure if we can convert it _that_ easily ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-02 23:04   ` Gabriel Krisman Bertazi
@ 2022-03-03  7:17     ` Hannes Reinecke
  0 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2022-03-03  7:17 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: lsf-pc, linux-block

On 3/3/22 00:04, Gabriel Krisman Bertazi wrote:
> Hannes Reinecke <hare@suse.de> writes:
> 
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'ring' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> 
>>
>> But in either case, using fds or pipes for commands doesn't really
>> scale, as the number of fds is inherently limited. And using fds
>> restricts you to serial processing (as you can read only sequentially
>> from a fd); with mmap() you'll get a greater flexibility and the option
>> of parallel processing.
> 
> Hannes,
> 
> I'm not trying to push an fd implementation, and seems clear that
> io_uring is the right way to go.  But isn't fd virtually unlimited,
> as they can be extended up to procfs's file-max for a specific user?
> 
In principle, yes. But in practice there will _always_ be a limit (even 
if you raise the limit you _still_ have a limit), as essentially the 
number of files is the size of the fd array table.
So you can only have a very large number of fds, but not an infinite 
number of fds.

And experience with multipath-tools have taught us that you hit the fd 
limit far more often than you thought; we've seen installations where we 
had to increase the fd limit beyond 4096
(Mind you, multipath-tools is using only two fds per device).

So on those installations we'll be running out of fds pretty fast if we 
start using one fd per I/O.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
                   ` (3 preceding siblings ...)
  2022-03-02 16:52 ` Mike Christie
@ 2022-03-05  7:29 ` Dongsheng Yang
  4 siblings, 0 replies; 54+ messages in thread
From: Dongsheng Yang @ 2022-03-05  7:29 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

Hi Gabriel,
     There is a project implement userspace block device: 
https://github.com/ubbd/ubbd

ubbd means Userspace Backend Block Device (as ubd was taken by UML block 
device.)

It has a kernel module ubbd.ko, userspace daemon ubbdd, and a tool ubbdadm.
(1) ubbd.ko depends on uio, providing cmd ring and complete ring.
(2) ubbdd implement different backends, currently file and rbd is done.
and ubbdd is designed for online restart. That means you can restart 
ubbdd with IO inflight.
(3) ubbdadm is an admin tool, providing map, unmap and config.

This project is under developing stage, but it is already tested by 
blktests and xfstests.

Also, there is some testing for rbd backend to compare performance with 
librbd and krbd.

You can find more information about it in github, if you are interested 
in it.

Thanx

在 2022/2/22 星期二 上午 3:59, Gabriel Krisman Bertazi 写道:
> I'd like to discuss an interface to implement user space block devices,
> while avoiding local network NBD solutions.  There has been reiterated
> interest in the topic, both from researchers [1] and from the community,
> including a proposed session in LSFMM2018 [2] (though I don't think it
> happened).
> 
> I've been working on top of the Google iblock implementation to find
> something upstreamable and would like to present my design and gather
> feedback on some points, in particular zero-copy and overall user space
> interface.
> 
> The design I'm pending towards uses special fds opened by the driver to
> transfer data to/from the block driver, preferably through direct
> splicing as much as possible, to keep data only in kernel space.  This
> is because, in my use case, the driver usually only manipulates
> metadata, while data is forwarded directly through the network, or
> similar. It would be neat if we can leverage the existing
> splice/copy_file_range syscalls such that we don't ever need to bring
> disk data to user space, if we can avoid it.  I've also experimented
> with regular pipes, But I found no way around keeping a lot of pipes
> opened, one for each possible command 'slot'.
> 
> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-02 16:16         ` Mike Christie
@ 2022-03-13 21:15           ` Sagi Grimberg
  2022-03-14 17:12             ` Mike Christie
  2022-03-14 19:21             ` Bart Van Assche
  0 siblings, 2 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-13 21:15 UTC (permalink / raw)
  To: Mike Christie, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block


>>>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>>>> an application creates a memory region separated into several 'ring'
>>>>> for submission and completion.
>>>>> Then the kernel could write/map the incoming data onto the rings, and
>>>>> application can read from there.
>>>>> Maybe it'll be worthwhile to look at virtio here.
>>>>
>>>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>>>> hook into the same/similar interface. nvmet is pretty lean, and we
>>>> can probably help tcmu/equivalent scale better if that is a concern...
>>>
>>> Sagi,
>>>
>>> I looked at tcmu prior to starting this work.  Other than the tcmu
>>> overhead, one concern was the complexity of a scsi device interface
>>> versus sending block requests to userspace.
>>
>> The complexity is understandable, though it can be viewed as a
>> capability as well. Note I do not have any desire to promote tcmu here,
>> just trying to understand if we need a brand new interface rather than
>> making the existing one better.
> 
> Ccing tcmu maintainer Bodo.
> 
> We don't want to re-use tcmu's interface.
> 
> Bodo has been looking into on a new interface to avoid issues tcmu has
> and to improve performance. If it's allowed to add a tcmu like backend to
> nvmet then it would be great because lio was not really made with mq and
> perf in mind so it already starts with issues. I just started doing the
> basics like removing locks from the main lio IO path but it seems like
> there is just so much work.

Good to know...

So I hear there is a desire to do this. So I think we should list the
use-cases for this first because that would lead to different design
choices.. For example one use-case is just to send read/write/flush
to userspace, another may want to passthru nvme commands to userspace
and there may be others...

>>> What would be the advantage of doing it as a nvme target over delivering
>>> directly to userspace as a block driver?
>>
>> Well, for starters you gain the features and tools that are extensively
>> used with nvme. Plus you get the ecosystem support (development,
>> features, capabilities and testing). There are clear advantages of
>> plugging into an established ecosystem.
> 
> Yeah, tcmu has been really nice to export storage that people for whatever
> reason didn't want in the kernel. We got the benefits you described where
> distros have the required tools and their support teams have experience, etc.

No surprise here...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-03  7:09   ` Hannes Reinecke
@ 2022-03-14 17:04     ` Mike Christie
  2022-03-15  6:45       ` Hannes Reinecke
  0 siblings, 1 reply; 54+ messages in thread
From: Mike Christie @ 2022-03-14 17:04 UTC (permalink / raw)
  To: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 3/3/22 1:09 AM, Hannes Reinecke wrote:
> On 3/2/22 17:52, Mike Christie wrote:
>> On 2/21/22 1:59 PM, Gabriel Krisman Bertazi wrote:
>>> I'd like to discuss an interface to implement user space block devices,
>>> while avoiding local network NBD solutions.  There has been reiterated
>>
>> Besides the tcmu approach, I've also worked on the local nbd based
>> solution like here:
>>
>> https://urldefense.com/v3/__https://github.com/gluster/nbd-runner__;!!ACWV5N9M2RV99hQ!YY39rbV9MpaNUtr7ElzgcG1TyPznVEt1yppLwAGkq32-Fw9rQkqB6FzcaHiwIdgXp00K$
>> Have you looked into a modern take that uses io_uring's socket features
>> with the zero copy work that's being worked on for it? If so, what are
>> the issues you have hit with that? Was it mostly issues with the zero
>> copy part of it?
>>
>>
> Problem is that we'd need an _inverse_ io_uring interface.
> The current io_uring interface writes submission queue elements,
> and waits for completion queue elements.

I'm not sure what you meant here.

io_uring can do recvs right? So userspace nbd would do
IORING_OP_RECVMSG to wait for drivers/block/nbd.c to send userspace
cmds via the local socket. Userspace nbd would do IORING_OP_SENDMSG
to send drivers/block/nbd.c the cmd response.

drivers/block/nbd doesn't know/care what userspace did. It's just
reading/writing from/to the socket.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-13 21:15           ` Sagi Grimberg
@ 2022-03-14 17:12             ` Mike Christie
  2022-03-15  8:03               ` Sagi Grimberg
  2022-03-14 19:21             ` Bart Van Assche
  1 sibling, 1 reply; 54+ messages in thread
From: Mike Christie @ 2022-03-14 17:12 UTC (permalink / raw)
  To: Sagi Grimberg, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block

On 3/13/22 4:15 PM, Sagi Grimberg wrote:
> 
>>>>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>>>>> an application creates a memory region separated into several 'ring'
>>>>>> for submission and completion.
>>>>>> Then the kernel could write/map the incoming data onto the rings, and
>>>>>> application can read from there.
>>>>>> Maybe it'll be worthwhile to look at virtio here.
>>>>>
>>>>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>>>>> hook into the same/similar interface. nvmet is pretty lean, and we
>>>>> can probably help tcmu/equivalent scale better if that is a concern...
>>>>
>>>> Sagi,
>>>>
>>>> I looked at tcmu prior to starting this work.  Other than the tcmu
>>>> overhead, one concern was the complexity of a scsi device interface
>>>> versus sending block requests to userspace.
>>>
>>> The complexity is understandable, though it can be viewed as a
>>> capability as well. Note I do not have any desire to promote tcmu here,
>>> just trying to understand if we need a brand new interface rather than
>>> making the existing one better.
>>
>> Ccing tcmu maintainer Bodo.
>>
>> We don't want to re-use tcmu's interface.
>>
>> Bodo has been looking into on a new interface to avoid issues tcmu has
>> and to improve performance. If it's allowed to add a tcmu like backend to
>> nvmet then it would be great because lio was not really made with mq and
>> perf in mind so it already starts with issues. I just started doing the
>> basics like removing locks from the main lio IO path but it seems like
>> there is just so much work.
> 
> Good to know...
> 
> So I hear there is a desire to do this. So I think we should list the
> use-cases for this first because that would lead to different design
> choices.. For example one use-case is just to send read/write/flush
> to userspace, another may want to passthru nvme commands to userspace
> and there may be others...

We might want to discuss at OLS or start a new thread.

Based on work we did for tcmu and local nbd, the issue is how complex
can handling nvme commands in the kernel get? If you want to run nvmet
on a single node then you can pass just read/write/flush to userspace
and it's not really an issue.

For tcmu/nbd the issue we are hitting is how to handle SCSI PGRs when
you are running lio on multiple nodes and the nodes export the same
LU to the same initiators. You can do it all in kernel like Bart did
for SCST and DLM
(https://blog.linuxplumbersconf.org/2015/ocw/sessions/2691.html).
However, for lio and tcmu some users didn't want pacemaker/corosync and
instead wanted to use their project's clustering or message passing
So pushing to user space is nice for these commands.

There are/were also issues with things like ALUA commands and handling
failover across nodes but I think nvme ANA avoids them. Like there
is nothing in nvme ANA like the SET_TARGET_PORT_GROUPS command which can
set the state of what would be remote ports right?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-13 21:15           ` Sagi Grimberg
  2022-03-14 17:12             ` Mike Christie
@ 2022-03-14 19:21             ` Bart Van Assche
  2022-03-15  6:52               ` Hannes Reinecke
  2022-03-15  8:04               ` Sagi Grimberg
  1 sibling, 2 replies; 54+ messages in thread
From: Bart Van Assche @ 2022-03-14 19:21 UTC (permalink / raw)
  To: Sagi Grimberg, Mike Christie, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block

On 3/13/22 14:15, Sagi Grimberg wrote:
>
>> We don't want to re-use tcmu's interface.
>>
>> Bodo has been looking into on a new interface to avoid issues tcmu has
>> and to improve performance. If it's allowed to add a tcmu like backend to
>> nvmet then it would be great because lio was not really made with mq and
>> perf in mind so it already starts with issues. I just started doing the
>> basics like removing locks from the main lio IO path but it seems like
>> there is just so much work.
>
> Good to know...
>
> So I hear there is a desire to do this. So I think we should list the
> use-cases for this first because that would lead to different design
> choices.. For example one use-case is just to send read/write/flush
> to userspace, another may want to passthru nvme commands to userspace
> and there may be others...

(resending my reply without truncating the Cc-list)

Hi Sagi,

Haven't these use cases already been mentioned in the email at the start 
of this thread? The use cases I am aware of are implementing 
cloud-specific block storage functionality and also block storage in 
user space for Android. Having to parse NVMe commands and PRP or SGL 
lists would be an unnecessary source of complexity and overhead for 
these use cases. My understanding is that what is needed for these use 
cases is something that is close to the block layer request interface 
(REQ_OP_* + request flags + data buffer).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-14 17:04     ` Mike Christie
@ 2022-03-15  6:45       ` Hannes Reinecke
  0 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2022-03-15  6:45 UTC (permalink / raw)
  To: Mike Christie, Gabriel Krisman Bertazi, lsf-pc; +Cc: linux-block

On 3/14/22 18:04, Mike Christie wrote:
> On 3/3/22 1:09 AM, Hannes Reinecke wrote:
>> On 3/2/22 17:52, Mike Christie wrote:
>>> On 2/21/22 1:59 PM, Gabriel Krisman Bertazi wrote:
>>>> I'd like to discuss an interface to implement user space block devices,
>>>> while avoiding local network NBD solutions.  There has been reiterated
>>>
>>> Besides the tcmu approach, I've also worked on the local nbd based
>>> solution like here:
>>>
>>> https://urldefense.com/v3/__https://github.com/gluster/nbd-runner__;!!ACWV5N9M2RV99hQ!YY39rbV9MpaNUtr7ElzgcG1TyPznVEt1yppLwAGkq32-Fw9rQkqB6FzcaHiwIdgXp00K$
>>> Have you looked into a modern take that uses io_uring's socket features
>>> with the zero copy work that's being worked on for it? If so, what are
>>> the issues you have hit with that? Was it mostly issues with the zero
>>> copy part of it?
>>>
>>>
>> Problem is that we'd need an _inverse_ io_uring interface.
>> The current io_uring interface writes submission queue elements,
>> and waits for completion queue elements.
> 
> I'm not sure what you meant here.
> 
> io_uring can do recvs right? So userspace nbd would do
> IORING_OP_RECVMSG to wait for drivers/block/nbd.c to send userspace
> cmds via the local socket. Userspace nbd would do IORING_OP_SENDMSG
> to send drivers/block/nbd.c the cmd response.
> 
> drivers/block/nbd doesn't know/care what userspace did. It's just
> reading/writing from/to the socket.

I was talking about the internal layout of io_uring.
It sets up submission and completion rings, writes the command & data
into the submission rings, and waits for the corresponding completion
to show up on the completion rings.

A userspace block driver would need the inverse; waiting for submissions
to show up in the submission rings, and writing completions into the 
completion ring.

recvmsg feels awkward here as one would need to write a recvmsg op into 
the submission ring, get the completion, handle the I/O, write a sendmsg 
op, wait for the completion.
IE we would double the number of operations.

Sure it's doable, and admittedly doesn't need (much) modifications for 
io_uring. But still feels like a waste, and we certainly can't reach max 
performance with that setup.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-14 19:21             ` Bart Van Assche
@ 2022-03-15  6:52               ` Hannes Reinecke
  2022-03-15  8:08                 ` Sagi Grimberg
  2022-03-15  8:04               ` Sagi Grimberg
  1 sibling, 1 reply; 54+ messages in thread
From: Hannes Reinecke @ 2022-03-15  6:52 UTC (permalink / raw)
  To: Bart Van Assche, Sagi Grimberg, Mike Christie, Gabriel Krisman Bertazi
  Cc: lsf-pc, linux-block

On 3/14/22 20:21, Bart Van Assche wrote:
> On 3/13/22 14:15, Sagi Grimberg wrote:
>>
>>> We don't want to re-use tcmu's interface.
>>>
>>> Bodo has been looking into on a new interface to avoid issues tcmu has
>>> and to improve performance. If it's allowed to add a tcmu like 
>>> backend to
>>> nvmet then it would be great because lio was not really made with mq and
>>> perf in mind so it already starts with issues. I just started doing the
>>> basics like removing locks from the main lio IO path but it seems like
>>> there is just so much work.
>>
>> Good to know...
>>
>> So I hear there is a desire to do this. So I think we should list the
>> use-cases for this first because that would lead to different design
>> choices.. For example one use-case is just to send read/write/flush
>> to userspace, another may want to passthru nvme commands to userspace
>> and there may be others...
> 
> (resending my reply without truncating the Cc-list)
> 
> Hi Sagi,
> 
> Haven't these use cases already been mentioned in the email at the start 
> of this thread? The use cases I am aware of are implementing 
> cloud-specific block storage functionality and also block storage in 
> user space for Android. Having to parse NVMe commands and PRP or SGL 
> lists would be an unnecessary source of complexity and overhead for 
> these use cases. My understanding is that what is needed for these use 
> cases is something that is close to the block layer request interface 
> (REQ_OP_* + request flags + data buffer).
> 

Curiously, the former was exactly my idea. I was thinking about having a 
simple nvmet userspace driver where all the transport 'magic' was 
handled in the nvmet driver, and just the NVMe SQEs passed on to the 
userland driver. The userland driver would then send the CQEs back to 
the driver.
With that the kernel driver becomes extremely simple, and would allow 
userspace to do all the magic it wants. More to the point, one could 
implement all sorts of fancy features which are out of scope for the 
current nvmet implementation.
Which is why I've been talking about 'inverse' io_uring; the userland 
driver will have to wait for SQEs, and write CQEs back to the driver.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-14 17:12             ` Mike Christie
@ 2022-03-15  8:03               ` Sagi Grimberg
  0 siblings, 0 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-15  8:03 UTC (permalink / raw)
  To: Mike Christie, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block



On 3/14/22 19:12, Mike Christie wrote:
> On 3/13/22 4:15 PM, Sagi Grimberg wrote:
>>
>>>>>>> Actually, I'd rather have something like an 'inverse io_uring', where
>>>>>>> an application creates a memory region separated into several 'ring'
>>>>>>> for submission and completion.
>>>>>>> Then the kernel could write/map the incoming data onto the rings, and
>>>>>>> application can read from there.
>>>>>>> Maybe it'll be worthwhile to look at virtio here.
>>>>>>
>>>>>> There is lio loopback backed by tcmu... I'm assuming that nvmet can
>>>>>> hook into the same/similar interface. nvmet is pretty lean, and we
>>>>>> can probably help tcmu/equivalent scale better if that is a concern...
>>>>>
>>>>> Sagi,
>>>>>
>>>>> I looked at tcmu prior to starting this work.  Other than the tcmu
>>>>> overhead, one concern was the complexity of a scsi device interface
>>>>> versus sending block requests to userspace.
>>>>
>>>> The complexity is understandable, though it can be viewed as a
>>>> capability as well. Note I do not have any desire to promote tcmu here,
>>>> just trying to understand if we need a brand new interface rather than
>>>> making the existing one better.
>>>
>>> Ccing tcmu maintainer Bodo.
>>>
>>> We don't want to re-use tcmu's interface.
>>>
>>> Bodo has been looking into on a new interface to avoid issues tcmu has
>>> and to improve performance. If it's allowed to add a tcmu like backend to
>>> nvmet then it would be great because lio was not really made with mq and
>>> perf in mind so it already starts with issues. I just started doing the
>>> basics like removing locks from the main lio IO path but it seems like
>>> there is just so much work.
>>
>> Good to know...
>>
>> So I hear there is a desire to do this. So I think we should list the
>> use-cases for this first because that would lead to different design
>> choices.. For example one use-case is just to send read/write/flush
>> to userspace, another may want to passthru nvme commands to userspace
>> and there may be others...
> 
> We might want to discuss at OLS or start a new thread.
> 
> Based on work we did for tcmu and local nbd, the issue is how complex
> can handling nvme commands in the kernel get? If you want to run nvmet
> on a single node then you can pass just read/write/flush to userspace
> and it's not really an issue.

As I said, I can see other use-cases that may want raw nvme commands
in a backend userspace driver...

> 
> For tcmu/nbd the issue we are hitting is how to handle SCSI PGRs when
> you are running lio on multiple nodes and the nodes export the same
> LU to the same initiators. You can do it all in kernel like Bart did
> for SCST and DLM
> (https://blog.linuxplumbersconf.org/2015/ocw/sessions/2691.html).
> However, for lio and tcmu some users didn't want pacemaker/corosync and
> instead wanted to use their project's clustering or message passing
> So pushing to user space is nice for these commands.

For this use-case we'd probably want to scan the config knobs to see
that we have what's needed (I think we should have enough to enable this
use-case).

> 
> There are/were also issues with things like ALUA commands and handling
> failover across nodes but I think nvme ANA avoids them. Like there
> is nothing in nvme ANA like the SET_TARGET_PORT_GROUPS command which can
> set the state of what would be remote ports right?

Right.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-14 19:21             ` Bart Van Assche
  2022-03-15  6:52               ` Hannes Reinecke
@ 2022-03-15  8:04               ` Sagi Grimberg
  1 sibling, 0 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-15  8:04 UTC (permalink / raw)
  To: Bart Van Assche, Mike Christie, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block


>> So I hear there is a desire to do this. So I think we should list the
>> use-cases for this first because that would lead to different design
>> choices.. For example one use-case is just to send read/write/flush
>> to userspace, another may want to passthru nvme commands to userspace
>> and there may be others...
> 
> (resending my reply without truncating the Cc-list)
> 
> Hi Sagi,
> 
> Haven't these use cases already been mentioned in the email at the start 
> of this thread? The use cases I am aware of are implementing 
> cloud-specific block storage functionality and also block storage in 
> user space for Android. Having to parse NVMe commands and PRP or SGL 
> lists would be an unnecessary source of complexity and overhead for 
> these use cases. My understanding is that what is needed for these use 
> cases is something that is close to the block layer request interface 
> (REQ_OP_* + request flags + data buffer).

pasting my response here as well:

Well, I can absolutely think of a use-case that will want raw nvme
commands and leverage vendor specific opcodes for applications like
computational storage or what not...

I would say that all the complexity of handling nvme commands in
userspace would be handle in a properly layered core stack with
pluggable backends that can see a simplified interface if they want, or
see a full passthru command.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-15  6:52               ` Hannes Reinecke
@ 2022-03-15  8:08                 ` Sagi Grimberg
  2022-03-15  8:12                   ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-15  8:08 UTC (permalink / raw)
  To: Hannes Reinecke, Bart Van Assche, Mike Christie, Gabriel Krisman Bertazi
  Cc: lsf-pc, linux-block


>> Hi Sagi,
>>
>> Haven't these use cases already been mentioned in the email at the 
>> start of this thread? The use cases I am aware of are implementing 
>> cloud-specific block storage functionality and also block storage in 
>> user space for Android. Having to parse NVMe commands and PRP or SGL 
>> lists would be an unnecessary source of complexity and overhead for 
>> these use cases. My understanding is that what is needed for these use 
>> cases is something that is close to the block layer request interface 
>> (REQ_OP_* + request flags + data buffer).
>>
> 
> Curiously, the former was exactly my idea. I was thinking about having a 
> simple nvmet userspace driver where all the transport 'magic' was 
> handled in the nvmet driver, and just the NVMe SQEs passed on to the 
> userland driver. The userland driver would then send the CQEs back to 
> the driver.
> With that the kernel driver becomes extremely simple, and would allow 
> userspace to do all the magic it wants. More to the point, one could 
> implement all sorts of fancy features which are out of scope for the 
> current nvmet implementation.

My thinking is that this simplification can be done in a userland
core library with a simpler interface for backends to plug into (or
a richer interface if that is what the use-case warrants).

> Which is why I've been talking about 'inverse' io_uring; the userland 
> driver will have to wait for SQEs, and write CQEs back to the driver.

"inverse" io_uring is just a ring interface, tcmu has it as well, I'm
assuming you are talking about the scalability attributes of it...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-15  8:08                 ` Sagi Grimberg
@ 2022-03-15  8:12                   ` Christoph Hellwig
  2022-03-15  8:38                     ` Sagi Grimberg
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2022-03-15  8:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Bart Van Assche, Mike Christie,
	Gabriel Krisman Bertazi, lsf-pc, linux-block

FYI, I have absolutely no interest in supporting any userspace hooks
in nvmet.  If you want a userspace nvme implementation please use SPDK.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-15  8:12                   ` Christoph Hellwig
@ 2022-03-15  8:38                     ` Sagi Grimberg
  2022-03-15  8:42                       ` Christoph Hellwig
  2022-03-23 19:42                       ` Gabriel Krisman Bertazi
  0 siblings, 2 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-15  8:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hannes Reinecke, Bart Van Assche, Mike Christie,
	Gabriel Krisman Bertazi, lsf-pc, linux-block


> FYI, I have absolutely no interest in supporting any userspace hooks
> in nvmet.

Don't think we are discussing adding anything specific to nvmet, a
userspace backend will most likely sit behind a block device exported
via nvmet (at least from my perspective). Although I do see issues
with using the passthru interface...

> If you want a userspace nvme implementation please use SPDK.

The original use-case did not include nvmet, I may have stirred
the pot saying that we have nvmet loopback instead of a new kind
of device with a new set of tools.

I don't think that spdk meets even the original android use-case.

Not touching nvmet is fine, it just eliminates some of the possible
use-cases. Although personally I don't see a huge issue with adding
yet another backend to nvmet...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-15  8:38                     ` Sagi Grimberg
@ 2022-03-15  8:42                       ` Christoph Hellwig
  2022-03-23 19:42                       ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2022-03-15  8:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Hannes Reinecke, Bart Van Assche,
	Mike Christie, Gabriel Krisman Bertazi, lsf-pc, linux-block

On Tue, Mar 15, 2022 at 10:38:24AM +0200, Sagi Grimberg wrote:
> 
> > FYI, I have absolutely no interest in supporting any userspace hooks
> > in nvmet.
> 
> Don't think we are discussing adding anything specific to nvmet, a
> userspace backend will most likely sit behind a block device exported
> via nvmet (at least from my perspective). Although I do see issues
> with using the passthru interface...

Well, anything that is properly hidden behind the block device
infrastructure does not matter for nvmet.  But that interface does
not support passthrough.  Anyone who wants to handle raw nvme commands
in userspace should not use nvmet.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-15  8:38                     ` Sagi Grimberg
  2022-03-15  8:42                       ` Christoph Hellwig
@ 2022-03-23 19:42                       ` Gabriel Krisman Bertazi
  2022-03-24 17:05                         ` Sagi Grimberg
  1 sibling, 1 reply; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-23 19:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Hannes Reinecke, Bart Van Assche,
	Mike Christie, lsf-pc, linux-block

Sagi Grimberg <sagi@grimberg.me> writes:

>> FYI, I have absolutely no interest in supporting any userspace hooks
>> in nvmet.
>
> Don't think we are discussing adding anything specific to nvmet, a
> userspace backend will most likely sit behind a block device exported
> via nvmet (at least from my perspective). Although I do see issues
> with using the passthru interface...
>
>> If you want a userspace nvme implementation please use SPDK.
>
> The original use-case did not include nvmet, I may have stirred
> the pot saying that we have nvmet loopback instead of a new kind
> of device with a new set of tools.
>
> I don't think that spdk meets even the original android use-case.
>
> Not touching nvmet is fine, it just eliminates some of the possible
> use-cases. Although personally I don't see a huge issue with adding
> yet another backend to nvmet...

After discussing with google for the r/w/flush use-case (cloud, not
android), they are interested in avoiding the source of complexity that
arises from implementing the NVMe protocol in the interface.  Even if it
is hidden behind a userspace library, it means converting block
rq->nvme->block rq, which might have a performance impact?

From your previous message, I think we can move forward with dissociating
the original use case from nvme passthrough, and have the userspace hook
as a block driver?

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-23 19:42                       ` Gabriel Krisman Bertazi
@ 2022-03-24 17:05                         ` Sagi Grimberg
  0 siblings, 0 replies; 54+ messages in thread
From: Sagi Grimberg @ 2022-03-24 17:05 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Christoph Hellwig, Hannes Reinecke, Bart Van Assche,
	Mike Christie, lsf-pc, linux-block


>>> FYI, I have absolutely no interest in supporting any userspace hooks
>>> in nvmet.
>>
>> Don't think we are discussing adding anything specific to nvmet, a
>> userspace backend will most likely sit behind a block device exported
>> via nvmet (at least from my perspective). Although I do see issues
>> with using the passthru interface...
>>
>>> If you want a userspace nvme implementation please use SPDK.
>>
>> The original use-case did not include nvmet, I may have stirred
>> the pot saying that we have nvmet loopback instead of a new kind
>> of device with a new set of tools.
>>
>> I don't think that spdk meets even the original android use-case.
>>
>> Not touching nvmet is fine, it just eliminates some of the possible
>> use-cases. Although personally I don't see a huge issue with adding
>> yet another backend to nvmet...
> 
> After discussing with google for the r/w/flush use-case (cloud, not
> android), they are interested in avoiding the source of complexity that
> arises from implementing the NVMe protocol in the interface.  Even if it
> is hidden behind a userspace library, it means converting block
> rq->nvme->block rq, which might have a performance impact?
> 
>  From your previous message, I think we can move forward with dissociating
> the original use case from nvme passthrough, and have the userspace hook
> as a block driver?

Yes, there is no desire to do that.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-22  6:57 ` Hannes Reinecke
                     ` (2 preceding siblings ...)
  2022-03-02 23:04   ` Gabriel Krisman Bertazi
@ 2022-03-27 16:35   ` Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
                       ` (3 more replies)
  3 siblings, 4 replies; 54+ messages in thread
From: Ming Lei @ 2022-03-27 16:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
> > I'd like to discuss an interface to implement user space block devices,
> > while avoiding local network NBD solutions.  There has been reiterated
> > interest in the topic, both from researchers [1] and from the community,
> > including a proposed session in LSFMM2018 [2] (though I don't think it
> > happened).
> > 
> > I've been working on top of the Google iblock implementation to find
> > something upstreamable and would like to present my design and gather
> > feedback on some points, in particular zero-copy and overall user space
> > interface.
> > 
> > The design I'm pending towards uses special fds opened by the driver to
> > transfer data to/from the block driver, preferably through direct
> > splicing as much as possible, to keep data only in kernel space.  This
> > is because, in my use case, the driver usually only manipulates
> > metadata, while data is forwarded directly through the network, or
> > similar. It would be neat if we can leverage the existing
> > splice/copy_file_range syscalls such that we don't ever need to bring
> > disk data to user space, if we can avoid it.  I've also experimented
> > with regular pipes, But I found no way around keeping a lot of pipes
> > opened, one for each possible command 'slot'.
> > 
> > [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > 
> Actually, I'd rather have something like an 'inverse io_uring', where an
> application creates a memory region separated into several 'ring' for
> submission and completion.
> Then the kernel could write/map the incoming data onto the rings, and
> application can read from there.
> Maybe it'll be worthwhile to look at virtio here.

IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
does cover this case, the userspace part can submit SQEs beforehand
for getting notification of each incoming io request from kernel driver,
then after one io request is queued to the driver, the driver can
queue a CQE for the previous submitted SQE. Recent posted patch of
IORING_OP_URING_CMD[1] is perfect for such purpose.

I have written one such userspace block driver recently, and [2] is the
kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
Both the two parts look quite simple, but still in very early stage, so
far only ubd-loop and ubd-null targets are implemented in [3]. Not only
the io command communication channel is done via IORING_OP_URING_CMD, but
also IO handling for ubd-loop is implemented via plain io_uring too.

It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
on the ubd block device compared with same xfstests on underlying disk, and
my simple performance test on VM shows the result isn't worse than kernel loop
driver with dio, or even much better on some test situations.

Wrt. this userspace block driver things, I am more interested in the following
sub-topics:

1) zero copy
- the ubd driver[2] needs one data copy: for WRITE request, copy pages
  in io request to userspace buffer before handling the WRITE IO by ubdsrv;
  for READ request, the reverse copy is done after READ request is
  handled by ubdsrv

- I tried to apply zero copy via remap_pfn_range() for avoiding this
  data copy, but looks it can't work for ubd driver, since pages in the
  remapped vm area can't be retrieved by get_user_pages_*() which is called in
  direct io code path

- recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
  tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
  it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
  vm_insert_pages may work, but anonymous pages can not be remapped by
  vm_insert_pages.

- here the requirement is to remap either anonymous pages or page cache
  pages into userspace vm, and the mapping/unmapping can be done for
  each IO runtime. Is this requirement reasonable? If yes, is there any
  easy way to implement it in kernel?

2) batching queueing io_uring CQEs

- for ubd driver, batching is very sensitive to performance per my
  observation, if we can run batch queueing IORING_OP_URING_CMD CQEs,
  ubd_queue_rqs() can be wirted to the batching CQEs, then the whole batch
  only takes one io_uring_enter().

- not digging into io_uring code for this interface yet, but looks not
  see such interface

3) requirement on userspace block driver
- exact requirements from user viewpoint

4) apply eBPF in userspace block driver
- it is one open topic, still not have specific or exact idea yet,

- is there chance to apply ebpf for mapping ubd io into its target handling
for avoiding data copy and remapping cost for zero copy?

I am happy to join the virtual discussion on lsf/mm if there is and it
is possible.

[1] https://lore.kernel.org/linux-block/20220308152105.309618-1-joshi.k@samsung.com/#r
[2] https://github.com/ming1/linux/tree/v5.17-ubd-dev
[3] https://github.com/ming1/ubdsrv
[4] https://lore.kernel.org/linux-block/abbe51c4-873f-e96e-d421-85906689a55a@gmail.com/#r

Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` Ming Lei
@ 2022-03-28  5:47     ` Kanchan Joshi
  2022-03-28  5:48     ` Hannes Reinecke
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Kanchan Joshi @ 2022-03-28  5:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Xiaoguang Wang, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3497 bytes --]

On Mon, Mar 28, 2022 at 12:35:33AM +0800, Ming Lei wrote:
>On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>> > I'd like to discuss an interface to implement user space block devices,
>> > while avoiding local network NBD solutions.  There has been reiterated
>> > interest in the topic, both from researchers [1] and from the community,
>> > including a proposed session in LSFMM2018 [2] (though I don't think it
>> > happened).
>> >
>> > I've been working on top of the Google iblock implementation to find
>> > something upstreamable and would like to present my design and gather
>> > feedback on some points, in particular zero-copy and overall user space
>> > interface.
>> >
>> > The design I'm pending towards uses special fds opened by the driver to
>> > transfer data to/from the block driver, preferably through direct
>> > splicing as much as possible, to keep data only in kernel space.  This
>> > is because, in my use case, the driver usually only manipulates
>> > metadata, while data is forwarded directly through the network, or
>> > similar. It would be neat if we can leverage the existing
>> > splice/copy_file_range syscalls such that we don't ever need to bring
>> > disk data to user space, if we can avoid it.  I've also experimented
>> > with regular pipes, But I found no way around keeping a lot of pipes
>> > opened, one for each possible command 'slot'.
>> >
>> > [1] https://protect2.fireeye.com/v1/url?k=894d9ec4-e83076bc-894c158b-74fe485fffb1-3de06c94a9e9abfa&q=1&e=40f886a9-b53a-42b0-8e68-c94bc3813a9c&u=https%3A%2F%2Fdl.acm.org%2Fdoi%2F10.1145%2F3456727.3463768
>> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>> >
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'ring' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
>
>IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
>does cover this case, the userspace part can submit SQEs beforehand
>for getting notification of each incoming io request from kernel driver,
>then after one io request is queued to the driver, the driver can
>queue a CQE for the previous submitted SQE. Recent posted patch of
>IORING_OP_URING_CMD[1] is perfect for such purpose.
I had added that as one of the potential usecases to discuss for
uring-cmd:
https://lore.kernel.org/linux-block/20220228092511.458285-1-joshi.k@samsung.com/
And your email is already bringing lot of clarity on this.

>I have written one such userspace block driver recently, and [2] is the
>kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
>Both the two parts look quite simple, but still in very early stage, so
>far only ubd-loop and ubd-null targets are implemented in [3]. Not only
>the io command communication channel is done via IORING_OP_URING_CMD, but
>also IO handling for ubd-loop is implemented via plain io_uring too.
>
>It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
>on the ubd block device compared with same xfstests on underlying disk, and
>my simple performance test on VM shows the result isn't worse than kernel loop
>driver with dio, or even much better on some test situations.
Added this in my to-be-read list. Thanks for sharing.




[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
@ 2022-03-28  5:48     ` Hannes Reinecke
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
  2022-04-08  6:52     ` Xiaoguang Wang
  3 siblings, 0 replies; 54+ messages in thread
From: Hannes Reinecke @ 2022-03-28  5:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On 3/27/22 18:35, Ming Lei wrote:
> On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>>> I'd like to discuss an interface to implement user space block devices,
>>> while avoiding local network NBD solutions.  There has been reiterated
>>> interest in the topic, both from researchers [1] and from the community,
>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>> happened).
>>>
>>> I've been working on top of the Google iblock implementation to find
>>> something upstreamable and would like to present my design and gather
>>> feedback on some points, in particular zero-copy and overall user space
>>> interface.
>>>
>>> The design I'm pending towards uses special fds opened by the driver to
>>> transfer data to/from the block driver, preferably through direct
>>> splicing as much as possible, to keep data only in kernel space.  This
>>> is because, in my use case, the driver usually only manipulates
>>> metadata, while data is forwarded directly through the network, or
>>> similar. It would be neat if we can leverage the existing
>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>> disk data to user space, if we can avoid it.  I've also experimented
>>> with regular pipes, But I found no way around keeping a lot of pipes
>>> opened, one for each possible command 'slot'.
>>>
>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'ring' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> 
> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
> 

Ah, cool idea.

> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
> 
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.
> 
Neat. I'll have a look.

Thanks for doing that!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` Ming Lei
  2022-03-28  5:47     ` Kanchan Joshi
  2022-03-28  5:48     ` Hannes Reinecke
@ 2022-03-28 20:20     ` Gabriel Krisman Bertazi
  2022-03-29  0:30       ` Ming Lei
  2022-04-08  6:52     ` Xiaoguang Wang
  3 siblings, 1 reply; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-28 20:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
>
> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
>
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.

Thanks for sharing.  This is a very interesting implementation that
seems to cover quite well the original use case.  I'm giving it a try and
will report back.

> Wrt. this userspace block driver things, I am more interested in the following
> sub-topics:
>
> 1) zero copy
> - the ubd driver[2] needs one data copy: for WRITE request, copy pages
>   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
>   for READ request, the reverse copy is done after READ request is
>   handled by ubdsrv
>
> - I tried to apply zero copy via remap_pfn_range() for avoiding this
>   data copy, but looks it can't work for ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*() which is called in
>   direct io code path
>
> - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
>   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
>   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
>   vm_insert_pages may work, but anonymous pages can not be remapped by
>   vm_insert_pages.
>
> - here the requirement is to remap either anonymous pages or page cache
>   pages into userspace vm, and the mapping/unmapping can be done for
>   each IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in kernel?

I've run into the same issue with my fd implementation and haven't been
able to workaround it.

> 4) apply eBPF in userspace block driver
> - it is one open topic, still not have specific or exact idea yet,
>
> - is there chance to apply ebpf for mapping ubd io into its target handling
> for avoiding data copy and remapping cost for zero copy?

I was thinking of something like this, or having a way for the server to
only operate on the fds and do splice/sendfile.  But, I don't know if it
would be useful for many use cases.  We also want to be able to send the
data to userspace, for instance, for userspace networking.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
@ 2022-03-29  0:30       ` Ming Lei
  2022-03-29 17:20         ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2022-03-29  0:30 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Mon, Mar 28, 2022 at 04:20:03PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> > does cover this case, the userspace part can submit SQEs beforehand
> > for getting notification of each incoming io request from kernel driver,
> > then after one io request is queued to the driver, the driver can
> > queue a CQE for the previous submitted SQE. Recent posted patch of
> > IORING_OP_URING_CMD[1] is perfect for such purpose.
> >
> > I have written one such userspace block driver recently, and [2] is the
> > kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> > Both the two parts look quite simple, but still in very early stage, so
> > far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> > the io command communication channel is done via IORING_OP_URING_CMD, but
> > also IO handling for ubd-loop is implemented via plain io_uring too.
> >
> > It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> > on the ubd block device compared with same xfstests on underlying disk, and
> > my simple performance test on VM shows the result isn't worse than kernel loop
> > driver with dio, or even much better on some test situations.
> 
> Thanks for sharing.  This is a very interesting implementation that
> seems to cover quite well the original use case.  I'm giving it a try and
> will report back.
> 
> > Wrt. this userspace block driver things, I am more interested in the following
> > sub-topics:
> >
> > 1) zero copy
> > - the ubd driver[2] needs one data copy: for WRITE request, copy pages
> >   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
> >   for READ request, the reverse copy is done after READ request is
> >   handled by ubdsrv
> >
> > - I tried to apply zero copy via remap_pfn_range() for avoiding this
> >   data copy, but looks it can't work for ubd driver, since pages in the
> >   remapped vm area can't be retrieved by get_user_pages_*() which is called in
> >   direct io code path
> >
> > - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
> >   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
> >   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
> >   vm_insert_pages may work, but anonymous pages can not be remapped by
> >   vm_insert_pages.
> >
> > - here the requirement is to remap either anonymous pages or page cache
> >   pages into userspace vm, and the mapping/unmapping can be done for
> >   each IO runtime. Is this requirement reasonable? If yes, is there any
> >   easy way to implement it in kernel?
> 
> I've run into the same issue with my fd implementation and haven't been
> able to workaround it.
> 
> > 4) apply eBPF in userspace block driver
> > - it is one open topic, still not have specific or exact idea yet,
> >
> > - is there chance to apply ebpf for mapping ubd io into its target handling
> > for avoiding data copy and remapping cost for zero copy?
> 
> I was thinking of something like this, or having a way for the server to
> only operate on the fds and do splice/sendfile.  But, I don't know if it
> would be useful for many use cases.  We also want to be able to send the
> data to userspace, for instance, for userspace networking.

I understand the big point is that how to pass the io data to ubd driver's
request/bio pages. But splice/sendfile just transfers data between two FDs,
then how can the block request/bio's pages get filled with expected data?
Can you explain a bit in detail?

If block layer is bypassed, it won't be exposed as block disk to userspace.


thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-29  0:30       ` Ming Lei
@ 2022-03-29 17:20         ` Gabriel Krisman Bertazi
  2022-03-30  1:55           ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-29 17:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

>> I was thinking of something like this, or having a way for the server to
>> only operate on the fds and do splice/sendfile.  But, I don't know if it
>> would be useful for many use cases.  We also want to be able to send the
>> data to userspace, for instance, for userspace networking.
>
> I understand the big point is that how to pass the io data to ubd driver's
> request/bio pages. But splice/sendfile just transfers data between two FDs,
> then how can the block request/bio's pages get filled with expected data?
> Can you explain a bit in detail?

Hi Ming,

My idea was to split the control and dataplanes in different file
descriptors.

A queue has a fd that is mapped to a shared memory area where the
request descriptors are.  Submission/completion are done by read/writing
the index of the request on the shared memory area.

For the data plane, each request descriptor in the queue has an
associated file descriptor to be used for data transfer, which is
preallocated at queue creation time.  I'm mapping the bio linearly, from
offset 0, on these descriptors on .queue_rq().  Userspace operates on
these data file descriptors with regular RW syscalls, direct splice to
another fd or pipe, or mmap it to move data around. The data is
available on that fd until IO is completed through the queue fd.  After
an operation is completed, the fds are reused for the next IO on that
queue position.

Hannes has pointed out the issues with fd limits. :)

> If block layer is bypassed, it won't be exposed as block disk to userspace.

I implemented it as a block-mq driver, but it still only supports one
queue.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-29 17:20         ` Gabriel Krisman Bertazi
@ 2022-03-30  1:55           ` Ming Lei
  2022-03-30 18:22             ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2022-03-30  1:55 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> >> I was thinking of something like this, or having a way for the server to
> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
> >> would be useful for many use cases.  We also want to be able to send the
> >> data to userspace, for instance, for userspace networking.
> >
> > I understand the big point is that how to pass the io data to ubd driver's
> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> > then how can the block request/bio's pages get filled with expected data?
> > Can you explain a bit in detail?
> 
> Hi Ming,
> 
> My idea was to split the control and dataplanes in different file
> descriptors.
> 
> A queue has a fd that is mapped to a shared memory area where the
> request descriptors are.  Submission/completion are done by read/writing
> the index of the request on the shared memory area.
> 
> For the data plane, each request descriptor in the queue has an
> associated file descriptor to be used for data transfer, which is
> preallocated at queue creation time.  I'm mapping the bio linearly, from
> offset 0, on these descriptors on .queue_rq().  Userspace operates on
> these data file descriptors with regular RW syscalls, direct splice to
> another fd or pipe, or mmap it to move data around. The data is
> available on that fd until IO is completed through the queue fd.  After
> an operation is completed, the fds are reused for the next IO on that
> queue position.
> 
> Hannes has pointed out the issues with fd limits. :)

OK, thanks for the detailed explanation!

Also you may switch to map each request queue/disk into a FD, and every
request is mapped to one fixed extent of the 'file' via rq->tag since we
have max sectors limit for each request, then fd limits can be avoided.

But I am wondering if this way is friendly to userspace side implementation,
since there isn't buffer, only FDs visible to userspace.


thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-30  1:55           ` Ming Lei
@ 2022-03-30 18:22             ` Gabriel Krisman Bertazi
  2022-03-31  1:38               ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Gabriel Krisman Bertazi @ 2022-03-30 18:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

Ming Lei <ming.lei@redhat.com> writes:

> On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
>> Ming Lei <ming.lei@redhat.com> writes:
>> 
>> >> I was thinking of something like this, or having a way for the server to
>> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
>> >> would be useful for many use cases.  We also want to be able to send the
>> >> data to userspace, for instance, for userspace networking.
>> >
>> > I understand the big point is that how to pass the io data to ubd driver's
>> > request/bio pages. But splice/sendfile just transfers data between two FDs,
>> > then how can the block request/bio's pages get filled with expected data?
>> > Can you explain a bit in detail?
>> 
>> Hi Ming,
>> 
>> My idea was to split the control and dataplanes in different file
>> descriptors.
>> 
>> A queue has a fd that is mapped to a shared memory area where the
>> request descriptors are.  Submission/completion are done by read/writing
>> the index of the request on the shared memory area.
>> 
>> For the data plane, each request descriptor in the queue has an
>> associated file descriptor to be used for data transfer, which is
>> preallocated at queue creation time.  I'm mapping the bio linearly, from
>> offset 0, on these descriptors on .queue_rq().  Userspace operates on
>> these data file descriptors with regular RW syscalls, direct splice to
>> another fd or pipe, or mmap it to move data around. The data is
>> available on that fd until IO is completed through the queue fd.  After
>> an operation is completed, the fds are reused for the next IO on that
>> queue position.
>> 
>> Hannes has pointed out the issues with fd limits. :)
>
> OK, thanks for the detailed explanation!
>
> Also you may switch to map each request queue/disk into a FD, and every
> request is mapped to one fixed extent of the 'file' via rq->tag since we
> have max sectors limit for each request, then fd limits can be avoided.
>
> But I am wondering if this way is friendly to userspace side implementation,
> since there isn't buffer, only FDs visible to userspace.

The advantages would be not mapping the request data in userspace if we
could avoid it, since it would be possible to just forward the data
inside the kernel.  But my latest understanding is that most use cases
will want to directly manipulate the data anyway, maybe to checksum, or
even for sending through userspace networking.  It is not clear to me
anymore that we'd benefit from not always mapping the requests to
userspace.

I've been looking at your implementation and I really like how simple it
is. I think it's the most promising approach for this feature I've
reviewed so far.  I'd like to send you a few patches for bugs I found
when testing it and keep working on making it upstreamable.  How can I
send you those patches?  Is it fine to just email you or should I also
cc linux-block, even though this is yet out-of-tree code?

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-30 18:22             ` Gabriel Krisman Bertazi
@ 2022-03-31  1:38               ` Ming Lei
  2022-03-31  3:49                 ` Bart Van Assche
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2022-03-31  1:38 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On Wed, Mar 30, 2022 at 02:22:20PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > On Tue, Mar 29, 2022 at 01:20:57PM -0400, Gabriel Krisman Bertazi wrote:
> >> Ming Lei <ming.lei@redhat.com> writes:
> >> 
> >> >> I was thinking of something like this, or having a way for the server to
> >> >> only operate on the fds and do splice/sendfile.  But, I don't know if it
> >> >> would be useful for many use cases.  We also want to be able to send the
> >> >> data to userspace, for instance, for userspace networking.
> >> >
> >> > I understand the big point is that how to pass the io data to ubd driver's
> >> > request/bio pages. But splice/sendfile just transfers data between two FDs,
> >> > then how can the block request/bio's pages get filled with expected data?
> >> > Can you explain a bit in detail?
> >> 
> >> Hi Ming,
> >> 
> >> My idea was to split the control and dataplanes in different file
> >> descriptors.
> >> 
> >> A queue has a fd that is mapped to a shared memory area where the
> >> request descriptors are.  Submission/completion are done by read/writing
> >> the index of the request on the shared memory area.
> >> 
> >> For the data plane, each request descriptor in the queue has an
> >> associated file descriptor to be used for data transfer, which is
> >> preallocated at queue creation time.  I'm mapping the bio linearly, from
> >> offset 0, on these descriptors on .queue_rq().  Userspace operates on
> >> these data file descriptors with regular RW syscalls, direct splice to
> >> another fd or pipe, or mmap it to move data around. The data is
> >> available on that fd until IO is completed through the queue fd.  After
> >> an operation is completed, the fds are reused for the next IO on that
> >> queue position.
> >> 
> >> Hannes has pointed out the issues with fd limits. :)
> >
> > OK, thanks for the detailed explanation!
> >
> > Also you may switch to map each request queue/disk into a FD, and every
> > request is mapped to one fixed extent of the 'file' via rq->tag since we
> > have max sectors limit for each request, then fd limits can be avoided.
> >
> > But I am wondering if this way is friendly to userspace side implementation,
> > since there isn't buffer, only FDs visible to userspace.
> 
> The advantages would be not mapping the request data in userspace if we
> could avoid it, since it would be possible to just forward the data
> inside the kernel.  But my latest understanding is that most use cases
> will want to directly manipulate the data anyway, maybe to checksum, or
> even for sending through userspace networking.  It is not clear to me
> anymore that we'd benefit from not always mapping the requests to
> userspace.

Yeah, I think it is more flexible or usable to allow userspace to
operate on data directly as one generic solution, such as, implement one disk
to read/write on qcow2 image, or read from/write to network by parsing
protocol, or whatever.

> I've been looking at your implementation and I really like how simple it
> is. I think it's the most promising approach for this feature I've
> reviewed so far.  I'd like to send you a few patches for bugs I found
> when testing it and keep working on making it upstreamable.  How can I
> send you those patches?  Is it fine to just email you or should I also
> cc linux-block, even though this is yet out-of-tree code?

The topic has been discussed for a bit long, and looks people are still
interested in it, so I prefer to send out patches on linux-block if no
one objects. Then we can still discuss further when reviewing patches.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-31  1:38               ` Ming Lei
@ 2022-03-31  3:49                 ` Bart Van Assche
  0 siblings, 0 replies; 54+ messages in thread
From: Bart Van Assche @ 2022-03-31  3:49 UTC (permalink / raw)
  To: Ming Lei, Gabriel Krisman Bertazi
  Cc: Hannes Reinecke, lsf-pc, linux-block, Xiaoguang Wang, linux-mm

On 3/30/22 18:38, Ming Lei wrote:
> The topic has been discussed for a bit long, and looks people are still
> interested in it, so I prefer to send out patches on linux-block if no
> one objects. Then we can still discuss further when reviewing patches.

I'm in favor of the above proposal :-)

Bart.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-03-27 16:35   ` Ming Lei
                       ` (2 preceding siblings ...)
  2022-03-28 20:20     ` Gabriel Krisman Bertazi
@ 2022-04-08  6:52     ` Xiaoguang Wang
  2022-04-08  7:44       ` Ming Lei
  3 siblings, 1 reply; 54+ messages in thread
From: Xiaoguang Wang @ 2022-04-08  6:52 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Gabriel Krisman Bertazi, lsf-pc, linux-block, linux-mm

hi,

> On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>>> I'd like to discuss an interface to implement user space block devices,
>>> while avoiding local network NBD solutions.  There has been reiterated
>>> interest in the topic, both from researchers [1] and from the community,
>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>> happened).
>>>
>>> I've been working on top of the Google iblock implementation to find
>>> something upstreamable and would like to present my design and gather
>>> feedback on some points, in particular zero-copy and overall user space
>>> interface.
>>>
>>> The design I'm pending towards uses special fds opened by the driver to
>>> transfer data to/from the block driver, preferably through direct
>>> splicing as much as possible, to keep data only in kernel space.  This
>>> is because, in my use case, the driver usually only manipulates
>>> metadata, while data is forwarded directly through the network, or
>>> similar. It would be neat if we can leverage the existing
>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>> disk data to user space, if we can avoid it.  I've also experimented
>>> with regular pipes, But I found no way around keeping a lot of pipes
>>> opened, one for each possible command 'slot'.
>>>
>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>> Actually, I'd rather have something like an 'inverse io_uring', where an
>> application creates a memory region separated into several 'ring' for
>> submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> application can read from there.
>> Maybe it'll be worthwhile to look at virtio here.
> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
>
> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> the io command communication channel is done via IORING_OP_URING_CMD, but
> also IO handling for ubd-loop is implemented via plain io_uring too.
>
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.
I also have spent time to learn your codes, its idea is really good, thanks for this
great work. Though we're using tcmu, indeed we just needs a simple block device
based on block semantics. Tcmu is based on scsi protocol, which is somewhat
complicated and influences small io request performance. So if you like, we're
willing to participate this project, and may use it in our internal business, thanks.

Another little question, why you use raw io_uring interface rather than liburing?
Are there any special reasons?

Regards,
Xiaoguang Wang
>
> Wrt. this userspace block driver things, I am more interested in the following
> sub-topics:
>
> 1) zero copy
> - the ubd driver[2] needs one data copy: for WRITE request, copy pages
>   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
>   for READ request, the reverse copy is done after READ request is
>   handled by ubdsrv
>
> - I tried to apply zero copy via remap_pfn_range() for avoiding this
>   data copy, but looks it can't work for ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*() which is called in
>   direct io code path
>
> - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
>   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
>   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
>   vm_insert_pages may work, but anonymous pages can not be remapped by
>   vm_insert_pages.
>
> - here the requirement is to remap either anonymous pages or page cache
>   pages into userspace vm, and the mapping/unmapping can be done for
>   each IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in kernel?
>
> 2) batching queueing io_uring CQEs
>
> - for ubd driver, batching is very sensitive to performance per my
>   observation, if we can run batch queueing IORING_OP_URING_CMD CQEs,
>   ubd_queue_rqs() can be wirted to the batching CQEs, then the whole batch
>   only takes one io_uring_enter().
>
> - not digging into io_uring code for this interface yet, but looks not
>   see such interface
>
> 3) requirement on userspace block driver
> - exact requirements from user viewpoint
>
> 4) apply eBPF in userspace block driver
> - it is one open topic, still not have specific or exact idea yet,
>
> - is there chance to apply ebpf for mapping ubd io into its target handling
> for avoiding data copy and remapping cost for zero copy?
>
> I am happy to join the virtual discussion on lsf/mm if there is and it
> is possible.
>
> [1] https://lore.kernel.org/linux-block/20220308152105.309618-1-joshi.k@samsung.com/#r
> [2] https://github.com/ming1/linux/tree/v5.17-ubd-dev
> [3] https://github.com/ming1/ubdsrv
> [4] https://lore.kernel.org/linux-block/abbe51c4-873f-e96e-d421-85906689a55a@gmail.com/#r
>
> Thanks,
> Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-04-08  6:52     ` Xiaoguang Wang
@ 2022-04-08  7:44       ` Ming Lei
  0 siblings, 0 replies; 54+ messages in thread
From: Ming Lei @ 2022-04-08  7:44 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: Hannes Reinecke, Gabriel Krisman Bertazi, lsf-pc, linux-block, linux-mm

On Fri, Apr 08, 2022 at 02:52:35PM +0800, Xiaoguang Wang wrote:
> hi,
> 
> > On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
> >> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
> >>> I'd like to discuss an interface to implement user space block devices,
> >>> while avoiding local network NBD solutions.  There has been reiterated
> >>> interest in the topic, both from researchers [1] and from the community,
> >>> including a proposed session in LSFMM2018 [2] (though I don't think it
> >>> happened).
> >>>
> >>> I've been working on top of the Google iblock implementation to find
> >>> something upstreamable and would like to present my design and gather
> >>> feedback on some points, in particular zero-copy and overall user space
> >>> interface.
> >>>
> >>> The design I'm pending towards uses special fds opened by the driver to
> >>> transfer data to/from the block driver, preferably through direct
> >>> splicing as much as possible, to keep data only in kernel space.  This
> >>> is because, in my use case, the driver usually only manipulates
> >>> metadata, while data is forwarded directly through the network, or
> >>> similar. It would be neat if we can leverage the existing
> >>> splice/copy_file_range syscalls such that we don't ever need to bring
> >>> disk data to user space, if we can avoid it.  I've also experimented
> >>> with regular pipes, But I found no way around keeping a lot of pipes
> >>> opened, one for each possible command 'slot'.
> >>>
> >>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> >>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> >>>
> >> Actually, I'd rather have something like an 'inverse io_uring', where an
> >> application creates a memory region separated into several 'ring' for
> >> submission and completion.
> >> Then the kernel could write/map the incoming data onto the rings, and
> >> application can read from there.
> >> Maybe it'll be worthwhile to look at virtio here.
> > IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> > does cover this case, the userspace part can submit SQEs beforehand
> > for getting notification of each incoming io request from kernel driver,
> > then after one io request is queued to the driver, the driver can
> > queue a CQE for the previous submitted SQE. Recent posted patch of
> > IORING_OP_URING_CMD[1] is perfect for such purpose.
> >
> > I have written one such userspace block driver recently, and [2] is the
> > kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> > Both the two parts look quite simple, but still in very early stage, so
> > far only ubd-loop and ubd-null targets are implemented in [3]. Not only
> > the io command communication channel is done via IORING_OP_URING_CMD, but
> > also IO handling for ubd-loop is implemented via plain io_uring too.
> >
> > It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> > on the ubd block device compared with same xfstests on underlying disk, and
> > my simple performance test on VM shows the result isn't worse than kernel loop
> > driver with dio, or even much better on some test situations.
> I also have spent time to learn your codes, its idea is really good, thanks for this
> great work. Though we're using tcmu, indeed we just needs a simple block device
> based on block semantics. Tcmu is based on scsi protocol, which is somewhat
> complicated and influences small io request performance. So if you like, we're
> willing to participate this project, and may use it in our internal business, thanks.

That is great, and welcome to participate! Glad to see there is real potential
user of userspace block device.

I believe there are lots of thing to do in this area, but so far:

1) consolidate the interface between ubd driver and ubdsrv, since this
part is kabi

2) consolidate design in ubdsrv(userspace part), so that we can support
different backing or target easily, one idea is to handle all io request
via io_uring.

3) consolidate design in ubdsrv for providing stable interface to support
advanced languages(python, rust, ...), and inevitable one new complicated
target/backing should be developed meantime, such as qcow2, or other
real/popular device.

I plan to post driver formal patches out after the patchset of io_uring
command interface is merged, but maybe we can make it soon for early
review.

And the driver side should be kept as simple as possible, and as
efficient as possible. It just focuses on : forward io request
to userspace and handle data copy or zero copy, and ubd driver won't store
any state of backing/target. Also actual performance is really sensitive with
batching handling. Recently, I take task_work_add() to improve batching, and
easy to observe performance boot. Another related part is how to implement
zero copy, which exists on tcmu or other projects too.

> 
> Another little question, why you use raw io_uring interface rather than liburing?
> Are there any special reasons?

It is just for building ubdsrv easily without any dependency, and it
definitely will switch to liburing. And the change should be quite simple,
since the related glue code is put in one source file, and the current
interface is similar with liburing's too.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-02-24  0:58         ` Gao Xiang
@ 2022-06-09  2:01           ` Ming Lei
  2022-06-09  2:28             ` Gao Xiang
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2022-06-09  2:01 UTC (permalink / raw)
  To: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block, hsiangkao
  Cc: Pavel Machek, linux-fsdevel

On Thu, Feb 24, 2022 at 08:58:33AM +0800, Gao Xiang wrote:
> On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> > On 2/23/22 17:11, Gao Xiang wrote:
> > > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> > >> On 2/23/22 14:57, Gao Xiang wrote:
> > >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> > >>>> I'd like to discuss an interface to implement user space block devices,
> > >>>> while avoiding local network NBD solutions.  There has been reiterated
> > >>>> interest in the topic, both from researchers [1] and from the community,
> > >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> > >>>> happened).
> > >>>>
> > >>>> I've been working on top of the Google iblock implementation to find
> > >>>> something upstreamable and would like to present my design and gather
> > >>>> feedback on some points, in particular zero-copy and overall user space
> > >>>> interface.
> > >>>>
> > >>>> The design I'm pending towards uses special fds opened by the driver to
> > >>>> transfer data to/from the block driver, preferably through direct
> > >>>> splicing as much as possible, to keep data only in kernel space.  This
> > >>>> is because, in my use case, the driver usually only manipulates
> > >>>> metadata, while data is forwarded directly through the network, or
> > >>>> similar. It would be neat if we can leverage the existing
> > >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> > >>>> disk data to user space, if we can avoid it.  I've also experimented
> > >>>> with regular pipes, But I found no way around keeping a lot of pipes
> > >>>> opened, one for each possible command 'slot'.
> > >>>>
> > >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > >>>
> > >>> I'm interested in this general topic too. One of our use cases is
> > >>> that we need to process network data in some degree since many
> > >>> protocols are application layer protocols so it seems more reasonable
> > >>> to process such protocols in userspace. And another difference is that
> > >>> we may have thousands of devices in a machine since we'd better to run
> > >>> containers as many as possible so the block device solution seems
> > >>> suboptimal to us. Yet I'm still interested in this topic to get more
> > >>> ideas.
> > >>>
> > >>> Btw, As for general userspace block device solutions, IMHO, there could
> > >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> > >>> implementation due to writeback user requests can be tripped back to
> > >>> the kernel side (even the dependency crosses threads). I think they are
> > >>> somewhat hard to fix with user block device solutions. For example,
> > >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> > >>
> > >> This is already fixed with prctl() support. See:
> > >>
> > >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > > 
> > > As I mentioned above, IMHO, we could add some per-task state to avoid
> > > the majority of such deadlock cases (also what I mentioned above), but
> > > there may still some potential dependency could happen between threads,
> > > such as using another kernel workqueue and waiting on it (in principle
> > > at least) since userspace program can call any syscall in principle (
> > > which doesn't like in-kernel drivers). So I think it can cause some
> > > risk due to generic userspace block device restriction, please kindly
> > > correct me if I'm wrong.
> > 
> > Not sure what you mean with all this. prctl() works per process/thread
> > and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> > set. So for the case of a user block device driver, setting this means
> > that it cannot reenter itself during a memory allocation, regardless of
> > the system call it executes (FS etc): all memory allocations in any
> > syscall executed by the context will have GFP_NOIO.
> 
> I mean,
> 
> assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
> but since it can call any valid system call, therefore, after it
> received data due to direct reclaim and writeback, it is still
> allowed to call some system call which may do something as follows:
> 
>    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> 
>    (call some syscall which)
> 
>    submit something to Thread B
>                                   
>                                   ... (do something)
> 
>                                   memory allocation with GFP_KERNEL (it
>                                   may trigger direct memory reclaim
>                                   again and reenter the original fs.)
> 
>                                   wake up Thread A
> 
>    wait Thread B to complete
> 
> Normally such system call won't cause any problem since userspace
> programs cannot be in a context out of writeback and direct reclaim.
> Yet I'm not sure if it works under userspace block driver
> writeback/direct reclaim cases.

Hi Gao Xiang,

I'd rather to reply you in this original thread, and the recent
discussion is from the following link:

https://lore.kernel.org/linux-block/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/

kernel loop & nbd is really in the same situation.

For example of kernel loop, PF_MEMALLOC_NOIO is added in commit
d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"),
so loop's worker thread can be thought as the above Thread A, and
of course, writeback/swapout IO can reach the loop worker thread(
the above Thread A), then loop just calls into FS from the worker
thread for handling the loop IO, that is same with user space driver's
case, and the kernel 'thread B' should be in FS code.

Your theory might be true, but it does depend on FS's implementation,
and we don't see such report in reality.

Also you didn't mentioned that what kernel thread B exactly is? And what
the allocation is in kernel thread B.

If you have actual report, I am happy to take account into it, otherwise not
sure if it is worth of time/effort in thinking/addressing one pure theoretical
concern.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-06-09  2:01           ` Ming Lei
@ 2022-06-09  2:28             ` Gao Xiang
  2022-06-09  4:06               ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Gao Xiang @ 2022-06-09  2:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Pavel Machek, linux-fsdevel

On Thu, Jun 09, 2022 at 10:01:23AM +0800, Ming Lei wrote:
> On Thu, Feb 24, 2022 at 08:58:33AM +0800, Gao Xiang wrote:
> > On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> > > On 2/23/22 17:11, Gao Xiang wrote:
> > > > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> > > >> On 2/23/22 14:57, Gao Xiang wrote:
> > > >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> > > >>>> I'd like to discuss an interface to implement user space block devices,
> > > >>>> while avoiding local network NBD solutions.  There has been reiterated
> > > >>>> interest in the topic, both from researchers [1] and from the community,
> > > >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> > > >>>> happened).
> > > >>>>
> > > >>>> I've been working on top of the Google iblock implementation to find
> > > >>>> something upstreamable and would like to present my design and gather
> > > >>>> feedback on some points, in particular zero-copy and overall user space
> > > >>>> interface.
> > > >>>>
> > > >>>> The design I'm pending towards uses special fds opened by the driver to
> > > >>>> transfer data to/from the block driver, preferably through direct
> > > >>>> splicing as much as possible, to keep data only in kernel space.  This
> > > >>>> is because, in my use case, the driver usually only manipulates
> > > >>>> metadata, while data is forwarded directly through the network, or
> > > >>>> similar. It would be neat if we can leverage the existing
> > > >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> > > >>>> disk data to user space, if we can avoid it.  I've also experimented
> > > >>>> with regular pipes, But I found no way around keeping a lot of pipes
> > > >>>> opened, one for each possible command 'slot'.
> > > >>>>
> > > >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > > >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > > >>>
> > > >>> I'm interested in this general topic too. One of our use cases is
> > > >>> that we need to process network data in some degree since many
> > > >>> protocols are application layer protocols so it seems more reasonable
> > > >>> to process such protocols in userspace. And another difference is that
> > > >>> we may have thousands of devices in a machine since we'd better to run
> > > >>> containers as many as possible so the block device solution seems
> > > >>> suboptimal to us. Yet I'm still interested in this topic to get more
> > > >>> ideas.
> > > >>>
> > > >>> Btw, As for general userspace block device solutions, IMHO, there could
> > > >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> > > >>> implementation due to writeback user requests can be tripped back to
> > > >>> the kernel side (even the dependency crosses threads). I think they are
> > > >>> somewhat hard to fix with user block device solutions. For example,
> > > >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> > > >>
> > > >> This is already fixed with prctl() support. See:
> > > >>
> > > >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > > > 
> > > > As I mentioned above, IMHO, we could add some per-task state to avoid
> > > > the majority of such deadlock cases (also what I mentioned above), but
> > > > there may still some potential dependency could happen between threads,
> > > > such as using another kernel workqueue and waiting on it (in principle
> > > > at least) since userspace program can call any syscall in principle (
> > > > which doesn't like in-kernel drivers). So I think it can cause some
> > > > risk due to generic userspace block device restriction, please kindly
> > > > correct me if I'm wrong.
> > > 
> > > Not sure what you mean with all this. prctl() works per process/thread
> > > and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> > > set. So for the case of a user block device driver, setting this means
> > > that it cannot reenter itself during a memory allocation, regardless of
> > > the system call it executes (FS etc): all memory allocations in any
> > > syscall executed by the context will have GFP_NOIO.
> > 
> > I mean,
> > 
> > assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
> > but since it can call any valid system call, therefore, after it
> > received data due to direct reclaim and writeback, it is still
> > allowed to call some system call which may do something as follows:
> > 
> >    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> > 
> >    (call some syscall which)
> > 
> >    submit something to Thread B
> >                                   
> >                                   ... (do something)
> > 
> >                                   memory allocation with GFP_KERNEL (it
> >                                   may trigger direct memory reclaim
> >                                   again and reenter the original fs.)
> > 
> >                                   wake up Thread A
> > 
> >    wait Thread B to complete
> > 
> > Normally such system call won't cause any problem since userspace
> > programs cannot be in a context out of writeback and direct reclaim.
> > Yet I'm not sure if it works under userspace block driver
> > writeback/direct reclaim cases.
> 
> Hi Gao Xiang,
> 
> I'd rather to reply you in this original thread, and the recent
> discussion is from the following link:
> 
> https://lore.kernel.org/linux-block/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/
> 
> kernel loop & nbd is really in the same situation.
> 
> For example of kernel loop, PF_MEMALLOC_NOIO is added in commit
> d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"),
> so loop's worker thread can be thought as the above Thread A, and
> of course, writeback/swapout IO can reach the loop worker thread(
> the above Thread A), then loop just calls into FS from the worker
> thread for handling the loop IO, that is same with user space driver's
> case, and the kernel 'thread B' should be in FS code.
> 
> Your theory might be true, but it does depend on FS's implementation,
> and we don't see such report in reality.
> 
> Also you didn't mentioned that what kernel thread B exactly is? And what
> the allocation is in kernel thread B.
> 
> If you have actual report, I am happy to take account into it, otherwise not
> sure if it is worth of time/effort in thinking/addressing one pure theoretical
> concern.

Hi Ming,

Thanks for your look & reply.

That is not a wild guess. That is a basic difference between
in-kernel native block-based drivers and user-space block drivers.

That is userspace block driver can call _any_ system call if they want.
Since users can call any system call and any _new_ system call can be
introduced later, you have to audit all system calls "Which are safe
and which are _not_ safe" all the time. Otherwise, attacker can make
use of it to hung the system if such userspace driver is used widely.

IOWs, in my humble opinion, that is quite a fundamental security
concern of all userspace block drivers.

Actually, you cannot ignore block I/O requests if they actually push
into block layer, since that is too late if I/O actually is submitted
by some FS. And you don't even know which type of such I/O is.

On the other side, user-space FS implementations can avoid this since
FS can know if under direct reclaim and don't do such I/O requests.

Thanks,
Gao Xiangn

> 
> 
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-06-09  2:28             ` Gao Xiang
@ 2022-06-09  4:06               ` Ming Lei
  2022-06-09  4:55                 ` Gao Xiang
  2022-07-28  8:23                 ` Pavel Machek
  0 siblings, 2 replies; 54+ messages in thread
From: Ming Lei @ 2022-06-09  4:06 UTC (permalink / raw)
  To: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Pavel Machek, linux-fsdevel

On Thu, Jun 09, 2022 at 10:28:02AM +0800, Gao Xiang wrote:
> On Thu, Jun 09, 2022 at 10:01:23AM +0800, Ming Lei wrote:
> > On Thu, Feb 24, 2022 at 08:58:33AM +0800, Gao Xiang wrote:
> > > On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> > > > On 2/23/22 17:11, Gao Xiang wrote:
> > > > > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> > > > >> On 2/23/22 14:57, Gao Xiang wrote:
> > > > >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> > > > >>>> I'd like to discuss an interface to implement user space block devices,
> > > > >>>> while avoiding local network NBD solutions.  There has been reiterated
> > > > >>>> interest in the topic, both from researchers [1] and from the community,
> > > > >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> > > > >>>> happened).
> > > > >>>>
> > > > >>>> I've been working on top of the Google iblock implementation to find
> > > > >>>> something upstreamable and would like to present my design and gather
> > > > >>>> feedback on some points, in particular zero-copy and overall user space
> > > > >>>> interface.
> > > > >>>>
> > > > >>>> The design I'm pending towards uses special fds opened by the driver to
> > > > >>>> transfer data to/from the block driver, preferably through direct
> > > > >>>> splicing as much as possible, to keep data only in kernel space.  This
> > > > >>>> is because, in my use case, the driver usually only manipulates
> > > > >>>> metadata, while data is forwarded directly through the network, or
> > > > >>>> similar. It would be neat if we can leverage the existing
> > > > >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> > > > >>>> disk data to user space, if we can avoid it.  I've also experimented
> > > > >>>> with regular pipes, But I found no way around keeping a lot of pipes
> > > > >>>> opened, one for each possible command 'slot'.
> > > > >>>>
> > > > >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > > > >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > > > >>>
> > > > >>> I'm interested in this general topic too. One of our use cases is
> > > > >>> that we need to process network data in some degree since many
> > > > >>> protocols are application layer protocols so it seems more reasonable
> > > > >>> to process such protocols in userspace. And another difference is that
> > > > >>> we may have thousands of devices in a machine since we'd better to run
> > > > >>> containers as many as possible so the block device solution seems
> > > > >>> suboptimal to us. Yet I'm still interested in this topic to get more
> > > > >>> ideas.
> > > > >>>
> > > > >>> Btw, As for general userspace block device solutions, IMHO, there could
> > > > >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> > > > >>> implementation due to writeback user requests can be tripped back to
> > > > >>> the kernel side (even the dependency crosses threads). I think they are
> > > > >>> somewhat hard to fix with user block device solutions. For example,
> > > > >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> > > > >>
> > > > >> This is already fixed with prctl() support. See:
> > > > >>
> > > > >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > > > > 
> > > > > As I mentioned above, IMHO, we could add some per-task state to avoid
> > > > > the majority of such deadlock cases (also what I mentioned above), but
> > > > > there may still some potential dependency could happen between threads,
> > > > > such as using another kernel workqueue and waiting on it (in principle
> > > > > at least) since userspace program can call any syscall in principle (
> > > > > which doesn't like in-kernel drivers). So I think it can cause some
> > > > > risk due to generic userspace block device restriction, please kindly
> > > > > correct me if I'm wrong.
> > > > 
> > > > Not sure what you mean with all this. prctl() works per process/thread
> > > > and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> > > > set. So for the case of a user block device driver, setting this means
> > > > that it cannot reenter itself during a memory allocation, regardless of
> > > > the system call it executes (FS etc): all memory allocations in any
> > > > syscall executed by the context will have GFP_NOIO.
> > > 
> > > I mean,
> > > 
> > > assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
> > > but since it can call any valid system call, therefore, after it
> > > received data due to direct reclaim and writeback, it is still
> > > allowed to call some system call which may do something as follows:
> > > 
> > >    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> > > 
> > >    (call some syscall which)
> > > 
> > >    submit something to Thread B
> > >                                   
> > >                                   ... (do something)
> > > 
> > >                                   memory allocation with GFP_KERNEL (it
> > >                                   may trigger direct memory reclaim
> > >                                   again and reenter the original fs.)
> > > 
> > >                                   wake up Thread A
> > > 
> > >    wait Thread B to complete
> > > 
> > > Normally such system call won't cause any problem since userspace
> > > programs cannot be in a context out of writeback and direct reclaim.
> > > Yet I'm not sure if it works under userspace block driver
> > > writeback/direct reclaim cases.
> > 
> > Hi Gao Xiang,
> > 
> > I'd rather to reply you in this original thread, and the recent
> > discussion is from the following link:
> > 
> > https://lore.kernel.org/linux-block/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/
> > 
> > kernel loop & nbd is really in the same situation.
> > 
> > For example of kernel loop, PF_MEMALLOC_NOIO is added in commit
> > d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"),
> > so loop's worker thread can be thought as the above Thread A, and
> > of course, writeback/swapout IO can reach the loop worker thread(
> > the above Thread A), then loop just calls into FS from the worker
> > thread for handling the loop IO, that is same with user space driver's
> > case, and the kernel 'thread B' should be in FS code.
> > 
> > Your theory might be true, but it does depend on FS's implementation,
> > and we don't see such report in reality.
> > 
> > Also you didn't mentioned that what kernel thread B exactly is? And what
> > the allocation is in kernel thread B.
> > 
> > If you have actual report, I am happy to take account into it, otherwise not
> > sure if it is worth of time/effort in thinking/addressing one pure theoretical
> > concern.
> 
> Hi Ming,
> 
> Thanks for your look & reply.
> 
> That is not a wild guess. That is a basic difference between
> in-kernel native block-based drivers and user-space block drivers.

Please look at my comment, wrt. your pure theoretical concern, userspace
block driver is same with kernel loop/nbd.

Did you see such report on loop & nbd? Can you answer my questions wrt.
kernel thread B?

> 
> That is userspace block driver can call _any_ system call if they want.
> Since users can call any system call and any _new_ system call can be
> introduced later, you have to audit all system calls "Which are safe
> and which are _not_ safe" all the time. Otherwise, attacker can make

Isn't nbd server capable of calling any system call? Is there any
security risk for nbd?

> use of it to hung the system if such userspace driver is used widely.

From the beginning, only ADMIN can create ubd, that is same with
nbd/loop, and it gets default permission as disk device.

ubd is really in same situation with nbd wrt. security, the only difference
is just that nbd uses socket for communication, and ubd uses io_uring, that
is all.

Yeah, Stefan Hajnoczi and I discussed to make ubd as one container
block device, so normal user can create & use ubd, but it won't be done
from the beginning, and won't be enabled until the potential security
risks are addressed, and there should be more limits on ubd when normal user
can create & use it, such as:

- not allow unprivileged ubd device to be mounted
- not allow unprivileged ubd device's partition table to be read from
  kernel
- not support buffered io for unprivileged ubd device, and only direct io
  is allowed
- maybe more limit for minimizing security risk.

> 
> IOWs, in my humble opinion, that is quite a fundamental security
> concern of all userspace block drivers.

But nbd is still there and widely used, and there are lots of people who
shows interest in userspace block device. Then think about who is wrong?

As one userspace block driver, it is normal to see some limits there,
but I don't agree that there is fundamental security issue.

> 
> Actually, you cannot ignore block I/O requests if they actually push

Who wants to ignore block I/O? And why ignore it?

> into block layer, since that is too late if I/O actually is submitted
> by some FS. And you don't even know which type of such I/O is.

We do know the I/O type.

> 
> On the other side, user-space FS implementations can avoid this since
> FS can know if under direct reclaim and don't do such I/O requests.

But it is nothing to do with userspace block device.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-06-09  4:06               ` Ming Lei
@ 2022-06-09  4:55                 ` Gao Xiang
  2022-06-10  1:52                   ` Ming Lei
  2022-07-28  8:23                 ` Pavel Machek
  1 sibling, 1 reply; 54+ messages in thread
From: Gao Xiang @ 2022-06-09  4:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Pavel Machek, linux-fsdevel

On Thu, Jun 09, 2022 at 12:06:48PM +0800, Ming Lei wrote:
> On Thu, Jun 09, 2022 at 10:28:02AM +0800, Gao Xiang wrote:
> > On Thu, Jun 09, 2022 at 10:01:23AM +0800, Ming Lei wrote:
> > > On Thu, Feb 24, 2022 at 08:58:33AM +0800, Gao Xiang wrote:
> > > > On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> > > > > On 2/23/22 17:11, Gao Xiang wrote:
> > > > > > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> > > > > >> On 2/23/22 14:57, Gao Xiang wrote:
> > > > > >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> > > > > >>>> I'd like to discuss an interface to implement user space block devices,
> > > > > >>>> while avoiding local network NBD solutions.  There has been reiterated
> > > > > >>>> interest in the topic, both from researchers [1] and from the community,
> > > > > >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> > > > > >>>> happened).
> > > > > >>>>
> > > > > >>>> I've been working on top of the Google iblock implementation to find
> > > > > >>>> something upstreamable and would like to present my design and gather
> > > > > >>>> feedback on some points, in particular zero-copy and overall user space
> > > > > >>>> interface.
> > > > > >>>>
> > > > > >>>> The design I'm pending towards uses special fds opened by the driver to
> > > > > >>>> transfer data to/from the block driver, preferably through direct
> > > > > >>>> splicing as much as possible, to keep data only in kernel space.  This
> > > > > >>>> is because, in my use case, the driver usually only manipulates
> > > > > >>>> metadata, while data is forwarded directly through the network, or
> > > > > >>>> similar. It would be neat if we can leverage the existing
> > > > > >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> > > > > >>>> disk data to user space, if we can avoid it.  I've also experimented
> > > > > >>>> with regular pipes, But I found no way around keeping a lot of pipes
> > > > > >>>> opened, one for each possible command 'slot'.
> > > > > >>>>
> > > > > >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > > > > >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > > > > >>>
> > > > > >>> I'm interested in this general topic too. One of our use cases is
> > > > > >>> that we need to process network data in some degree since many
> > > > > >>> protocols are application layer protocols so it seems more reasonable
> > > > > >>> to process such protocols in userspace. And another difference is that
> > > > > >>> we may have thousands of devices in a machine since we'd better to run
> > > > > >>> containers as many as possible so the block device solution seems
> > > > > >>> suboptimal to us. Yet I'm still interested in this topic to get more
> > > > > >>> ideas.
> > > > > >>>
> > > > > >>> Btw, As for general userspace block device solutions, IMHO, there could
> > > > > >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> > > > > >>> implementation due to writeback user requests can be tripped back to
> > > > > >>> the kernel side (even the dependency crosses threads). I think they are
> > > > > >>> somewhat hard to fix with user block device solutions. For example,
> > > > > >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> > > > > >>
> > > > > >> This is already fixed with prctl() support. See:
> > > > > >>
> > > > > >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > > > > > 
> > > > > > As I mentioned above, IMHO, we could add some per-task state to avoid
> > > > > > the majority of such deadlock cases (also what I mentioned above), but
> > > > > > there may still some potential dependency could happen between threads,
> > > > > > such as using another kernel workqueue and waiting on it (in principle
> > > > > > at least) since userspace program can call any syscall in principle (
> > > > > > which doesn't like in-kernel drivers). So I think it can cause some
> > > > > > risk due to generic userspace block device restriction, please kindly
> > > > > > correct me if I'm wrong.
> > > > > 
> > > > > Not sure what you mean with all this. prctl() works per process/thread
> > > > > and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> > > > > set. So for the case of a user block device driver, setting this means
> > > > > that it cannot reenter itself during a memory allocation, regardless of
> > > > > the system call it executes (FS etc): all memory allocations in any
> > > > > syscall executed by the context will have GFP_NOIO.
> > > > 
> > > > I mean,
> > > > 
> > > > assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
> > > > but since it can call any valid system call, therefore, after it
> > > > received data due to direct reclaim and writeback, it is still
> > > > allowed to call some system call which may do something as follows:
> > > > 
> > > >    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> > > > 
> > > >    (call some syscall which)
> > > > 
> > > >    submit something to Thread B
> > > >                                   
> > > >                                   ... (do something)
> > > > 
> > > >                                   memory allocation with GFP_KERNEL (it
> > > >                                   may trigger direct memory reclaim
> > > >                                   again and reenter the original fs.)
> > > > 
> > > >                                   wake up Thread A
> > > > 
> > > >    wait Thread B to complete
> > > > 
> > > > Normally such system call won't cause any problem since userspace
> > > > programs cannot be in a context out of writeback and direct reclaim.
> > > > Yet I'm not sure if it works under userspace block driver
> > > > writeback/direct reclaim cases.
> > > 
> > > Hi Gao Xiang,
> > > 
> > > I'd rather to reply you in this original thread, and the recent
> > > discussion is from the following link:
> > > 
> > > https://lore.kernel.org/linux-block/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/
> > > 
> > > kernel loop & nbd is really in the same situation.
> > > 
> > > For example of kernel loop, PF_MEMALLOC_NOIO is added in commit
> > > d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"),
> > > so loop's worker thread can be thought as the above Thread A, and
> > > of course, writeback/swapout IO can reach the loop worker thread(
> > > the above Thread A), then loop just calls into FS from the worker
> > > thread for handling the loop IO, that is same with user space driver's
> > > case, and the kernel 'thread B' should be in FS code.
> > > 
> > > Your theory might be true, but it does depend on FS's implementation,
> > > and we don't see such report in reality.
> > > 
> > > Also you didn't mentioned that what kernel thread B exactly is? And what
> > > the allocation is in kernel thread B.
> > > 
> > > If you have actual report, I am happy to take account into it, otherwise not
> > > sure if it is worth of time/effort in thinking/addressing one pure theoretical
> > > concern.
> > 
> > Hi Ming,
> > 
> > Thanks for your look & reply.
> > 
> > That is not a wild guess. That is a basic difference between
> > in-kernel native block-based drivers and user-space block drivers.
> 
> Please look at my comment, wrt. your pure theoretical concern, userspace
> block driver is same with kernel loop/nbd.

Hi Ming,

I don't have time to audit some potential risky system call, but I guess
security folks or researchers may be interested in finding such path.

The big problem is, you cannot avoid people to write such system call (or 
ioctls) in their user daemon, since most system call (or ioctls)
implementation assumes that they're never called under the kernel memory
direct reclaim context (even with PR_SET_IO_FLUSHER) but userspace block
driver can give such context to userspace and user problems can do
whatever they do in principle.

IOWs, we can audit in-kernel block drivers and fix all buggy paths with
GFP_NOIO since the source code is already there and they should be fixed.

But you have no way to audit all user programs to call proper system calls
or random ioctls which can be safely worked in the direct reclaim context
(even with PR_SET_IO_FLUSHER).

> 
> Did you see such report on loop & nbd? Can you answer my questions wrt.
> kernel thread B?

I don't think it has some relationship with in-kernel loop device, since
the loop device I/O paths are all under control.

> 
> > 
> > That is userspace block driver can call _any_ system call if they want.
> > Since users can call any system call and any _new_ system call can be
> > introduced later, you have to audit all system calls "Which are safe
> > and which are _not_ safe" all the time. Otherwise, attacker can make
> 
> Isn't nbd server capable of calling any system call? Is there any
> security risk for nbd?

Note that I wrote this email initially as a generic concern (prior to your
ubd annoucement ), so that isn't related to your ubd from my POV.

> 
> > use of it to hung the system if such userspace driver is used widely.
> 
> >From the beginning, only ADMIN can create ubd, that is same with
> nbd/loop, and it gets default permission as disk device.

loop device is different since the path can be totally controlled by the
kernel.

> 
> ubd is really in same situation with nbd wrt. security, the only difference
> is just that nbd uses socket for communication, and ubd uses io_uring, that
> is all.
> 
> Yeah, Stefan Hajnoczi and I discussed to make ubd as one container
> block device, so normal user can create & use ubd, but it won't be done
> from the beginning, and won't be enabled until the potential security
> risks are addressed, and there should be more limits on ubd when normal user
> can create & use it, such as:
> 
> - not allow unprivileged ubd device to be mounted
> - not allow unprivileged ubd device's partition table to be read from
>   kernel
> - not support buffered io for unprivileged ubd device, and only direct io
>   is allowed

How could you do that? I think it needs a wide modification to mm/fs.
and how about mmap I/O?

> - maybe more limit for minimizing security risk.
> 
> > 
> > IOWs, in my humble opinion, that is quite a fundamental security
> > concern of all userspace block drivers.
> 
> But nbd is still there and widely used, and there are lots of people who
> shows interest in userspace block device. Then think about who is wrong?
> 
> As one userspace block driver, it is normal to see some limits there,
> but I don't agree that there is fundamental security issue.

That depends, if you think it's a real security issue that there could be
a path reported to public to trigger that after it's widely used, that is
fine.

> 
> > 
> > Actually, you cannot ignore block I/O requests if they actually push
> 
> Who wants to ignore block I/O? And why ignore it?

I don't know how to express that properly. Sorry for my bad English.

For example, userspace FS implementation can ignore any fs operations
triggered under direct reclaim.

But if you runs a userspace block driver under a random fs, they will
just send data & metadata I/O to your driver unconditionally. I think
that is too late to avoid such deadlock.

> 
> > into block layer, since that is too late if I/O actually is submitted
> > by some FS. And you don't even know which type of such I/O is.
> 
> We do know the I/O type.

1) you don't know meta or data I/O. I know there is a REQ_META, but
   that is not a strict mark.

2) even you know an I/O is under direct reclaim, how to deal with that?
  just send to userspace unconditionally?

> 
> > 
> > On the other side, user-space FS implementations can avoid this since
> > FS can know if under direct reclaim and don't do such I/O requests.
> 
> But it is nothing to do with userspace block device.

Anyway, it is just a random side note for this topic
"block drivers in user space"

I'm just expressing my own concern about such architecture, you could
ignore my concern above of course.

Thanks,
Gao Xiang


> 
> 
> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-06-09  4:55                 ` Gao Xiang
@ 2022-06-10  1:52                   ` Ming Lei
  0 siblings, 0 replies; 54+ messages in thread
From: Ming Lei @ 2022-06-10  1:52 UTC (permalink / raw)
  To: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	Pavel Machek, linux-fsdevel

On Thu, Jun 09, 2022 at 12:55:59PM +0800, Gao Xiang wrote:
> On Thu, Jun 09, 2022 at 12:06:48PM +0800, Ming Lei wrote:
> > On Thu, Jun 09, 2022 at 10:28:02AM +0800, Gao Xiang wrote:
> > > On Thu, Jun 09, 2022 at 10:01:23AM +0800, Ming Lei wrote:
> > > > On Thu, Feb 24, 2022 at 08:58:33AM +0800, Gao Xiang wrote:
> > > > > On Thu, Feb 24, 2022 at 07:40:47AM +0900, Damien Le Moal wrote:
> > > > > > On 2/23/22 17:11, Gao Xiang wrote:
> > > > > > > On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
> > > > > > >> On 2/23/22 14:57, Gao Xiang wrote:
> > > > > > >>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
> > > > > > >>>> I'd like to discuss an interface to implement user space block devices,
> > > > > > >>>> while avoiding local network NBD solutions.  There has been reiterated
> > > > > > >>>> interest in the topic, both from researchers [1] and from the community,
> > > > > > >>>> including a proposed session in LSFMM2018 [2] (though I don't think it
> > > > > > >>>> happened).
> > > > > > >>>>
> > > > > > >>>> I've been working on top of the Google iblock implementation to find
> > > > > > >>>> something upstreamable and would like to present my design and gather
> > > > > > >>>> feedback on some points, in particular zero-copy and overall user space
> > > > > > >>>> interface.
> > > > > > >>>>
> > > > > > >>>> The design I'm pending towards uses special fds opened by the driver to
> > > > > > >>>> transfer data to/from the block driver, preferably through direct
> > > > > > >>>> splicing as much as possible, to keep data only in kernel space.  This
> > > > > > >>>> is because, in my use case, the driver usually only manipulates
> > > > > > >>>> metadata, while data is forwarded directly through the network, or
> > > > > > >>>> similar. It would be neat if we can leverage the existing
> > > > > > >>>> splice/copy_file_range syscalls such that we don't ever need to bring
> > > > > > >>>> disk data to user space, if we can avoid it.  I've also experimented
> > > > > > >>>> with regular pipes, But I found no way around keeping a lot of pipes
> > > > > > >>>> opened, one for each possible command 'slot'.
> > > > > > >>>>
> > > > > > >>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
> > > > > > >>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
> > > > > > >>>
> > > > > > >>> I'm interested in this general topic too. One of our use cases is
> > > > > > >>> that we need to process network data in some degree since many
> > > > > > >>> protocols are application layer protocols so it seems more reasonable
> > > > > > >>> to process such protocols in userspace. And another difference is that
> > > > > > >>> we may have thousands of devices in a machine since we'd better to run
> > > > > > >>> containers as many as possible so the block device solution seems
> > > > > > >>> suboptimal to us. Yet I'm still interested in this topic to get more
> > > > > > >>> ideas.
> > > > > > >>>
> > > > > > >>> Btw, As for general userspace block device solutions, IMHO, there could
> > > > > > >>> be some deadlock issues out of direct reclaim, writeback, and userspace
> > > > > > >>> implementation due to writeback user requests can be tripped back to
> > > > > > >>> the kernel side (even the dependency crosses threads). I think they are
> > > > > > >>> somewhat hard to fix with user block device solutions. For example,
> > > > > > >>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@mail.gmail.com
> > > > > > >>
> > > > > > >> This is already fixed with prctl() support. See:
> > > > > > >>
> > > > > > >> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@redhat.com/
> > > > > > > 
> > > > > > > As I mentioned above, IMHO, we could add some per-task state to avoid
> > > > > > > the majority of such deadlock cases (also what I mentioned above), but
> > > > > > > there may still some potential dependency could happen between threads,
> > > > > > > such as using another kernel workqueue and waiting on it (in principle
> > > > > > > at least) since userspace program can call any syscall in principle (
> > > > > > > which doesn't like in-kernel drivers). So I think it can cause some
> > > > > > > risk due to generic userspace block device restriction, please kindly
> > > > > > > correct me if I'm wrong.
> > > > > > 
> > > > > > Not sure what you mean with all this. prctl() works per process/thread
> > > > > > and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
> > > > > > set. So for the case of a user block device driver, setting this means
> > > > > > that it cannot reenter itself during a memory allocation, regardless of
> > > > > > the system call it executes (FS etc): all memory allocations in any
> > > > > > syscall executed by the context will have GFP_NOIO.
> > > > > 
> > > > > I mean,
> > > > > 
> > > > > assuming PR_SET_IO_FLUSHER is already set on Thread A by using prctl,
> > > > > but since it can call any valid system call, therefore, after it
> > > > > received data due to direct reclaim and writeback, it is still
> > > > > allowed to call some system call which may do something as follows:
> > > > > 
> > > > >    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> > > > > 
> > > > >    (call some syscall which)
> > > > > 
> > > > >    submit something to Thread B
> > > > >                                   
> > > > >                                   ... (do something)
> > > > > 
> > > > >                                   memory allocation with GFP_KERNEL (it
> > > > >                                   may trigger direct memory reclaim
> > > > >                                   again and reenter the original fs.)
> > > > > 
> > > > >                                   wake up Thread A
> > > > > 
> > > > >    wait Thread B to complete
> > > > > 
> > > > > Normally such system call won't cause any problem since userspace
> > > > > programs cannot be in a context out of writeback and direct reclaim.
> > > > > Yet I'm not sure if it works under userspace block driver
> > > > > writeback/direct reclaim cases.
> > > > 
> > > > Hi Gao Xiang,
> > > > 
> > > > I'd rather to reply you in this original thread, and the recent
> > > > discussion is from the following link:
> > > > 
> > > > https://lore.kernel.org/linux-block/Yp1jRw6kiUf5jCrW@B-P7TQMD6M-0146.local/
> > > > 
> > > > kernel loop & nbd is really in the same situation.
> > > > 
> > > > For example of kernel loop, PF_MEMALLOC_NOIO is added in commit
> > > > d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread"),
> > > > so loop's worker thread can be thought as the above Thread A, and
> > > > of course, writeback/swapout IO can reach the loop worker thread(
> > > > the above Thread A), then loop just calls into FS from the worker
> > > > thread for handling the loop IO, that is same with user space driver's
> > > > case, and the kernel 'thread B' should be in FS code.
> > > > 
> > > > Your theory might be true, but it does depend on FS's implementation,
> > > > and we don't see such report in reality.
> > > > 
> > > > Also you didn't mentioned that what kernel thread B exactly is? And what
> > > > the allocation is in kernel thread B.
> > > > 
> > > > If you have actual report, I am happy to take account into it, otherwise not
> > > > sure if it is worth of time/effort in thinking/addressing one pure theoretical
> > > > concern.
> > > 
> > > Hi Ming,
> > > 
> > > Thanks for your look & reply.
> > > 
> > > That is not a wild guess. That is a basic difference between
> > > in-kernel native block-based drivers and user-space block drivers.
> > 
> > Please look at my comment, wrt. your pure theoretical concern, userspace
> > block driver is same with kernel loop/nbd.
> 
> Hi Ming,
> 
> I don't have time to audit some potential risky system call, but I guess
> security folks or researchers may be interested in finding such path.

Why do you think system call has potential risk? Isn't syscall designed
for userspace? Any syscall called from the userspace context is covered
by PR_SET_IO_FLUSHER, and your concern is just in Kernel thread B,
right?

If yes, let's focus on this scenario, so I posted it one more time:

>    Thread A (PR_SET_IO_FLUSHER)   Kernel thread B (another context)
> 
>    (call some syscall which)
> 
>    submit something to Thread B
>                                   
>                                   ... (do something)
> 
>                                   memory allocation with GFP_KERNEL (it
>                                   may trigger direct memory reclaim
>                                   again and reenter the original fs.)
> 
>                                   wake up Thread A
> 
>    wait Thread B to complete

You didn't mention why normal writeback IO from other context won't call
into this kind of kernel thread B too, so can you explain it a bit?

As I said, both loop/nbd has same situation, for example of loop, thread
A is loop worker thread with PF_MEMALLOC_NOIO, and generic FS code(read,
write, fallocate, fsync, ...) is called into from the worker thread, so
there might be the so called kernel thread B for loop. But we don't see
such report.

Yeah, you may argue that other non-FS syscalls may be involved in
userspace driver. But in reality, userspace block driver should only deal
with FS and network IO most of times, and both network and FS code path
are already in normal IO code path for long time, so your direct claim
concern shouldn't be one problem. Not mention nbd/tcmu/... have been used
or long long time, so far so good. 

If you think it is real risk, please find it for nbd/tcmu/dm-multipath/...
first. IMO, it isn't useful to say there is such generic concern without
further investigation and without providing any detail, and devil is always
in details.

> 
> The big problem is, you cannot avoid people to write such system call (or 
> ioctls) in their user daemon, since most system call (or ioctls)
> implementation assumes that they're never called under the kernel memory
> direct reclaim context (even with PR_SET_IO_FLUSHER) but userspace block
> driver can give such context to userspace and user problems can do
> whatever they do in principle.
> 
> IOWs, we can audit in-kernel block drivers and fix all buggy paths with
> GFP_NOIO since the source code is already there and they should be fixed.
> 
> But you have no way to audit all user programs to call proper system calls
> or random ioctls which can be safely worked in the direct reclaim context
> (even with PR_SET_IO_FLUSHER).
> 
> > 
> > Did you see such report on loop & nbd? Can you answer my questions wrt.
> > kernel thread B?
> 
> I don't think it has some relationship with in-kernel loop device, since
> the loop device I/O paths are all under control.

No, it is completely same situation wrt. your concern, please look at the above
scenario.

> 
> > 
> > > 
> > > That is userspace block driver can call _any_ system call if they want.
> > > Since users can call any system call and any _new_ system call can be
> > > introduced later, you have to audit all system calls "Which are safe
> > > and which are _not_ safe" all the time. Otherwise, attacker can make
> > 
> > Isn't nbd server capable of calling any system call? Is there any
> > security risk for nbd?
> 
> Note that I wrote this email initially as a generic concern (prior to your
> ubd annoucement ), so that isn't related to your ubd from my POV.

OK, I guess I needn't to waste time on this 'generic concern'.

> 
> > 
> > > use of it to hung the system if such userspace driver is used widely.
> > 
> > >From the beginning, only ADMIN can create ubd, that is same with
> > nbd/loop, and it gets default permission as disk device.
> 
> loop device is different since the path can be totally controlled by the
> kernel.
> 
> > 
> > ubd is really in same situation with nbd wrt. security, the only difference
> > is just that nbd uses socket for communication, and ubd uses io_uring, that
> > is all.
> > 
> > Yeah, Stefan Hajnoczi and I discussed to make ubd as one container
> > block device, so normal user can create & use ubd, but it won't be done
> > from the beginning, and won't be enabled until the potential security
> > risks are addressed, and there should be more limits on ubd when normal user
> > can create & use it, such as:
> > 
> > - not allow unprivileged ubd device to be mounted
> > - not allow unprivileged ubd device's partition table to be read from
> >   kernel
> > - not support buffered io for unprivileged ubd device, and only direct io
> >   is allowed
> 
> How could you do that? I think it needs a wide modification to mm/fs.
> and how about mmap I/O?

Firstly mount isn't allowed, then we can deal with mmap on def_blk_fops, and
only allow open with O_DIRECT.

> 
> > - maybe more limit for minimizing security risk.
> > 
> > > 
> > > IOWs, in my humble opinion, that is quite a fundamental security
> > > concern of all userspace block drivers.
> > 
> > But nbd is still there and widely used, and there are lots of people who
> > shows interest in userspace block device. Then think about who is wrong?
> > 
> > As one userspace block driver, it is normal to see some limits there,
> > but I don't agree that there is fundamental security issue.
> 
> That depends, if you think it's a real security issue that there could be
> a path reported to public to trigger that after it's widely used, that is
> fine.

But nbd/tcmu is widely used already...

> 
> > 
> > > 
> > > Actually, you cannot ignore block I/O requests if they actually push
> > 
> > Who wants to ignore block I/O? And why ignore it?
> 
> I don't know how to express that properly. Sorry for my bad English.
> 
> For example, userspace FS implementation can ignore any fs operations
> triggered under direct reclaim.
> 
> But if you runs a userspace block driver under a random fs, they will
> just send data & metadata I/O to your driver unconditionally. I think
> that is too late to avoid such deadlock.

What is the deadlock? Is that triggered with your kernel thread B deadlock?

> 
> > 
> > > into block layer, since that is too late if I/O actually is submitted
> > > by some FS. And you don't even know which type of such I/O is.
> > 
> > We do know the I/O type.
> 
> 1) you don't know meta or data I/O. I know there is a REQ_META, but
>    that is not a strict mark.
> 
> 2) even you know an I/O is under direct reclaim, how to deal with that?
>   just send to userspace unconditionally?

All block driver doesn't care REQ_META, why is it special for userspace
block driver?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [LSF/MM/BPF TOPIC] block drivers in user space
  2022-06-09  4:06               ` Ming Lei
  2022-06-09  4:55                 ` Gao Xiang
@ 2022-07-28  8:23                 ` Pavel Machek
  1 sibling, 0 replies; 54+ messages in thread
From: Pavel Machek @ 2022-07-28  8:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Damien Le Moal, Gabriel Krisman Bertazi, lsf-pc, linux-block,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

Hi!

> > > If you have actual report, I am happy to take account into it, otherwise not
> > > sure if it is worth of time/effort in thinking/addressing one pure theoretical
> > > concern.
> > 
> > Hi Ming,
> > 
> > Thanks for your look & reply.
> > 
> > That is not a wild guess. That is a basic difference between
> > in-kernel native block-based drivers and user-space block drivers.
> 
> Please look at my comment, wrt. your pure theoretical concern, userspace
> block driver is same with kernel loop/nbd.
> 
> Did you see such report on loop & nbd? Can you answer my questions wrt.
> kernel thread B?

Yes, nbd is known to deadlock under high loads. Don't do that.

									Pavel
-- 
People of Russia, stop Putin before his war on Ukraine escalates.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2022-07-28  8:23 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-21 19:59 [LSF/MM/BPF TOPIC] block drivers in user space Gabriel Krisman Bertazi
2022-02-21 23:16 ` Damien Le Moal
2022-02-21 23:30   ` Gabriel Krisman Bertazi
2022-02-22  6:57 ` Hannes Reinecke
2022-02-22 14:46   ` Sagi Grimberg
2022-02-22 17:46     ` Hannes Reinecke
2022-02-22 18:05     ` Gabriel Krisman Bertazi
2022-02-24  9:37       ` Xiaoguang Wang
2022-02-24 10:12       ` Sagi Grimberg
2022-03-01 23:24         ` Khazhy Kumykov
2022-03-02 16:16         ` Mike Christie
2022-03-13 21:15           ` Sagi Grimberg
2022-03-14 17:12             ` Mike Christie
2022-03-15  8:03               ` Sagi Grimberg
2022-03-14 19:21             ` Bart Van Assche
2022-03-15  6:52               ` Hannes Reinecke
2022-03-15  8:08                 ` Sagi Grimberg
2022-03-15  8:12                   ` Christoph Hellwig
2022-03-15  8:38                     ` Sagi Grimberg
2022-03-15  8:42                       ` Christoph Hellwig
2022-03-23 19:42                       ` Gabriel Krisman Bertazi
2022-03-24 17:05                         ` Sagi Grimberg
2022-03-15  8:04               ` Sagi Grimberg
2022-02-22 18:05   ` Bart Van Assche
2022-03-02 23:04   ` Gabriel Krisman Bertazi
2022-03-03  7:17     ` Hannes Reinecke
2022-03-27 16:35   ` Ming Lei
2022-03-28  5:47     ` Kanchan Joshi
2022-03-28  5:48     ` Hannes Reinecke
2022-03-28 20:20     ` Gabriel Krisman Bertazi
2022-03-29  0:30       ` Ming Lei
2022-03-29 17:20         ` Gabriel Krisman Bertazi
2022-03-30  1:55           ` Ming Lei
2022-03-30 18:22             ` Gabriel Krisman Bertazi
2022-03-31  1:38               ` Ming Lei
2022-03-31  3:49                 ` Bart Van Assche
2022-04-08  6:52     ` Xiaoguang Wang
2022-04-08  7:44       ` Ming Lei
2022-02-23  5:57 ` Gao Xiang
2022-02-23  7:46   ` Damien Le Moal
2022-02-23  8:11     ` Gao Xiang
2022-02-23 22:40       ` Damien Le Moal
2022-02-24  0:58         ` Gao Xiang
2022-06-09  2:01           ` Ming Lei
2022-06-09  2:28             ` Gao Xiang
2022-06-09  4:06               ` Ming Lei
2022-06-09  4:55                 ` Gao Xiang
2022-06-10  1:52                   ` Ming Lei
2022-07-28  8:23                 ` Pavel Machek
2022-03-02 16:52 ` Mike Christie
2022-03-03  7:09   ` Hannes Reinecke
2022-03-14 17:04     ` Mike Christie
2022-03-15  6:45       ` Hannes Reinecke
2022-03-05  7:29 ` Dongsheng Yang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.