* [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
@ 2023-02-06 15:00 Ming Lei
  2023-02-06 17:53 ` Hannes Reinecke
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Ming Lei @ 2023-02-06 15:00 UTC (permalink / raw)
  To: linux-block, lsf-pc
  Cc: ming.lei, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, Stefan Hajnoczi, ZiyangZhang

Hello,

So far UBLK has only been used for implementing virtual block devices from
userspace, such as loop, nbd, qcow2, ...[1].

It could be useful for UBLK to cover real storage hardware too:

- for fast prototyping or performance evaluation

- some network storage is attached to the host, such as iscsi and nvme-tcp;
the current UBLK interface doesn't support such devices, since they need
all LUNs/Namespaces to share host resources (such as tags)

- SPDK already supports userspace drivers for real hardware

So I propose to extend UBLK to support real hardware devices:

1) extend the UBLK ABI to support disks attached to a shared host, such
as SCSI LUNs/NVMe Namespaces

2) the following points are about operating hardware from userspace, so
the userspace driver has to be trusted, root is required, and
unprivileged UBLK devices can't be supported

3) how to operate the hardware memory space
- unbind the kernel driver and rebind with uio/vfio
- map the PCI BAR into userspace[2], then userspace can operate the
hardware via MMIO on the mapped user address

4) DMA
- DMA requires physical memory addresses, and the UBLK driver already has
the block request pages, so can we export the request SG list (each
segment's physical address, offset and length) to userspace (see the
sketch after this list)? If the max_segments limit is not too big (<=64),
the buffer needed to hold the SG list can be kept small.

- a small amount of physical memory to be used for DMA descriptors can be
pre-allocated from userspace; the kernel is asked to pin the pages and
returns the physical addresses to userspace for programming DMA

- this way is still zero copy

5) notification from hardware: interrupt or polling
- SPDK uses userspace polling; this is doable but eats CPU, so it is
only one of the options

- io_uring command has proven to be very efficient; if io_uring
command is applied to uio/vfio for delivering interrupts (in a similar
way to how UBLK forwards blk io commands from kernel to userspace),
that should be efficient too, given that batch processing is done after
the io_uring command completes

- or it could be made flexible with hybrid interrupt & polling, given
that a single-pthread-per-queue userspace implementation can retrieve
all kinds of inflight IO info very cheaply, and maybe some ML model
could even be applied to learn & predict when IO will complete

6) others?
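
For 4), a rough sketch of what the exported per-request SG list might
look like (purely hypothetical layout, all names invented just for
illustration; nothing like this exists in the ublk ABI today):

```c
#include <linux/types.h>

/*
 * Hypothetical ABI sketch, for discussion only: one entry per request
 * segment, filled by the ublk driver and read by the userspace driver
 * when it programs the DMA engine.
 */
struct ublk_sg_entry {
	__u64	phys_addr;	/* physical address of the segment */
	__u32	offset;		/* offset inside the segment */
	__u32	len;		/* segment length in bytes */
};

/*
 * With max_segments <= 64, the whole list is 8 + 64 * 16 = 1032 bytes,
 * so one such buffer per in-flight request stays small.
 */
struct ublk_sg_list {
	__u16			nr_segs;
	__u16			reserved[3];
	struct ublk_sg_entry	segs[64];
};
```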



[1] https://github.com/ming1/ubdsrv
[2] https://spdk.io/doc/userspace.html
 

Thanks, 
Ming



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
@ 2023-02-06 17:53 ` Hannes Reinecke
  2023-03-08  8:50   ` Hans Holmberg
  2023-02-06 18:26 ` Bart Van Assche
  2023-02-06 20:27 ` Stefan Hajnoczi
  2 siblings, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2023-02-06 17:53 UTC (permalink / raw)
  To: Ming Lei, linux-block, lsf-pc
  Cc: Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling,
	hch, Stefan Hajnoczi, ZiyangZhang

On 2/6/23 16:00, Ming Lei wrote:
> Hello,
> 
> So far UBLK is only used for implementing virtual block device from
> userspace, such as loop, nbd, qcow2, ...[1].
> 
> It could be useful for UBLK to cover real storage hardware too:
> 
> - for fast prototype or performance evaluation
> 
> - some network storages are attached to host, such as iscsi and nvme-tcp,
> the current UBLK interface doesn't support such devices, since it needs
> all LUNs/Namespaces to share host resources(such as tag)
> 
> - SPDK has supported user space driver for real hardware
> 
> So propose to extend UBLK for supporting real hardware device:
> 
> 1) extend UBLK ABI interface to support disks attached to host, such
> as SCSI Luns/NVME Namespaces
> 
> 2) the followings are related with operating hardware from userspace,
> so userspace driver has to be trusted, and root is required, and
> can't support unprivileged UBLK device
> 
> 3) how to operating hardware memory space
> - unbind kernel driver and rebind with uio/vfio
> - map PCI BAR into userspace[2], then userspace can operate hardware
> with mapped user address via MMIO
> 
> 4) DMA
> - DMA requires physical memory address, UBLK driver actually has
> block request pages, so can we export request SG list(each segment
> physical address, offset, len) into userspace? If the max_segments
> limit is not too big(<=64), the needed buffer for holding SG list
> can be small enough.
> 
> - small amount of physical memory for using as DMA descriptor can be
> pre-allocated from userspace, and ask kernel to pin pages, then still
> return physical address to userspace for programming DMA
> 
> - this way is still zero copy
> 
> 5) notification from hardware: interrupt or polling
> - SPDK applies userspace polling, this way is doable, but
> eat CPU, so it is only one choice
> 
> - io_uring command has been proved as very efficient, if io_uring
> command is applied(similar way with UBLK for forwarding blk io
> command from kernel to userspace) to uio/vfio for delivering interrupt,
> which should be efficient too, given batching processes are done after
> the io_uring command is completed
> 
> - or it could be flexible by hybrid interrupt & polling, given
> userspace single pthread/queue implementation can retrieve all
> kinds of inflight IO info in very cheap way, and maybe it is likely
> to apply some ML model to learn & predict when IO will be completed
> 
> 6) others?
> 
> 
Good idea.
I'd love to have this discussion.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
  2023-02-06 17:53 ` Hannes Reinecke
@ 2023-02-06 18:26 ` Bart Van Assche
  2023-02-08  1:38   ` Ming Lei
  2023-02-06 20:27 ` Stefan Hajnoczi
  2 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2023-02-06 18:26 UTC (permalink / raw)
  To: Ming Lei, linux-block, lsf-pc
  Cc: Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling,
	hch, Stefan Hajnoczi, ZiyangZhang

On 2/6/23 07:00, Ming Lei wrote:
> 4) DMA
> - DMA requires physical memory address, UBLK driver actually has
> block request pages, so can we export request SG list(each segment
> physical address, offset, len) into userspace? If the max_segments
> limit is not too big(<=64), the needed buffer for holding SG list
> can be small enough.
> 
> - small amount of physical memory for using as DMA descriptor can be
> pre-allocated from userspace, and ask kernel to pin pages, then still
> return physical address to userspace for programming DMA
> 
> - this way is still zero copy

Would it be possible to use vfio in such a way that zero-copy
functionality is achieved? I'm concerned about the code duplication that
would result if a new interface similar to vfio is introduced.

In case it wouldn't be clear, I'm also interested in this topic.

Bart.


* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
  2023-02-06 17:53 ` Hannes Reinecke
  2023-02-06 18:26 ` Bart Van Assche
@ 2023-02-06 20:27 ` Stefan Hajnoczi
  2023-02-08  2:12   ` Ming Lei
  2 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-06 20:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang


On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> Hello,
> 
> So far UBLK is only used for implementing virtual block device from
> userspace, such as loop, nbd, qcow2, ...[1].

I won't be at LSF/MM so here are my thoughts:

> 
> It could be useful for UBLK to cover real storage hardware too:
> 
> - for fast prototype or performance evaluation
> 
> - some network storages are attached to host, such as iscsi and nvme-tcp,
> the current UBLK interface doesn't support such devices, since it needs
> all LUNs/Namespaces to share host resources(such as tag)

Can you explain this in more detail? It seems like an iSCSI or
NVMe-over-TCP initiator could be implemented as a ublk server today.
What am I missing?

> 
> - SPDK has supported user space driver for real hardware

I think this could already be implemented today. There will be extra
memory copies because SPDK won't have access to the application's memory
pages.

> 
> So propose to extend UBLK for supporting real hardware device:
> 
> 1) extend UBLK ABI interface to support disks attached to host, such
> as SCSI Luns/NVME Namespaces
> 
> 2) the followings are related with operating hardware from userspace,
> so userspace driver has to be trusted, and root is required, and
> can't support unprivileged UBLK device

Linux VFIO provides a safe userspace API for userspace device drivers.
That means memory and interrupts are isolated. Neither userspace nor the
hardware device can access memory or interrupts that the userspace
process is not allowed to access.

I think there are still limitations like all memory pages exposed to the
device need to be pinned. So effectively you might still need privileges
to get the mlock resource limits.

But overall I think what you're saying about root and unprivileged ublk
devices is not true. Hardware support should be developed with the goal
of supporting unprivileged userspace ublk servers.

Those unprivileged userspace ublk servers cannot claim any PCI device
they want. The user/admin will need to give them permission to open a
network card, SCSI HBA, etc.

> 
> 3) how to operating hardware memory space
> - unbind kernel driver and rebind with uio/vfio
> - map PCI BAR into userspace[2], then userspace can operate hardware
> with mapped user address via MMIO
>
> 4) DMA
> - DMA requires physical memory address, UBLK driver actually has
> block request pages, so can we export request SG list(each segment
> physical address, offset, len) into userspace? If the max_segments
> limit is not too big(<=64), the needed buffer for holding SG list
> can be small enough.

DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
address. The IOVA space is defined by the IOMMU page tables. Userspace
controls the IOMMU page tables via Linux VFIO ioctls.

For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
IOMMU mapping that makes a range of userspace virtual addresses
available at a given IOVA.
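
A minimal sketch of such a mapping (assuming container_fd is a VFIO
container that already has the type1 IOMMU set; group setup and error
handling omitted):

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Make [vaddr, vaddr + size) of this process visible to the device at
 * 'iova' in its IOMMU address space. The pages get pinned as a side
 * effect, which is where the mlock resource limits come in.
 */
static int dma_map(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)vaddr,
		.iova  = iova,
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```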

Mapping and unmapping operations are not free. Similar to mmap(2), the
program will be slow if it does this frequently.

I think it's effectively the same problem as ublk zero-copy. We want to
give the ublk server access to just the I/O buffers that it currently
needs, but doing so would be expensive :(.

I think Linux has strategies for avoiding the expense like
iommu.strict=0 and swiotlb. The drawback is that in our case userspace
and/or the hardware device controlled by userspace would still have
access to the memory pages after I/O has completed. This reduces memory
isolation :(.

DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.

What I'm trying to get at is that either memory isolation is compromised
or performance is reduced. It's hard to have good performance together
with memory isolation.

I think ublk should follow the VFIO philosophy of being a safe
kernel/userspace interface. If userspace is malicious or buggy, the
kernel's and other process' memory should not be corrupted.

> 
> - small amount of physical memory for using as DMA descriptor can be
> pre-allocated from userspace, and ask kernel to pin pages, then still
> return physical address to userspace for programming DMA

I think this is possible today. The ublk server owns the I/O buffers. It
can mlock them and DMA map them via VFIO. ublk doesn't need to know
anything about this.

> - this way is still zero copy

True zero-copy would be when an application does O_DIRECT I/O and the
hardware device DMAs to/from the application's memory pages. ublk
doesn't do that today and when combined with VFIO it doesn't get any
easier. I don't think it's possible because you cannot allow userspace
to control a hardware device and grant DMA access to pages that
userspace isn't allowed to access. A malicious userspace will program
the device to access those pages :).

> 
> 5) notification from hardware: interrupt or polling
> - SPDK applies userspace polling, this way is doable, but
> eat CPU, so it is only one choice
> 
> - io_uring command has been proved as very efficient, if io_uring
> command is applied(similar way with UBLK for forwarding blk io
> command from kernel to userspace) to uio/vfio for delivering interrupt,
> which should be efficient too, given batching processes are done after
> the io_uring command is completed

I wonder how much difference there is between the new io_uring command
for receiving VFIO irqs that you are suggesting compared to the existing
io_uring approach of IORING_OP_READ on an eventfd.
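
The existing approach is roughly the following (liburing sketch; ring
setup omitted, and irq_efd is assumed to be the eventfd registered with
VFIO_DEVICE_SET_IRQS):

```c
#include <stdint.h>
#include <liburing.h>

/* Queue a read of the irq eventfd; the CQE arrives once the interrupt
 * has fired at least once, and 'count' holds the number of events.
 */
static void queue_irq_wait(struct io_uring *ring, int irq_efd, uint64_t *count)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_read(sqe, irq_efd, count, sizeof(*count), 0);
	io_uring_submit(ring);
}
```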

> - or it could be flexible by hybrid interrupt & polling, given
> userspace single pthread/queue implementation can retrieve all
> kinds of inflight IO info in very cheap way, and maybe it is likely
> to apply some ML model to learn & predict when IO will be completed

Stefano Garzarella and I have discussed but not yet attempted to add a
userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY
would be useful together with IORING_SETUP_IOPOLL. That way kernel
polling can be combined with userspace polling on a single CPU.

I'm not sure it's useful for ublk because you may not have any reason to
use IORING_SETUP_IOPOLL. But applications that have a Linux NVMe block
device open with IORING_SETUP_IOPOLL could use the new
IORING_OP_POLL_MEMORY command to also watch for activity on a VIRTIO or
VFIO PCI device or maybe just to get kicked by another userspace thread.

> 6) others?
> 
> 
> 
> [1] https://github.com/ming1/ubdsrv
> [2] https://spdk.io/doc/userspace.html
>  
> 
> Thanks, 
> Ming
> 


* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 18:26 ` Bart Van Assche
@ 2023-02-08  1:38   ` Ming Lei
  2023-02-08 18:02     ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-08  1:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, Stefan Hajnoczi, ZiyangZhang,
	ming.lei

On Mon, Feb 06, 2023 at 10:26:55AM -0800, Bart Van Assche wrote:
> On 2/6/23 07:00, Ming Lei wrote:
> > 4) DMA
> > - DMA requires physical memory address, UBLK driver actually has
> > block request pages, so can we export request SG list(each segment
> > physical address, offset, len) into userspace? If the max_segments
> > limit is not too big(<=64), the needed buffer for holding SG list
> > can be small enough.
> > 
> > - small amount of physical memory for using as DMA descriptor can be
> > pre-allocated from userspace, and ask kernel to pin pages, then still
> > return physical address to userspace for programming DMA
> > 
> > - this way is still zero copy
> 
> Would it be possible to use vfio in such a way that zero-copy
> functionality is achieved? I'm concerned about the code duplication that
> would result if a new interface similar to vfio is introduced.

Here I meant that we can export the physical addresses of the request SG
list from /dev/ublkb* to userspace, which can then program the DMA
controller using the exported physical addresses. This way, the userspace
driver can submit IO without entering the kernel, hence the high
performance.

This should be how SPDK/nvme-pci[1] is implemented, except that SPDK
allocates hugepages in order to get their physical addresses.

[1] https://spdk.io/doc/memory.html

Thanks,
Ming



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 20:27 ` Stefan Hajnoczi
@ 2023-02-08  2:12   ` Ming Lei
  2023-02-08 12:17     ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-08  2:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang, ming.lei

On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > Hello,
> > 
> > So far UBLK is only used for implementing virtual block device from
> > userspace, such as loop, nbd, qcow2, ...[1].
> 
> I won't be at LSF/MM so here are my thoughts:

Thanks for the thoughts, :-)

> 
> > 
> > It could be useful for UBLK to cover real storage hardware too:
> > 
> > - for fast prototype or performance evaluation
> > 
> > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > the current UBLK interface doesn't support such devices, since it needs
> > all LUNs/Namespaces to share host resources(such as tag)
> 
> Can you explain this in more detail? It seems like an iSCSI or
> NVMe-over-TCP initiator could be implemented as a ublk server today.
> What am I missing?

The current ublk can't do that yet, because the interface doesn't
support multiple ublk disks sharing a single host, which is exactly
the case for scsi and nvme.

> 
> > 
> > - SPDK has supported user space driver for real hardware
> 
> I think this could already be implemented today. There will be extra
> memory copies because SPDK won't have access to the application's memory
> pages.

Here I proposed zero copy, and the current SPDK nvme-pci implementation
doesn't have such an extra copy, per my understanding.

> 
> > 
> > So propose to extend UBLK for supporting real hardware device:
> > 
> > 1) extend UBLK ABI interface to support disks attached to host, such
> > as SCSI Luns/NVME Namespaces
> > 
> > 2) the followings are related with operating hardware from userspace,
> > so userspace driver has to be trusted, and root is required, and
> > can't support unprivileged UBLK device
> 
> Linux VFIO provides a safe userspace API for userspace device drivers.
> That means memory and interrupts are isolated. Neither userspace nor the
> hardware device can access memory or interrupts that the userspace
> process is not allowed to access.
> 
> I think there are still limitations like all memory pages exposed to the
> device need to be pinned. So effectively you might still need privileges
> to get the mlock resource limits.
> 
> But overall I think what you're saying about root and unprivileged ublk
> devices is not true. Hardware support should be developed with the goal
> of supporting unprivileged userspace ublk servers.
> 
> Those unprivileged userspace ublk servers cannot claim any PCI device
> they want. The user/admin will need to give them permission to open a
> network card, SCSI HBA, etc.

It depends on implementation, please see

	https://spdk.io/doc/userspace.html

	```
	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
	then follows along with the NVMe Specification to initialize the device,
	create queue pairs, and ultimately send I/O.
	```

The above way needs userspace to operate the hardware via the mapped BAR,
which can't be allowed for an unprivileged user.
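
Just to illustrate, the mapped-BAR style of access is something like the
following (sketch only; the sysfs path and BAR length are placeholders),
and even opening resource0 this way already needs root:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

/* Map BAR0 of a PCI device via sysfs; the NVMe registers can then be
 * read/written directly from userspace via MMIO.
 */
static volatile void *map_bar0(const char *resource0_path, size_t bar_len)
{
	void *bar;
	int fd = open(resource0_path, O_RDWR | O_SYNC);

	if (fd < 0)
		return NULL;
	bar = mmap(NULL, bar_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return bar == MAP_FAILED ? NULL : bar;
}
```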

> 
> > 
> > 3) how to operating hardware memory space
> > - unbind kernel driver and rebind with uio/vfio
> > - map PCI BAR into userspace[2], then userspace can operate hardware
> > with mapped user address via MMIO
> >
> > 4) DMA
> > - DMA requires physical memory address, UBLK driver actually has
> > block request pages, so can we export request SG list(each segment
> > physical address, offset, len) into userspace? If the max_segments
> > limit is not too big(<=64), the needed buffer for holding SG list
> > can be small enough.
> 
> DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> address. The IOVA space is defined by the IOMMU page tables. Userspace
> controls the IOMMU page tables via Linux VFIO ioctls.
> 
> For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> IOMMU mapping that makes a range of userspace virtual addresses
> available at a given IOVA.
> 
> Mapping and unmapping operations are not free. Similar to mmap(2), the
> program will be slow if it does this frequently.

Yeah, but SPDK shouldn't be using the vfio DMA interface, see:

https://spdk.io/doc/memory.html

they just program DMA directly with the physical addresses of pinned
hugepages.

> 
> I think it's effectively the same problem as ublk zero-copy. We want to
> give the ublk server access to just the I/O buffers that it currently
> needs, but doing so would be expensive :(.
> 
> I think Linux has strategies for avoiding the expense like
> iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> and/or the hardware device controller by userspace would still have
> access to the memory pages after I/O has completed. This reduces memory
> isolation :(.
> 
> DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.

Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.

> 
> What I'm trying to get at is that either memory isolation is compromised
> or performance is reduced. It's hard to have good performance together
> with memory isolation.
> 
> I think ublk should follow the VFIO philosophy of being a safe
> kernel/userspace interface. If userspace is malicious or buggy, the
> kernel's and other process' memory should not be corrupted.

It is a tradeoff between performance and isolation; that is why I
mentioned that directly programming hardware in userspace can be done by
root only.

> 
> > 
> > - small amount of physical memory for using as DMA descriptor can be
> > pre-allocated from userspace, and ask kernel to pin pages, then still
> > return physical address to userspace for programming DMA
> 
> I think this is possible today. The ublk server owns the I/O buffers. It
> can mlock them and DMA map them via VFIO. ublk doesn't need to know
> anything about this.

It depends on whether such a VFIO DMA mapping is required for each IO.
If it is, that won't help a high performance driver.

> 
> > - this way is still zero copy
> 
> True zero-copy would be when an application does O_DIRECT I/O and the
> hardware device DMAs to/from the application's memory pages. ublk
> doesn't do that today and when combined with VFIO it doesn't get any
> easier. I don't think it's possible because you cannot allow userspace
> to control a hardware device and grant DMA access to pages that
> userspace isn't allowed to access. A malicious userspace will program
> the device to access those pages :).

But that should be what SPDK nvme/pci is doing per the above links, :-)

> 
> > 
> > 5) notification from hardware: interrupt or polling
> > - SPDK applies userspace polling, this way is doable, but
> > eat CPU, so it is only one choice
> > 
> > - io_uring command has been proved as very efficient, if io_uring
> > command is applied(similar way with UBLK for forwarding blk io
> > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > which should be efficient too, given batching processes are done after
> > the io_uring command is completed
> 
> I wonder how much difference there is between the new io_uring command
> for receiving VFIO irqs that you are suggesting compared to the existing
> io_uring approach IORING_OP_READ eventfd.

eventfd needs extra read/write on the event fd, so more syscalls are
required.

> 
> > - or it could be flexible by hybrid interrupt & polling, given
> > userspace single pthread/queue implementation can retrieve all
> > kinds of inflight IO info in very cheap way, and maybe it is likely
> > to apply some ML model to learn & predict when IO will be completed
> 
> Stefano Garzarella and I have discussed but not yet attempted to add a
> userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY
> would be useful together with IORING_SETUP_IOPOLL. That way kernel
> polling can be combined with userspace polling on a single CPU.

Here I meant direct polling on the MMIO or on the DMA descriptors, so no
syscall is needed:

https://spdk.io/doc/userspace.html

```
Polling an NVMe device is fast because only host memory needs to be
read (no MMIO) to check a queue pair for a bit flip and technologies such
as Intel's DDIO will ensure that the host memory being checked is present
in the CPU cache after an update by the device.
```

With the above-mentioned direct DMA programming & this kind of polling,
handling IO won't require any syscalls, but the userspace has to be
trusted.
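
The kind of polling loop meant here is roughly the following (sketch of
the standard NVMe completion queue phase-tag check, not tied to any
particular driver):

```c
#include <stdbool.h>
#include <stdint.h>

/* NVMe completion queue entry (16 bytes); the device toggles the phase
 * tag (bit 0 of the status field) every time it posts a new entry, so
 * detecting a completion is a plain host-memory read, no MMIO.
 */
struct nvme_cqe {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t cid;
	uint16_t status;	/* bit 0 is the phase tag */
};

static inline bool cqe_ready(const volatile struct nvme_cqe *cqe,
			     uint16_t expected_phase)
{
	return (cqe->status & 1) == expected_phase;
}
```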

> 
> I'm not sure it's useful for ublk because you may not have any reason to
> use IORING_SETUP_IOPOLL. But applications that have an Linux NVMe block

I think it is reasonable for ublk to poll target io, which isn't different
from other polling cases, and it should help network recv, IMO.

So ublk is going to support io polling for target io only; it can't be
done for the io command.



Thanks, 
Ming



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-08  2:12   ` Ming Lei
@ 2023-02-08 12:17     ` Stefan Hajnoczi
  2023-02-13  3:47       ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-08 12:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang


On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > Hello,
> > > 
> > > So far UBLK is only used for implementing virtual block device from
> > > userspace, such as loop, nbd, qcow2, ...[1].
> > 
> > I won't be at LSF/MM so here are my thoughts:
> 
> Thanks for the thoughts, :-)
> 
> > 
> > > 
> > > It could be useful for UBLK to cover real storage hardware too:
> > > 
> > > - for fast prototype or performance evaluation
> > > 
> > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > the current UBLK interface doesn't support such devices, since it needs
> > > all LUNs/Namespaces to share host resources(such as tag)
> > 
> > Can you explain this in more detail? It seems like an iSCSI or
> > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > What am I missing?
> 
> The current ublk can't do that yet, because the interface doesn't
> support multiple ublk disks sharing single host, which is exactly
> the case of scsi and nvme.

Can you give an example that shows exactly where a problem is hit?

I took a quick look at the ublk source code and didn't spot a place
where it prevents a single ublk server process from handling multiple
devices.

Regarding "host resources(such as tag)", can the ublk server deal with
that in userspace? The Linux block layer doesn't have the concept of a
"host", that would come in at the SCSI/NVMe level that's implemented in
userspace.

I don't understand yet...

> 
> > 
> > > 
> > > - SPDK has supported user space driver for real hardware
> > 
> > I think this could already be implemented today. There will be extra
> > memory copies because SPDK won't have access to the application's memory
> > pages.
> 
> Here I proposed zero copy, and current SPDK nvme-pci implementation haven't
> such extra copy per my understanding.
> 
> > 
> > > 
> > > So propose to extend UBLK for supporting real hardware device:
> > > 
> > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > as SCSI Luns/NVME Namespaces
> > > 
> > > 2) the followings are related with operating hardware from userspace,
> > > so userspace driver has to be trusted, and root is required, and
> > > can't support unprivileged UBLK device
> > 
> > Linux VFIO provides a safe userspace API for userspace device drivers.
> > That means memory and interrupts are isolated. Neither userspace nor the
> > hardware device can access memory or interrupts that the userspace
> > process is not allowed to access.
> > 
> > I think there are still limitations like all memory pages exposed to the
> > device need to be pinned. So effectively you might still need privileges
> > to get the mlock resource limits.
> > 
> > But overall I think what you're saying about root and unprivileged ublk
> > devices is not true. Hardware support should be developed with the goal
> > of supporting unprivileged userspace ublk servers.
> > 
> > Those unprivileged userspace ublk servers cannot claim any PCI device
> > they want. The user/admin will need to give them permission to open a
> > network card, SCSI HBA, etc.
> 
> It depends on implementation, please see
> 
> 	https://spdk.io/doc/userspace.html
> 
> 	```
> 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> 	then follows along with the NVMe Specification to initialize the device,
> 	create queue pairs, and ultimately send I/O.
> 	```
> 
> The above way needs userspace to operating hardware by the mapped BAR,
> which can't be allowed for unprivileged user.

From https://spdk.io/doc/system_configuration.html:

  Running SPDK as non-privileged user

  One of the benefits of using the VFIO Linux kernel driver is the
  ability to perform DMA operations with peripheral devices as
  unprivileged user. The permissions to access particular devices still
  need to be granted by the system administrator, but only on a one-time
  basis. Note that this functionality is supported with DPDK starting
  from version 18.11.

This is what I had described in my previous reply.

> 
> > 
> > > 
> > > 3) how to operating hardware memory space
> > > - unbind kernel driver and rebind with uio/vfio
> > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > with mapped user address via MMIO
> > >
> > > 4) DMA
> > > - DMA requires physical memory address, UBLK driver actually has
> > > block request pages, so can we export request SG list(each segment
> > > physical address, offset, len) into userspace? If the max_segments
> > > limit is not too big(<=64), the needed buffer for holding SG list
> > > can be small enough.
> > 
> > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > controls the IOMMU page tables via Linux VFIO ioctls.
> > 
> > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > IOMMU mapping that makes a range of userspace virtual addresses
> > available at a given IOVA.
> > 
> > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > program will be slow if it does this frequently.
> 
> Yeah, but SPDK shouldn't use vfio DMA interface, see:
> 
> https://spdk.io/doc/memory.html
> 
> they just programs DMA directly with physical address of pinned hugepages.

From the page you linked:

  IOMMU Support

  ...

  This is a future-proof, hardware-accelerated solution for performing
  DMA operations into and out of a user space process and forms the
  long-term foundation for SPDK and DPDK's memory management strategy.
  We highly recommend that applications are deployed using vfio and the
  IOMMU enabled, which is fully supported today.

Yes, SPDK supports running without IOMMU, but they recommend running
with the IOMMU.

> 
> > 
> > I think it's effectively the same problem as ublk zero-copy. We want to
> > give the ublk server access to just the I/O buffers that it currently
> > needs, but doing so would be expensive :(.
> > 
> > I think Linux has strategies for avoiding the expense like
> > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > and/or the hardware device controller by userspace would still have
> > access to the memory pages after I/O has completed. This reduces memory
> > isolation :(.
> > 
> > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> 
> Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.

When using VFIO (recommended by the docs), SPDK uses long-lived DMA
mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
mapping is used:
https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164

> 
> > 
> > What I'm trying to get at is that either memory isolation is compromised
> > or performance is reduced. It's hard to have good performance together
> > with memory isolation.
> > 
> > I think ublk should follow the VFIO philosophy of being a safe
> > kernel/userspace interface. If userspace is malicious or buggy, the
> > kernel's and other process' memory should not be corrupted.
> 
> It is tradeoff between performance and isolation, that is why I mention
> that directing programming hardware in userspace can be done by root
> only.

Yes, there is a trade-off. Over the years the use of unsafe approaches
has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
secure boot, integrity architecture, and stuff like that becomes more
widely used, it's harder to include features that break memory isolation
in software in mainstream distros. There can be an option to sacrifice
memory isolation for performance and some users may be willing to accept
the trade-off. I think it should be an optional feature though.

I did want to point out that the statement that "direct programming
hardware in userspace can be done by root only" is false (see VFIO).

> > 
> > > 
> > > - small amount of physical memory for using as DMA descriptor can be
> > > pre-allocated from userspace, and ask kernel to pin pages, then still
> > > return physical address to userspace for programming DMA
> > 
> > I think this is possible today. The ublk server owns the I/O buffers. It
> > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > anything about this.
> 
> It depends on if such VFIO DMA mapping is required for each IO. If it
> is required, that won't help one high performance driver.

It is not necessary to perform a DMA mapping for each IO. ublk's
existing model is sufficient:
1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
2. At runtime the ublk server provides these I/O buffers to the kernel,
   no further DMA mapping is required.

Unfortunately there's still the kernel<->userspace copy that existing
ublk applications have, but there's no new overhead related to VFIO.

> > 
> > > - this way is still zero copy
> > 
> > True zero-copy would be when an application does O_DIRECT I/O and the
> > hardware device DMAs to/from the application's memory pages. ublk
> > doesn't do that today and when combined with VFIO it doesn't get any
> > easier. I don't think it's possible because you cannot allow userspace
> > to control a hardware device and grant DMA access to pages that
> > userspace isn't allowed to access. A malicious userspace will program
> > the device to access those pages :).
> 
> But that should be what SPDK nvme/pci is doing per the above links, :-)

Sure, it's possible to break memory isolation. Breaking memory isolation
isn't specific to ublk servers that access hardware. The same unsafe
zero-copy approach would probably also work for regular ublk servers.
This is basically bringing back /dev/kmem :).

> 
> > 
> > > 
> > > 5) notification from hardware: interrupt or polling
> > > - SPDK applies userspace polling, this way is doable, but
> > > eat CPU, so it is only one choice
> > > 
> > > - io_uring command has been proved as very efficient, if io_uring
> > > command is applied(similar way with UBLK for forwarding blk io
> > > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > > which should be efficient too, given batching processes are done after
> > > the io_uring command is completed
> > 
> > I wonder how much difference there is between the new io_uring command
> > for receiving VFIO irqs that you are suggesting compared to the existing
> > io_uring approach IORING_OP_READ eventfd.
> 
> eventfd needs extra read/write on the event fd, so more syscalls are
> required.

No extra syscall is required because IORING_OP_READ is used to read the
eventfd, but maybe you were referring to bypassing the
file->f_op->read() code path?

Stefan


* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-08  1:38   ` Ming Lei
@ 2023-02-08 18:02     ` Bart Van Assche
  0 siblings, 0 replies; 34+ messages in thread
From: Bart Van Assche @ 2023-02-08 18:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, Stefan Hajnoczi, ZiyangZhang

On 2/7/23 17:38, Ming Lei wrote:
> Here I meant we can export physical address of request sg from
> /dev/ublkb* to userspace, which can program the DMA controller
> using exported physical address. With this way, the userspace driver
> can submit IO without entering kernel, then with high performance.

Hmm ... security experts might be very unhappy about allowing user space 
software to program iova addresses, PASIDs etc. in DMA controllers 
without having this data verified by the kernel. Additionally, hardware 
designers every now and then propose new device multiplexing mechanisms, 
e.g. scalable IOV, which is an alternative to SR-IOV. Shouldn't we make 
the kernel deal with these mechanisms instead of user space?

Thanks,

Bart.



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-08 12:17     ` Stefan Hajnoczi
@ 2023-02-13  3:47       ` Ming Lei
  2023-02-13 19:13         ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-13  3:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang, ming.lei

On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > Hello,
> > > > 
> > > > So far UBLK is only used for implementing virtual block device from
> > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > 
> > > I won't be at LSF/MM so here are my thoughts:
> > 
> > Thanks for the thoughts, :-)
> > 
> > > 
> > > > 
> > > > It could be useful for UBLK to cover real storage hardware too:
> > > > 
> > > > - for fast prototype or performance evaluation
> > > > 
> > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > the current UBLK interface doesn't support such devices, since it needs
> > > > all LUNs/Namespaces to share host resources(such as tag)
> > > 
> > > Can you explain this in more detail? It seems like an iSCSI or
> > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > What am I missing?
> > 
> > The current ublk can't do that yet, because the interface doesn't
> > support multiple ublk disks sharing single host, which is exactly
> > the case of scsi and nvme.
> 
> Can you give an example that shows exactly where a problem is hit?
> 
> I took a quick look at the ublk source code and didn't spot a place
> where it prevents a single ublk server process from handling multiple
> devices.
> 
> Regarding "host resources(such as tag)", can the ublk server deal with
> that in userspace? The Linux block layer doesn't have the concept of a
> "host", that would come in at the SCSI/NVMe level that's implemented in
> userspace.
> 
> I don't understand yet...

blk_mq_tag_set is embedded into the driver's host structure and referred
to by each queue via q->tag_set; both scsi and nvme allocate tags
host-wide/queue-wide, that is, all LUNs/NSs share the host/queue tags.
Currently every ublk device is independent and can't share tags.
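
Roughly (simplified from include/scsi/scsi_host.h and
drivers/nvme/host/pci.c, member lists trimmed):

```c
#include <linux/blk-mq.h>

/* Simplified: one tag_set per host/controller, shared by the request
 * queues of all LUNs/namespaces via q->tag_set. Each ublk device
 * currently allocates its own independent tag_set instead.
 */
struct Scsi_Host {
	struct blk_mq_tag_set	tag_set;	/* shared by all LUNs */
	/* ... */
};

struct nvme_dev {
	struct blk_mq_tag_set	tagset;		/* shared by all namespaces */
	/* ... */
};
```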

> 
> > 
> > > 
> > > > 
> > > > - SPDK has supported user space driver for real hardware
> > > 
> > > I think this could already be implemented today. There will be extra
> > > memory copies because SPDK won't have access to the application's memory
> > > pages.
> > 
> > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't
> > such extra copy per my understanding.
> > 
> > > 
> > > > 
> > > > So propose to extend UBLK for supporting real hardware device:
> > > > 
> > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > as SCSI Luns/NVME Namespaces
> > > > 
> > > > 2) the followings are related with operating hardware from userspace,
> > > > so userspace driver has to be trusted, and root is required, and
> > > > can't support unprivileged UBLK device
> > > 
> > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > That means memory and interrupts are isolated. Neither userspace nor the
> > > hardware device can access memory or interrupts that the userspace
> > > process is not allowed to access.
> > > 
> > > I think there are still limitations like all memory pages exposed to the
> > > device need to be pinned. So effectively you might still need privileges
> > > to get the mlock resource limits.
> > > 
> > > But overall I think what you're saying about root and unprivileged ublk
> > > devices is not true. Hardware support should be developed with the goal
> > > of supporting unprivileged userspace ublk servers.
> > > 
> > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > they want. The user/admin will need to give them permission to open a
> > > network card, SCSI HBA, etc.
> > 
> > It depends on implementation, please see
> > 
> > 	https://spdk.io/doc/userspace.html
> > 
> > 	```
> > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > 	then follows along with the NVMe Specification to initialize the device,
> > 	create queue pairs, and ultimately send I/O.
> > 	```
> > 
> > The above way needs userspace to operating hardware by the mapped BAR,
> > which can't be allowed for unprivileged user.
> 
> From https://spdk.io/doc/system_configuration.html:
> 
>   Running SPDK as non-privileged user
> 
>   One of the benefits of using the VFIO Linux kernel driver is the
>   ability to perform DMA operations with peripheral devices as
>   unprivileged user. The permissions to access particular devices still
>   need to be granted by the system administrator, but only on a one-time
>   basis. Note that this functionality is supported with DPDK starting
>   from version 18.11.
> 
> This is what I had described in my previous reply.

My references on spdk were mostly from the spdk/nvme docs.
Taking a quick look at the spdk code, it looks like both vfio and
directly programming the hardware are supported:

1) lib/nvme/nvme_vfio_user.c
const struct spdk_nvme_transport_ops vfio_ops = {
	.qpair_submit_request = nvme_pcie_qpair_submit_request,
	...

2) lib/nvme/nvme_pcie.c
const struct spdk_nvme_transport_ops pcie_ops = {
	.qpair_submit_request = nvme_pcie_qpair_submit_request
		-> nvme_pcie_qpair_submit_tracker
			-> nvme_pcie_qpair_ring_sq_doorbell

but vfio dma isn't used in nvme_pcie_qpair_submit_request; it simply
writes/reads the mmapped mmio.

> 
> > 
> > > 
> > > > 
> > > > 3) how to operating hardware memory space
> > > > - unbind kernel driver and rebind with uio/vfio
> > > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > > with mapped user address via MMIO
> > > >
> > > > 4) DMA
> > > > - DMA requires physical memory address, UBLK driver actually has
> > > > block request pages, so can we export request SG list(each segment
> > > > physical address, offset, len) into userspace? If the max_segments
> > > > limit is not too big(<=64), the needed buffer for holding SG list
> > > > can be small enough.
> > > 
> > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > > controls the IOMMU page tables via Linux VFIO ioctls.
> > > 
> > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > > IOMMU mapping that makes a range of userspace virtual addresses
> > > available at a given IOVA.
> > > 
> > > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > > program will be slow if it does this frequently.
> > 
> > Yeah, but SPDK shouldn't use vfio DMA interface, see:
> > 
> > https://spdk.io/doc/memory.html
> > 
> > they just programs DMA directly with physical address of pinned hugepages.
> 
> From the page you linked:
> 
>   IOMMU Support
> 
>   ...
> 
>   This is a future-proof, hardware-accelerated solution for performing
>   DMA operations into and out of a user space process and forms the
>   long-term foundation for SPDK and DPDK's memory management strategy.
>   We highly recommend that applications are deployed using vfio and the
>   IOMMU enabled, which is fully supported today.
> 
> Yes, SPDK supports running without IOMMU, but they recommend running
> with the IOMMU.
> 
> > 
> > > 
> > > I think it's effectively the same problem as ublk zero-copy. We want to
> > > give the ublk server access to just the I/O buffers that it currently
> > > needs, but doing so would be expensive :(.
> > > 
> > > I think Linux has strategies for avoiding the expense like
> > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > > and/or the hardware device controller by userspace would still have
> > > access to the memory pages after I/O has completed. This reduces memory
> > > isolation :(.
> > > 
> > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> > 
> > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.
> 
> When using VFIO (recommended by the docs), SPDK uses long-lived DMA
> mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
> mapping is used:
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164

I meant the spdk nvme implementation.

> 
> > 
> > > 
> > > What I'm trying to get at is that either memory isolation is compromised
> > > or performance is reduced. It's hard to have good performance together
> > > with memory isolation.
> > > 
> > > I think ublk should follow the VFIO philosophy of being a safe
> > > kernel/userspace interface. If userspace is malicious or buggy, the
> > > kernel's and other process' memory should not be corrupted.
> > 
> > It is tradeoff between performance and isolation, that is why I mention
> > that directing programming hardware in userspace can be done by root
> > only.
> 
> Yes, there is a trade-off. Over the years the use of unsafe approaches
> has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
> secure boot, integrity architecture, and stuff like that becomes more
> widely used, it's harder to include features that break memory isolation
> in software in mainstream distros. There can be an option to sacrifice
> memory isolation for performance and some users may be willing to accept
> the trade-off. I think it should be an option feature though.
> 
> I did want to point out that the statement that "direct programming
> hardware in userspace can be done by root only" is false (see VFIO).

Unfortunately I don't see vfio being used when spdk/nvme operates on the
hardware mmio.

> 
> > > 
> > > > 
> > > > - small amount of physical memory for using as DMA descriptor can be
> > > > pre-allocated from userspace, and ask kernel to pin pages, then still
> > > > return physical address to userspace for programming DMA
> > > 
> > > I think this is possible today. The ublk server owns the I/O buffers. It
> > > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > > anything about this.
> > 
> > It depends on if such VFIO DMA mapping is required for each IO. If it
> > is required, that won't help one high performance driver.
> 
> It is not necessary to perform a DMA mapping for each IO. ublk's
> existing model is sufficient:
> 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> 2. At runtime the ublk server provides these I/O buffers to the kernel,
>    no further DMA mapping is required.
> 
> Unfortunately there's still the kernel<->userspace copy that existing
> ublk applications have, but there's no new overhead related to VFIO.

We are working on ublk zero copy to avoid that copy.

> 
> > > 
> > > > - this way is still zero copy
> > > 
> > > True zero-copy would be when an application does O_DIRECT I/O and the
> > > hardware device DMAs to/from the application's memory pages. ublk
> > > doesn't do that today and when combined with VFIO it doesn't get any
> > > easier. I don't think it's possible because you cannot allow userspace
> > > to control a hardware device and grant DMA access to pages that
> > > userspace isn't allowed to access. A malicious userspace will program
> > > the device to access those pages :).
> > 
> > But that should be what SPDK nvme/pci is doing per the above links, :-)
> 
> Sure, it's possible to break memory isolation. Breaking memory isolation
> isn't specific to ublk servers that access hardware. The same unsafe
> zero-copy approach would probably also work for regular ublk servers.
> This is basically bringing back /dev/kmem :).
> 
> > 
> > > 
> > > > 
> > > > 5) notification from hardware: interrupt or polling
> > > > - SPDK applies userspace polling, this way is doable, but
> > > > eat CPU, so it is only one choice
> > > > 
> > > > - io_uring command has been proved as very efficient, if io_uring
> > > > command is applied(similar way with UBLK for forwarding blk io
> > > > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > > > which should be efficient too, given batching processes are done after
> > > > the io_uring command is completed
> > > 
> > > I wonder how much difference there is between the new io_uring command
> > > for receiving VFIO irqs that you are suggesting compared to the existing
> > > io_uring approach IORING_OP_READ eventfd.
> > 
> > eventfd needs extra read/write on the event fd, so more syscalls are
> > required.
> 
> No extra syscall is required because IORING_OP_READ is used to read the
> eventfd, but maybe you were referring to bypassing the
> file->f_op->read() code path?

OK, missed that. It is usually done in the following way:

	/* poll the eventfd, linked to the read that consumes it */
	io_uring_prep_poll_add(sqe, evfd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;
	...
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;

When I get time, I will compare the two and see which one performs better.



thanks, 
Ming



* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-13  3:47       ` Ming Lei
@ 2023-02-13 19:13         ` Stefan Hajnoczi
  2023-02-15  0:51           ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-13 19:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang


On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > Hello,
> > > > > 
> > > > > So far UBLK is only used for implementing virtual block device from
> > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > 
> > > > I won't be at LSF/MM so here are my thoughts:
> > > 
> > > Thanks for the thoughts, :-)
> > > 
> > > > 
> > > > > 
> > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > 
> > > > > - for fast prototype or performance evaluation
> > > > > 
> > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > 
> > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > What am I missing?
> > > 
> > > The current ublk can't do that yet, because the interface doesn't
> > > support multiple ublk disks sharing single host, which is exactly
> > > the case of scsi and nvme.
> > 
> > Can you give an example that shows exactly where a problem is hit?
> > 
> > I took a quick look at the ublk source code and didn't spot a place
> > where it prevents a single ublk server process from handling multiple
> > devices.
> > 
> > Regarding "host resources(such as tag)", can the ublk server deal with
> > that in userspace? The Linux block layer doesn't have the concept of a
> > "host", that would come in at the SCSI/NVMe level that's implemented in
> > userspace.
> > 
> > I don't understand yet...
> 
> blk_mq_tag_set is embedded into driver host structure, and referred by queue
> via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> that said all LUNs/NSs share host/queue tags, current every ublk
> device is independent, and can't shard tags.

Does this actually prevent ublk servers with multiple ublk devices or is
it just sub-optimal?

Also, is this specific to real storage hardware? I guess userspace
NVMe-over-TCP or iSCSI initiators would be affected regardless of
whether they simply use the Sockets API (software) or userspace device
drivers (hardware).

Sorry for all these questions, I think I'm a little confused because you
said "doesn't support such devices" and I thought this discussion was
about real storage hardware. Neither of these seem to apply to the
tag_set issue.

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > - SPDK has supported user space driver for real hardware
> > > > 
> > > > I think this could already be implemented today. There will be extra
> > > > memory copies because SPDK won't have access to the application's memory
> > > > pages.
> > > 
> > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't
> > > such extra copy per my understanding.
> > > 
> > > > 
> > > > > 
> > > > > So propose to extend UBLK for supporting real hardware device:
> > > > > 
> > > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > > as SCSI Luns/NVME Namespaces
> > > > > 
> > > > > 2) the followings are related with operating hardware from userspace,
> > > > > so userspace driver has to be trusted, and root is required, and
> > > > > can't support unprivileged UBLK device
> > > > 
> > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > hardware device can access memory or interrupts that the userspace
> > > > process is not allowed to access.
> > > > 
> > > > I think there are still limitations like all memory pages exposed to the
> > > > device need to be pinned. So effectively you might still need privileges
> > > > to get the mlock resource limits.
> > > > 
> > > > But overall I think what you're saying about root and unprivileged ublk
> > > > devices is not true. Hardware support should be developed with the goal
> > > > of supporting unprivileged userspace ublk servers.
> > > > 
> > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > they want. The user/admin will need to give them permission to open a
> > > > network card, SCSI HBA, etc.
> > > 
> > > It depends on implementation, please see
> > > 
> > > 	https://spdk.io/doc/userspace.html
> > > 
> > > 	```
> > > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > 	then follows along with the NVMe Specification to initialize the device,
> > > 	create queue pairs, and ultimately send I/O.
> > > 	```
> > > 
> > > The above way needs userspace to operating hardware by the mapped BAR,
> > > which can't be allowed for unprivileged user.
> > 
> > From https://spdk.io/doc/system_configuration.html:
> > 
> >   Running SPDK as non-privileged user
> > 
> >   One of the benefits of using the VFIO Linux kernel driver is the
> >   ability to perform DMA operations with peripheral devices as
> >   unprivileged user. The permissions to access particular devices still
> >   need to be granted by the system administrator, but only on a one-time
> >   basis. Note that this functionality is supported with DPDK starting
> >   from version 18.11.
> > 
> > This is what I had described in my previous reply.
> 
> My reference on spdk were mostly from spdk/nvme doc.
> Just take quick look at spdk code, looks both vfio and direct
> programming hardware are supported:
> 
> 1) lib/nvme/nvme_vfio_user.c
> const struct spdk_nvme_transport_ops vfio_ops {
> 	.qpair_submit_request = nvme_pcie_qpair_submit_request,

Ignore this, it's the userspace vfio-user UNIX domain socket protocol
support. It's not kernel VFIO and is unrelated to what we're discussing.
More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/

> 
> 
> 2) lib/nvme/nvme_pcie.c
> const struct spdk_nvme_transport_ops pcie_ops = {
> 	.qpair_submit_request = nvme_pcie_qpair_submit_request
> 		nvme_pcie_qpair_submit_tracker
> 			nvme_pcie_qpair_submit_tracker
> 				nvme_pcie_qpair_ring_sq_doorbell
> 
> but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply
> write/read mmaped mmio.

I have only a small amount of SPDK code experience, so this might be
wrong, but I think the NVMe PCI driver code does not need to directly
call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
abstractions and device driver APIs.

DMA memory is mapped permanently so the device driver doesn't need to
perform individual map/unmap operations in the data path. NVMe PCI
request submission builds the NVMe command structures containing device
addresses (i.e. IOVAs when IOMMU is enabled).

This code probably supports both IOMMU (VFIO) and non-IOMMU operation.
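
As a rough sketch of what the long-lived mapping looks like (not SPDK code;
error handling omitted, and the usual VFIO container/group setup is assumed
to have been done already):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* One-time, long-lived mapping done at startup, not per I/O. After
     * this, 'iova' is what gets written into the NVMe PRP/SGL entries. */
    static int dma_map_buffer(int container_fd, void *buf, size_t len,
                              uint64_t iova)
    {
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ |
                             VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uintptr_t)buf,
                    .iova  = iova,
                    .size  = len,
            };

            return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

The data path only reuses IOVAs inside that mapping, so no VFIO ioctl is
needed per request.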

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > 3) how to operating hardware memory space
> > > > > - unbind kernel driver and rebind with uio/vfio
> > > > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > > > with mapped user address via MMIO
> > > > >
> > > > > 4) DMA
> > > > > - DMA requires physical memory address, UBLK driver actually has
> > > > > block request pages, so can we export request SG list(each segment
> > > > > physical address, offset, len) into userspace? If the max_segments
> > > > > limit is not too big(<=64), the needed buffer for holding SG list
> > > > > can be small enough.
> > > > 
> > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > > > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > > > controls the IOMMU page tables via Linux VFIO ioctls.
> > > > 
> > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > > > IOMMU mapping that makes a range of userspace virtual addresses
> > > > available at a given IOVA.
> > > > 
> > > > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > > > program will be slow if it does this frequently.
> > > 
> > > Yeah, but SPDK shouldn't use vfio DMA interface, see:
> > > 
> > > https://spdk.io/doc/memory.html
> > > 
> > > they just programs DMA directly with physical address of pinned hugepages.
> > 
> > From the page you linked:
> > 
> >   IOMMU Support
> > 
> >   ...
> > 
> >   This is a future-proof, hardware-accelerated solution for performing
> >   DMA operations into and out of a user space process and forms the
> >   long-term foundation for SPDK and DPDK's memory management strategy.
> >   We highly recommend that applications are deployed using vfio and the
> >   IOMMU enabled, which is fully supported today.
> > 
> > Yes, SPDK supports running without IOMMU, but they recommend running
> > with the IOMMU.
> > 
> > > 
> > > > 
> > > > I think it's effectively the same problem as ublk zero-copy. We want to
> > > > give the ublk server access to just the I/O buffers that it currently
> > > > needs, but doing so would be expensive :(.
> > > > 
> > > > I think Linux has strategies for avoiding the expense like
> > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > > > and/or the hardware device controller by userspace would still have
> > > > access to the memory pages after I/O has completed. This reduces memory
> > > > isolation :(.
> > > > 
> > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> > > 
> > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.
> > 
> > When using VFIO (recommended by the docs), SPDK uses long-lived DMA
> > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
> > mapping is used:
> > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164
> 
> I meant spdk nvme implementation.

I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL
(operating system abstraction) will deal with IOMMU APIs (VFIO)
transparently.

> 
> > 
> > > 
> > > > 
> > > > What I'm trying to get at is that either memory isolation is compromised
> > > > or performance is reduced. It's hard to have good performance together
> > > > with memory isolation.
> > > > 
> > > > I think ublk should follow the VFIO philosophy of being a safe
> > > > kernel/userspace interface. If userspace is malicious or buggy, the
> > > > kernel's and other process' memory should not be corrupted.
> > > 
> > > It is tradeoff between performance and isolation, that is why I mention
> > > that directing programming hardware in userspace can be done by root
> > > only.
> > 
> > Yes, there is a trade-off. Over the years the use of unsafe approaches
> > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
> > secure boot, integrity architecture, and stuff like that becomes more
> > widely used, it's harder to include features that break memory isolation
> > in software in mainstream distros. There can be an option to sacrifice
> > memory isolation for performance and some users may be willing to accept
> > the trade-off. I think it should be an option feature though.
> > 
> > I did want to point out that the statement that "direct programming
> > hardware in userspace can be done by root only" is false (see VFIO).
> 
> Unfortunately not see vfio is used when spdk/nvme is operating hardware
> mmio.

I think my responses above answered this, but just to be clear: with
VFIO PCI userspace mmaps the BARs and performs direct accesses to them
(load/store instructions). No VFIO API wrappers are necessary for MMIO
accesses, so the code you posted works fine with VFIO.
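
In case it is useful, this is roughly what that looks like with kernel VFIO
(a sketch only; error handling omitted, device_fd comes from the usual VFIO
group setup, and sq_doorbell_offset/sq_tail are just illustrative names):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    static volatile uint32_t *map_bar0(int device_fd)
    {
            struct vfio_region_info info = {
                    .argsz = sizeof(info),
                    .index = VFIO_PCI_BAR0_REGION_INDEX,
            };

            ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

            return mmap(NULL, info.size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, device_fd, info.offset);
    }

    /* MMIO afterwards is a plain store, no VFIO wrapper involved: */
    /*   volatile uint32_t *bar0 = map_bar0(device_fd);            */
    /*   bar0[sq_doorbell_offset / 4] = sq_tail;                   */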

> 
> > 
> > > > 
> > > > > 
> > > > > - small amount of physical memory for using as DMA descriptor can be
> > > > > pre-allocated from userspace, and ask kernel to pin pages, then still
> > > > > return physical address to userspace for programming DMA
> > > > 
> > > > I think this is possible today. The ublk server owns the I/O buffers. It
> > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > > > anything about this.
> > > 
> > > It depends on if such VFIO DMA mapping is required for each IO. If it
> > > is required, that won't help one high performance driver.
> > 
> > It is not necessary to perform a DMA mapping for each IO. ublk's
> > existing model is sufficient:
> > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> > 2. At runtime the ublk server provides these I/O buffers to the kernel,
> >    no further DMA mapping is required.
> > 
> > Unfortunately there's still the kernel<->userspace copy that existing
> > ublk applications have, but there's no new overhead related to VFIO.
> 
> We are working on ublk zero copy for avoiding the copy.

I'm curious if it's possible to come up with a solution that doesn't
break memory isolation. Userspace controls the IOMMU with Linux VFIO, so
if kernel pages are exposed to the device, then userspace will also be
able to access them (e.g. by submitting a request that gets the device
to DMA those pages).

> 
> > 
> > > > 
> > > > > - this way is still zero copy
> > > > 
> > > > True zero-copy would be when an application does O_DIRECT I/O and the
> > > > hardware device DMAs to/from the application's memory pages. ublk
> > > > doesn't do that today and when combined with VFIO it doesn't get any
> > > > easier. I don't think it's possible because you cannot allow userspace
> > > > to control a hardware device and grant DMA access to pages that
> > > > userspace isn't allowed to access. A malicious userspace will program
> > > > the device to access those pages :).
> > > 
> > > But that should be what SPDK nvme/pci is doing per the above links, :-)
> > 
> > Sure, it's possible to break memory isolation. Breaking memory isolation
> > isn't specific to ublk servers that access hardware. The same unsafe
> > zero-copy approach would probably also work for regular ublk servers.
> > This is basically bringing back /dev/kmem :).
> > 
> > > 
> > > > 
> > > > > 
> > > > > 5) notification from hardware: interrupt or polling
> > > > > - SPDK applies userspace polling, this way is doable, but
> > > > > eat CPU, so it is only one choice
> > > > > 
> > > > > - io_uring command has been proved as very efficient, if io_uring
> > > > > command is applied(similar way with UBLK for forwarding blk io
> > > > > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > > > > which should be efficient too, given batching processes are done after
> > > > > the io_uring command is completed
> > > > 
> > > > I wonder how much difference there is between the new io_uring command
> > > > for receiving VFIO irqs that you are suggesting compared to the existing
> > > > io_uring approach IORING_OP_READ eventfd.
> > > 
> > > eventfd needs extra read/write on the event fd, so more syscalls are
> > > required.
> > 
> > No extra syscall is required because IORING_OP_READ is used to read the
> > eventfd, but maybe you were referring to bypassing the
> > file->f_op->read() code path?
> 
> OK, missed that, it is usually done in the following way:
> 
> 	io_uring_prep_poll_add(sqe, evfd, POLLIN)
> 	sqe->flags |= IOSQE_IO_LINK;
> 	...
>     sqe = io_uring_get_sqe(&ring);
>     io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
>     sqe->flags |= IOSQE_IO_LINK;
> 
> When I get time, will compare the two and see which one performs better.

That would be really interesting.
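
For reference, the IORING_OP_READ approach I had in mind is just a direct
read of the eventfd with no linked poll; a minimal liburing sketch (assuming
an initialized 'struct io_uring ring' and the irq eventfd in 'evfd'):

    uint64_t count;
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_read(sqe, evfd, &count, sizeof(count), 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);   /* completes when the irq fires */
    io_uring_cqe_seen(&ring, cqe);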

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-13 19:13         ` Stefan Hajnoczi
@ 2023-02-15  0:51           ` Ming Lei
  2023-02-15 15:27             ` Stefan Hajnoczi
  2023-02-16  9:44             ` Andreas Hindborg
  0 siblings, 2 replies; 34+ messages in thread
From: Ming Lei @ 2023-02-15  0:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang, ming.lei

On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > 
> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > 
> > > > Thanks for the thoughts, :-)
> > > > 
> > > > > 
> > > > > > 
> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > 
> > > > > > - for fast prototype or performance evaluation
> > > > > > 
> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > 
> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > What am I missing?
> > > > 
> > > > The current ublk can't do that yet, because the interface doesn't
> > > > support multiple ublk disks sharing single host, which is exactly
> > > > the case of scsi and nvme.
> > > 
> > > Can you give an example that shows exactly where a problem is hit?
> > > 
> > > I took a quick look at the ublk source code and didn't spot a place
> > > where it prevents a single ublk server process from handling multiple
> > > devices.
> > > 
> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > userspace.
> > > 
> > > I don't understand yet...
> > 
> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > that said all LUNs/NSs share host/queue tags, current every ublk
> > device is independent, and can't shard tags.
> 
> Does this actually prevent ublk servers with multiple ublk devices or is
> it just sub-optimal?

It is the former: ublk can't support multiple devices which share a single
host, because duplicated tags can be seen on the host side and the I/O then
fails.

> 
> Also, is this specific to real storage hardware? I guess userspace
> NVMe-over-TCP or iSCSI initiators would be affected  regardless of
> whether they simply use the Sockets API (software) or userspace device
> drivers (hardware).
> 
> Sorry for all these questions, I think I'm a little confused because you
> said "doesn't support such devices" and I thought this discussion was
> about real storage hardware. Neither of these seem to apply to the
> tag_set issue.

The reality is that both SCSI and NVMe (whether virtual or real hardware)
support multiple LUNs/NSs, so either the tag_set issue has to be solved or
multi-LUN/NS support has to be added.

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > - SPDK has supported user space driver for real hardware
> > > > > 
> > > > > I think this could already be implemented today. There will be extra
> > > > > memory copies because SPDK won't have access to the application's memory
> > > > > pages.
> > > > 
> > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't
> > > > such extra copy per my understanding.
> > > > 
> > > > > 
> > > > > > 
> > > > > > So propose to extend UBLK for supporting real hardware device:
> > > > > > 
> > > > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > > > as SCSI Luns/NVME Namespaces
> > > > > > 
> > > > > > 2) the followings are related with operating hardware from userspace,
> > > > > > so userspace driver has to be trusted, and root is required, and
> > > > > > can't support unprivileged UBLK device
> > > > > 
> > > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > > hardware device can access memory or interrupts that the userspace
> > > > > process is not allowed to access.
> > > > > 
> > > > > I think there are still limitations like all memory pages exposed to the
> > > > > device need to be pinned. So effectively you might still need privileges
> > > > > to get the mlock resource limits.
> > > > > 
> > > > > But overall I think what you're saying about root and unprivileged ublk
> > > > > devices is not true. Hardware support should be developed with the goal
> > > > > of supporting unprivileged userspace ublk servers.
> > > > > 
> > > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > > they want. The user/admin will need to give them permission to open a
> > > > > network card, SCSI HBA, etc.
> > > > 
> > > > It depends on implementation, please see
> > > > 
> > > > 	https://spdk.io/doc/userspace.html
> > > > 
> > > > 	```
> > > > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > > 	then follows along with the NVMe Specification to initialize the device,
> > > > 	create queue pairs, and ultimately send I/O.
> > > > 	```
> > > > 
> > > > The above way needs userspace to operating hardware by the mapped BAR,
> > > > which can't be allowed for unprivileged user.
> > > 
> > > From https://spdk.io/doc/system_configuration.html:
> > > 
> > >   Running SPDK as non-privileged user
> > > 
> > >   One of the benefits of using the VFIO Linux kernel driver is the
> > >   ability to perform DMA operations with peripheral devices as
> > >   unprivileged user. The permissions to access particular devices still
> > >   need to be granted by the system administrator, but only on a one-time
> > >   basis. Note that this functionality is supported with DPDK starting
> > >   from version 18.11.
> > > 
> > > This is what I had described in my previous reply.
> > 
> > My reference on spdk were mostly from spdk/nvme doc.
> > Just take quick look at spdk code, looks both vfio and direct
> > programming hardware are supported:
> > 
> > 1) lib/nvme/nvme_vfio_user.c
> > const struct spdk_nvme_transport_ops vfio_ops {
> > 	.qpair_submit_request = nvme_pcie_qpair_submit_request,
> 
> Ignore this, it's the userspace vfio-user UNIX domain socket protocol
> support. It's not kernel VFIO and is unrelated to what we're discussing.
> More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/

I'm not sure about that: why does .qpair_submit_request point to
nvme_pcie_qpair_submit_request?

> 
> > 
> > 
> > 2) lib/nvme/nvme_pcie.c
> > const struct spdk_nvme_transport_ops pcie_ops = {
> > 	.qpair_submit_request = nvme_pcie_qpair_submit_request
> > 		nvme_pcie_qpair_submit_tracker
> > 			nvme_pcie_qpair_submit_tracker
> > 				nvme_pcie_qpair_ring_sq_doorbell
> > 
> > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply
> > write/read mmaped mmio.
> 
> I have only a small amount of SPDK code experienced, so this might be

Me too.

> wrong, but I think the NVMe PCI driver code does not need to directly
> call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
> abstractions and device driver APIs.
> 
> DMA memory is mapped permanently so the device driver doesn't need to
> perform individual map/unmap operations in the data path. NVMe PCI
> request submission builds the NVMe command structures containing device
> addresses (i.e. IOVAs when IOMMU is enabled).

If the IOMMU isn't used, it is the physical address of memory.

Then I guess you can see why I said this approach can't be used by an
unprivileged user, because the driver writes the memory's physical address
to the device registers directly.

But other drivers could follow this approach if it is accepted.
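
If I'm reading the SPDK env code correctly, the translation the NVMe driver
relies on is spdk_vtophys(); roughly (a sketch, with 'buf'/'len' being the
pinned data buffer and error handling trimmed):

    #include "spdk/env.h"

    uint64_t size = len;
    uint64_t bus_addr = spdk_vtophys(buf, &size);

    if (bus_addr == SPDK_VTOPHYS_ERROR)
            return -EFAULT;

    /* Without an IOMMU this is the raw physical address; with vfio it is
     * an IOVA. Either way it is written into the PRP/SGL entries that the
     * device DMAs from. */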

> 
> This code probably supports both IOMMU (VFIO) and non-IOMMU operation.
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > 3) how to operating hardware memory space
> > > > > > - unbind kernel driver and rebind with uio/vfio
> > > > > > - map PCI BAR into userspace[2], then userspace can operate hardware
> > > > > > with mapped user address via MMIO
> > > > > >
> > > > > > 4) DMA
> > > > > > - DMA requires physical memory address, UBLK driver actually has
> > > > > > block request pages, so can we export request SG list(each segment
> > > > > > physical address, offset, len) into userspace? If the max_segments
> > > > > > limit is not too big(<=64), the needed buffer for holding SG list
> > > > > > can be small enough.
> > > > > 
> > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical
> > > > > address. The IOVA space is defined by the IOMMU page tables. Userspace
> > > > > controls the IOMMU page tables via Linux VFIO ioctls.
> > > > > 
> > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the
> > > > > IOMMU mapping that makes a range of userspace virtual addresses
> > > > > available at a given IOVA.
> > > > > 
> > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the
> > > > > program will be slow if it does this frequently.
> > > > 
> > > > Yeah, but SPDK shouldn't use vfio DMA interface, see:
> > > > 
> > > > https://spdk.io/doc/memory.html
> > > > 
> > > > they just programs DMA directly with physical address of pinned hugepages.
> > > 
> > > From the page you linked:
> > > 
> > >   IOMMU Support
> > > 
> > >   ...
> > > 
> > >   This is a future-proof, hardware-accelerated solution for performing
> > >   DMA operations into and out of a user space process and forms the
> > >   long-term foundation for SPDK and DPDK's memory management strategy.
> > >   We highly recommend that applications are deployed using vfio and the
> > >   IOMMU enabled, which is fully supported today.
> > > 
> > > Yes, SPDK supports running without IOMMU, but they recommend running
> > > with the IOMMU.
> > > 
> > > > 
> > > > > 
> > > > > I think it's effectively the same problem as ublk zero-copy. We want to
> > > > > give the ublk server access to just the I/O buffers that it currently
> > > > > needs, but doing so would be expensive :(.
> > > > > 
> > > > > I think Linux has strategies for avoiding the expense like
> > > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace
> > > > > and/or the hardware device controller by userspace would still have
> > > > > access to the memory pages after I/O has completed. This reduces memory
> > > > > isolation :(.
> > > > > 
> > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings.
> > > > 
> > > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping.
> > > 
> > > When using VFIO (recommended by the docs), SPDK uses long-lived DMA
> > > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA
> > > mapping is used:
> > > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371
> > > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164
> > 
> > I meant spdk nvme implementation.
> 
> I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL
> (operating system abstraction) will deal with IOMMU APIs (VFIO)
> transparently.
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > What I'm trying to get at is that either memory isolation is compromised
> > > > > or performance is reduced. It's hard to have good performance together
> > > > > with memory isolation.
> > > > > 
> > > > > I think ublk should follow the VFIO philosophy of being a safe
> > > > > kernel/userspace interface. If userspace is malicious or buggy, the
> > > > > kernel's and other process' memory should not be corrupted.
> > > > 
> > > > It is tradeoff between performance and isolation, that is why I mention
> > > > that directing programming hardware in userspace can be done by root
> > > > only.
> > > 
> > > Yes, there is a trade-off. Over the years the use of unsafe approaches
> > > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As
> > > secure boot, integrity architecture, and stuff like that becomes more
> > > widely used, it's harder to include features that break memory isolation
> > > in software in mainstream distros. There can be an option to sacrifice
> > > memory isolation for performance and some users may be willing to accept
> > > the trade-off. I think it should be an option feature though.
> > > 
> > > I did want to point out that the statement that "direct programming
> > > hardware in userspace can be done by root only" is false (see VFIO).
> > 
> > Unfortunately not see vfio is used when spdk/nvme is operating hardware
> > mmio.
> 
> I think my responses above answered this, but just to be clear: with
> VFIO PCI userspace mmaps the BARs and performs direct accesses to them
> (load/store instructions). No VFIO API wrappers are necessary for MMIO
> accesses, so the code you posted works fine with VFIO.
> 
> > 
> > > 
> > > > > 
> > > > > > 
> > > > > > - small amount of physical memory for using as DMA descriptor can be
> > > > > > pre-allocated from userspace, and ask kernel to pin pages, then still
> > > > > > return physical address to userspace for programming DMA
> > > > > 
> > > > > I think this is possible today. The ublk server owns the I/O buffers. It
> > > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know
> > > > > anything about this.
> > > > 
> > > > It depends on if such VFIO DMA mapping is required for each IO. If it
> > > > is required, that won't help one high performance driver.
> > > 
> > > It is not necessary to perform a DMA mapping for each IO. ublk's
> > > existing model is sufficient:
> > > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup.
> > > 2. At runtime the ublk server provides these I/O buffers to the kernel,
> > >    no further DMA mapping is required.
> > > 
> > > Unfortunately there's still the kernel<->userspace copy that existing
> > > ublk applications have, but there's no new overhead related to VFIO.
> > 
> > We are working on ublk zero copy for avoiding the copy.
> 
> I'm curious if it's possible to come up with a solution that doesn't
> break memory isolation. Userspace controls the IOMMU with Linux VFIO, so
> if kernel pages are exposed to the device, then userspace will also be
> able to access them (e.g. by submitting a request that gets the device
> to DMA those pages).

SPDK NVMe already exposes the physical address of memory and uses that
address to program the hardware directly, and I think that can't be done by
an untrusted user.

But I agree with you that this approach should be avoided as far as
possible.

> 
> > 
> > > 
> > > > > 
> > > > > > - this way is still zero copy
> > > > > 
> > > > > True zero-copy would be when an application does O_DIRECT I/O and the
> > > > > hardware device DMAs to/from the application's memory pages. ublk
> > > > > doesn't do that today and when combined with VFIO it doesn't get any
> > > > > easier. I don't think it's possible because you cannot allow userspace
> > > > > to control a hardware device and grant DMA access to pages that
> > > > > userspace isn't allowed to access. A malicious userspace will program
> > > > > the device to access those pages :).
> > > > 
> > > > But that should be what SPDK nvme/pci is doing per the above links, :-)
> > > 
> > > Sure, it's possible to break memory isolation. Breaking memory isolation
> > > isn't specific to ublk servers that access hardware. The same unsafe
> > > zero-copy approach would probably also work for regular ublk servers.
> > > This is basically bringing back /dev/kmem :).
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > 5) notification from hardware: interrupt or polling
> > > > > > - SPDK applies userspace polling, this way is doable, but
> > > > > > eat CPU, so it is only one choice
> > > > > > 
> > > > > > - io_uring command has been proved as very efficient, if io_uring
> > > > > > command is applied(similar way with UBLK for forwarding blk io
> > > > > > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > > > > > which should be efficient too, given batching processes are done after
> > > > > > the io_uring command is completed
> > > > > 
> > > > > I wonder how much difference there is between the new io_uring command
> > > > > for receiving VFIO irqs that you are suggesting compared to the existing
> > > > > io_uring approach IORING_OP_READ eventfd.
> > > > 
> > > > eventfd needs extra read/write on the event fd, so more syscalls are
> > > > required.
> > > 
> > > No extra syscall is required because IORING_OP_READ is used to read the
> > > eventfd, but maybe you were referring to bypassing the
> > > file->f_op->read() code path?
> > 
> > OK, missed that, it is usually done in the following way:
> > 
> > 	io_uring_prep_poll_add(sqe, evfd, POLLIN)
> > 	sqe->flags |= IOSQE_IO_LINK;
> > 	...
> >     sqe = io_uring_get_sqe(&ring);
> >     io_uring_prep_readv(sqe, evfd, &vec, 1, 0);
> >     sqe->flags |= IOSQE_IO_LINK;
> > 
> > When I get time, will compare the two and see which one performs better.
> 
> That would be really interesting.

Anyway, interrupt notification doesn't look like a big deal.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-15  0:51           ` Ming Lei
@ 2023-02-15 15:27             ` Stefan Hajnoczi
  2023-02-16  0:46               ` Ming Lei
  2023-02-16  9:44             ` Andreas Hindborg
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-15 15:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 9192 bytes --]

On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote:
> On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > 
> > > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > 
> > > > > Thanks for the thoughts, :-)
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > 
> > > > > > > - for fast prototype or performance evaluation
> > > > > > > 
> > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > 
> > > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > What am I missing?
> > > > > 
> > > > > The current ublk can't do that yet, because the interface doesn't
> > > > > support multiple ublk disks sharing single host, which is exactly
> > > > > the case of scsi and nvme.
> > > > 
> > > > Can you give an example that shows exactly where a problem is hit?
> > > > 
> > > > I took a quick look at the ublk source code and didn't spot a place
> > > > where it prevents a single ublk server process from handling multiple
> > > > devices.
> > > > 
> > > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > userspace.
> > > > 
> > > > I don't understand yet...
> > > 
> > > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > that said all LUNs/NSs share host/queue tags, current every ublk
> > > device is independent, and can't shard tags.
> > 
> > Does this actually prevent ublk servers with multiple ublk devices or is
> > it just sub-optimal?
> 
> It is former, ublk can't support multiple devices which share single host
> because duplicated tag can be seen in host side, then io is failed.

The kernel sees two independent block devices so there is no issue
within the kernel.

Userspace can do its own hw tag allocation if there are shared storage
controller resources (e.g. NVMe CIDs) to avoid duplicating tags.

Have I missed something?
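
To be concrete, I was imagining something as simple as a per-queue-pair CID
bitmap in the ublk server (a toy sketch: single-threaded per queue, bitmap
initialized to all ones, and ignoring crash cleanup):

    #include <stdint.h>

    struct cid_alloc {
            uint64_t free[16];              /* up to 1024 CIDs per queue */
    };

    static int cid_get(struct cid_alloc *a) /* -1 if the queue is full */
    {
            for (int w = 0; w < 16; w++) {
                    if (a->free[w]) {
                            int bit = __builtin_ctzll(a->free[w]);

                            a->free[w] &= ~(1ULL << bit);
                            return w * 64 + bit;
                    }
            }
            return -1;
    }

    static void cid_put(struct cid_alloc *a, int cid)
    {
            a->free[cid / 64] |= 1ULL << (cid % 64);
    }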

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > - SPDK has supported user space driver for real hardware
> > > > > > 
> > > > > > I think this could already be implemented today. There will be extra
> > > > > > memory copies because SPDK won't have access to the application's memory
> > > > > > pages.
> > > > > 
> > > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't
> > > > > such extra copy per my understanding.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > So propose to extend UBLK for supporting real hardware device:
> > > > > > > 
> > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such
> > > > > > > as SCSI Luns/NVME Namespaces
> > > > > > > 
> > > > > > > 2) the followings are related with operating hardware from userspace,
> > > > > > > so userspace driver has to be trusted, and root is required, and
> > > > > > > can't support unprivileged UBLK device
> > > > > > 
> > > > > > Linux VFIO provides a safe userspace API for userspace device drivers.
> > > > > > That means memory and interrupts are isolated. Neither userspace nor the
> > > > > > hardware device can access memory or interrupts that the userspace
> > > > > > process is not allowed to access.
> > > > > > 
> > > > > > I think there are still limitations like all memory pages exposed to the
> > > > > > device need to be pinned. So effectively you might still need privileges
> > > > > > to get the mlock resource limits.
> > > > > > 
> > > > > > But overall I think what you're saying about root and unprivileged ublk
> > > > > > devices is not true. Hardware support should be developed with the goal
> > > > > > of supporting unprivileged userspace ublk servers.
> > > > > > 
> > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device
> > > > > > they want. The user/admin will need to give them permission to open a
> > > > > > network card, SCSI HBA, etc.
> > > > > 
> > > > > It depends on implementation, please see
> > > > > 
> > > > > 	https://spdk.io/doc/userspace.html
> > > > > 
> > > > > 	```
> > > > > 	The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and
> > > > > 	then follows along with the NVMe Specification to initialize the device,
> > > > > 	create queue pairs, and ultimately send I/O.
> > > > > 	```
> > > > > 
> > > > > The above way needs userspace to operating hardware by the mapped BAR,
> > > > > which can't be allowed for unprivileged user.
> > > > 
> > > > From https://spdk.io/doc/system_configuration.html:
> > > > 
> > > >   Running SPDK as non-privileged user
> > > > 
> > > >   One of the benefits of using the VFIO Linux kernel driver is the
> > > >   ability to perform DMA operations with peripheral devices as
> > > >   unprivileged user. The permissions to access particular devices still
> > > >   need to be granted by the system administrator, but only on a one-time
> > > >   basis. Note that this functionality is supported with DPDK starting
> > > >   from version 18.11.
> > > > 
> > > > This is what I had described in my previous reply.
> > > 
> > > My reference on spdk were mostly from spdk/nvme doc.
> > > Just take quick look at spdk code, looks both vfio and direct
> > > programming hardware are supported:
> > > 
> > > 1) lib/nvme/nvme_vfio_user.c
> > > const struct spdk_nvme_transport_ops vfio_ops {
> > > 	.qpair_submit_request = nvme_pcie_qpair_submit_request,
> > 
> > Ignore this, it's the userspace vfio-user UNIX domain socket protocol
> > support. It's not kernel VFIO and is unrelated to what we're discussing.
> > More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/
> 
> Not sure, why does .qpair_submit_request point to
> nvme_pcie_qpair_submit_request?

The lib/nvme/nvme_vfio_user.c code is for when SPDK connects to a
vfio-user NVMe PCI device. The vfio-user protocol support is not handled
by the regular DPDK/SPDK PCI driver APIs, so the lib/nvme/nvme_pcie.c
doesn't work with vfio-user devices.

However, a lot of the code can be shared with the regular NVMe PCI
driver and that's why .qpair_submit_request points to
nvme_pcie_qpair_submit_request instead of a special version for
vfio-user.

If the vfio-user protocol becomes more widely used for other devices
besides NVMe PCI, then I guess the DPDK/SPDK developers will figure out
a way to move the vfio-user code into the core PCI driver API so that a
single lib/nvme/nvme_pcie.c file works with all PCI APIs (kernel VFIO,
vfio-user, etc). The code was probably structured like this because it's
hard to make those changes and they wanted to get vfio-user NVMe PCI
working quickly.

> 
> > 
> > > 
> > > 
> > > 2) lib/nvme/nvme_pcie.c
> > > const struct spdk_nvme_transport_ops pcie_ops = {
> > > 	.qpair_submit_request = nvme_pcie_qpair_submit_request
> > > 		nvme_pcie_qpair_submit_tracker
> > > 			nvme_pcie_qpair_submit_tracker
> > > 				nvme_pcie_qpair_ring_sq_doorbell
> > > 
> > > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply
> > > write/read mmaped mmio.
> > 
> > I have only a small amount of SPDK code experienced, so this might be
> 
> Me too.
> 
> > wrong, but I think the NVMe PCI driver code does not need to directly
> > call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system
> > abstractions and device driver APIs.
> > 
> > DMA memory is mapped permanently so the device driver doesn't need to
> > perform individual map/unmap operations in the data path. NVMe PCI
> > request submission builds the NVMe command structures containing device
> > addresses (i.e. IOVAs when IOMMU is enabled).
> 
> If IOMMU isn't used, it is physical address of memory.
> 
> Then I guess you may understand why I said this way can't be done by
> un-privileged user, cause driver is writing memory physical address to
> device register directly.
> 
> But other driver can follow this approach if the way is accepted.

Okay, I understand now that you were thinking of non-IOMMU use cases.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-15 15:27             ` Stefan Hajnoczi
@ 2023-02-16  0:46               ` Ming Lei
  2023-02-16 15:28                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-16  0:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang

On Wed, Feb 15, 2023 at 10:27:07AM -0500, Stefan Hajnoczi wrote:
> On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote:
> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > 
> > > > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > 
> > > > > > Thanks for the thoughts, :-)
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > 
> > > > > > > > - for fast prototype or performance evaluation
> > > > > > > > 
> > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > 
> > > > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > What am I missing?
> > > > > > 
> > > > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > the case of scsi and nvme.
> > > > > 
> > > > > Can you give an example that shows exactly where a problem is hit?
> > > > > 
> > > > > I took a quick look at the ublk source code and didn't spot a place
> > > > > where it prevents a single ublk server process from handling multiple
> > > > > devices.
> > > > > 
> > > > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > userspace.
> > > > > 
> > > > > I don't understand yet...
> > > > 
> > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > device is independent, and can't shard tags.
> > > 
> > > Does this actually prevent ublk servers with multiple ublk devices or is
> > > it just sub-optimal?
> > 
> > It is former, ublk can't support multiple devices which share single host
> > because duplicated tag can be seen in host side, then io is failed.
> 
> The kernel sees two independent block devices so there is no issue
> within the kernel.

This way either wastes memory or hurts performance, since we can't set a
proper queue depth for each ublk device. For example, with a controller that
supports 1024 outstanding commands shared by 64 namespaces, each independent
ublk device either gets the full 1024 tags (overcommitting the controller
and wasting memory) or a fixed 16 tags (poor when only a few namespaces are
active).

> 
> Userspace can do its own hw tag allocation if there are shared storage
> controller resources (e.g. NVMe CIDs) to avoid duplicating tags.
> 
> Have I missed something?

Please look at lib/sbitmap.c and block/blk-mq-tag.c and see how many hard
issues have been reported/fixed there in the past, and how much optimization
has been done in this area.

In theory hw tag allocation can be done in userspace, but it is hard to do
efficiently:

1) sharing data efficiently across CPUs (SMP) has proven to be a hard
problem, so don't reinvent the wheel in userspace; this work could take much
more effort than extending the current ublk interface, and would likely be
fruitless

2) allocating the tag twice slows down the io path a lot

3) it is even worse for userspace allocation, because the task can be killed
without any cleanup, so tag leaks can happen easily


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-15  0:51           ` Ming Lei
  2023-02-15 15:27             ` Stefan Hajnoczi
@ 2023-02-16  9:44             ` Andreas Hindborg
  2023-02-16 10:45               ` Ming Lei
  1 sibling, 1 reply; 34+ messages in thread
From: Andreas Hindborg @ 2023-02-16  9:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang,
	Andreas Hindborg


Hi Ming,

Ming Lei <ming.lei@redhat.com> writes:

> On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
>> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
>> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
>> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
>> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
>> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
>> > > > > > Hello,
>> > > > > > 
>> > > > > > So far UBLK is only used for implementing virtual block device from
>> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
>> > > > > 
>> > > > > I won't be at LSF/MM so here are my thoughts:
>> > > > 
>> > > > Thanks for the thoughts, :-)
>> > > > 
>> > > > > 
>> > > > > > 
>> > > > > > It could be useful for UBLK to cover real storage hardware too:
>> > > > > > 
>> > > > > > - for fast prototype or performance evaluation
>> > > > > > 
>> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
>> > > > > > the current UBLK interface doesn't support such devices, since it needs
>> > > > > > all LUNs/Namespaces to share host resources(such as tag)
>> > > > > 
>> > > > > Can you explain this in more detail? It seems like an iSCSI or
>> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
>> > > > > What am I missing?
>> > > > 
>> > > > The current ublk can't do that yet, because the interface doesn't
>> > > > support multiple ublk disks sharing single host, which is exactly
>> > > > the case of scsi and nvme.
>> > > 
>> > > Can you give an example that shows exactly where a problem is hit?
>> > > 
>> > > I took a quick look at the ublk source code and didn't spot a place
>> > > where it prevents a single ublk server process from handling multiple
>> > > devices.
>> > > 
>> > > Regarding "host resources(such as tag)", can the ublk server deal with
>> > > that in userspace? The Linux block layer doesn't have the concept of a
>> > > "host", that would come in at the SCSI/NVMe level that's implemented in
>> > > userspace.
>> > > 
>> > > I don't understand yet...
>> > 
>> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
>> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
>> > that said all LUNs/NSs share host/queue tags, current every ublk
>> > device is independent, and can't shard tags.
>> 
>> Does this actually prevent ublk servers with multiple ublk devices or is
>> it just sub-optimal?
>
> It is former, ublk can't support multiple devices which share single host
> because duplicated tag can be seen in host side, then io is failed.
>

I have trouble following this discussion. Why can we not handle multiple
block devices in a single ublk user space process?

From this conversation it seems that the limiting factor is allocation
of the tag set of the virtual device in the kernel? But as far as I can
tell, the tag sets are allocated per virtual block device in
`ublk_ctrl_add_dev()`?

It seems to me that a single ublk user space process should be able to
connect to multiple storage devices (for instance nvme-of) and then
create a ublk device for each namespace, all from a single ublk process.

Could you elaborate on why this is not possible?

Best regards,
Andreas Hindborg

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-16  9:44             ` Andreas Hindborg
@ 2023-02-16 10:45               ` Ming Lei
  2023-02-16 11:21                 ` Andreas Hindborg
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-16 10:45 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang,
	Andreas Hindborg, ming.lei

On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> 
> Hi Ming,
> 
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> >> > > > > > Hello,
> >> > > > > > 
> >> > > > > > So far UBLK is only used for implementing virtual block device from
> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> >> > > > > 
> >> > > > > I won't be at LSF/MM so here are my thoughts:
> >> > > > 
> >> > > > Thanks for the thoughts, :-)
> >> > > > 
> >> > > > > 
> >> > > > > > 
> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> >> > > > > > 
> >> > > > > > - for fast prototype or performance evaluation
> >> > > > > > 
> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> >> > > > > 
> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> >> > > > > What am I missing?
> >> > > > 
> >> > > > The current ublk can't do that yet, because the interface doesn't
> >> > > > support multiple ublk disks sharing single host, which is exactly
> >> > > > the case of scsi and nvme.
> >> > > 
> >> > > Can you give an example that shows exactly where a problem is hit?
> >> > > 
> >> > > I took a quick look at the ublk source code and didn't spot a place
> >> > > where it prevents a single ublk server process from handling multiple
> >> > > devices.
> >> > > 
> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> >> > > userspace.
> >> > > 
> >> > > I don't understand yet...
> >> > 
> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> >> > device is independent, and can't shard tags.
> >> 
> >> Does this actually prevent ublk servers with multiple ublk devices or is
> >> it just sub-optimal?
> >
> > It is former, ublk can't support multiple devices which share single host
> > because duplicated tag can be seen in host side, then io is failed.
> >
> 
> I have trouble following this discussion. Why can we not handle multiple
> block devices in a single ublk user space process?
> 
> From this conversation it seems that the limiting factor is allocation
> of the tag set of the virtual device in the kernel? But as far as I can
> tell, the tag sets are allocated per virtual block device in
> `ublk_ctrl_add_dev()`?
> 
> It seems to me that a single ublk user space process shuld be able to
> connect to multiple storage devices (for instance nvme-of) and then
> create a ublk device for each namespace, all from a single ublk process.
> 
> Could you elaborate on why this is not possible?

If the multiple storage devices are independent, the current ublk can
handle them just fine.

But if these storage devices (such as LUNs in iSCSI, or NSs in nvme-tcp)
share a single host and use a host-wide tagset, the current interface can't
work as expected, because tags are shared among all these devices. The
current ublk interface needs to be extended to cover this case.
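
To show what "host-wide tagset" means on the kernel side, this is roughly
the pattern SCSI and NVMe follow (illustrative names, not actual scsi/nvme
or ublk code):

    #include <linux/blk-mq.h>

    /* One tag_set per host/controller ... */
    struct my_host {
            struct blk_mq_tag_set tag_set;
    };

    static int my_host_init(struct my_host *h)
    {
            h->tag_set.ops = &my_mq_ops;        /* driver's queue_rq etc. */
            h->tag_set.nr_hw_queues = 4;
            h->tag_set.queue_depth = 128;       /* shared by all LUNs/NSs */
            h->tag_set.numa_node = NUMA_NO_NODE;
            return blk_mq_alloc_tag_set(&h->tag_set);
    }

    /* ... while every LUN/NS gets its own gendisk/request_queue that
     * allocates tags from that single shared set: */
    static struct gendisk *my_add_disk(struct my_host *h)
    {
            return blk_mq_alloc_disk(&h->tag_set, NULL);
    }

Today ublk_ctrl_add_dev() sets up an independent tag_set per ublk device, so
there is no way to tell the kernel that N ublk disks should draw from one
shared set of tags.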


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-16 10:45               ` Ming Lei
@ 2023-02-16 11:21                 ` Andreas Hindborg
  2023-02-17  2:20                   ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Hindborg @ 2023-02-16 11:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang


Ming Lei <ming.lei@redhat.com> writes:

> On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
>> 
>> Hi Ming,
>> 
>> Ming Lei <ming.lei@redhat.com> writes:
>> 
>> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
>> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
>> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
>> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
>> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
>> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
>> >> > > > > > Hello,
>> >> > > > > > 
>> >> > > > > > So far UBLK is only used for implementing virtual block device from
>> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
>> >> > > > > 
>> >> > > > > I won't be at LSF/MM so here are my thoughts:
>> >> > > > 
>> >> > > > Thanks for the thoughts, :-)
>> >> > > > 
>> >> > > > > 
>> >> > > > > > 
>> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
>> >> > > > > > 
>> >> > > > > > - for fast prototype or performance evaluation
>> >> > > > > > 
>> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
>> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
>> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
>> >> > > > > 
>> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
>> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
>> >> > > > > What am I missing?
>> >> > > > 
>> >> > > > The current ublk can't do that yet, because the interface doesn't
>> >> > > > support multiple ublk disks sharing single host, which is exactly
>> >> > > > the case of scsi and nvme.
>> >> > > 
>> >> > > Can you give an example that shows exactly where a problem is hit?
>> >> > > 
>> >> > > I took a quick look at the ublk source code and didn't spot a place
>> >> > > where it prevents a single ublk server process from handling multiple
>> >> > > devices.
>> >> > > 
>> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
>> >> > > that in userspace? The Linux block layer doesn't have the concept of a
>> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
>> >> > > userspace.
>> >> > > 
>> >> > > I don't understand yet...
>> >> > 
>> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
>> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
>> >> > that said all LUNs/NSs share host/queue tags, current every ublk
>> >> > device is independent, and can't shard tags.
>> >> 
>> >> Does this actually prevent ublk servers with multiple ublk devices or is
>> >> it just sub-optimal?
>> >
>> > It is former, ublk can't support multiple devices which share single host
>> > because duplicated tag can be seen in host side, then io is failed.
>> >
>> 
>> I have trouble following this discussion. Why can we not handle multiple
>> block devices in a single ublk user space process?
>> 
>> From this conversation it seems that the limiting factor is allocation
>> of the tag set of the virtual device in the kernel? But as far as I can
>> tell, the tag sets are allocated per virtual block device in
>> `ublk_ctrl_add_dev()`?
>> 
>> It seems to me that a single ublk user space process shuld be able to
>> connect to multiple storage devices (for instance nvme-of) and then
>> create a ublk device for each namespace, all from a single ublk process.
>> 
>> Could you elaborate on why this is not possible?
>
> If the multiple storages devices are independent, the current ublk can
> handle them just fine.
>
> But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> share single host, and use host-wide tagset, the current interface can't
> work as expected, because tags is shared among all these devices. The
> current ublk interface needs to be extended for covering this case.

Thanks for clarifying, that is very helpful.

Follow-up question: what would the implications be if one tried to
expose (through ublk) each nvme namespace of an nvme-of controller with
an independent tag set? What are the benefits of sharing a tagset across
all namespaces of a controller?

Best regards,
Andreas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-16  0:46               ` Ming Lei
@ 2023-02-16 15:28                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-16 15:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 4310 bytes --]

On Thu, Feb 16, 2023 at 08:46:56AM +0800, Ming Lei wrote:
> On Wed, Feb 15, 2023 at 10:27:07AM -0500, Stefan Hajnoczi wrote:
> > On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote:
> > > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > Hello,
> > > > > > > > > 
> > > > > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > 
> > > > > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > 
> > > > > > > Thanks for the thoughts, :-)
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > 
> > > > > > > > > - for fast prototype or performance evaluation
> > > > > > > > > 
> > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > 
> > > > > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > What am I missing?
> > > > > > > 
> > > > > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > the case of scsi and nvme.
> > > > > > 
> > > > > > Can you give an example that shows exactly where a problem is hit?
> > > > > > 
> > > > > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > where it prevents a single ublk server process from handling multiple
> > > > > > devices.
> > > > > > 
> > > > > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > userspace.
> > > > > > 
> > > > > > I don't understand yet...
> > > > > 
> > > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > device is independent, and can't shard tags.
> > > > 
> > > > Does this actually prevent ublk servers with multiple ublk devices or is
> > > > it just sub-optimal?
> > > 
> > > It is former, ublk can't support multiple devices which share single host
> > > because duplicated tag can be seen in host side, then io is failed.
> > 
> > The kernel sees two independent block devices so there is no issue
> > within the kernel.
> 
> This way either wastes memory, or performance is bad since we can't
> make a perfect queue depth for each ublk device.
> 
> > 
> > Userspace can do its own hw tag allocation if there are shared storage
> > controller resources (e.g. NVMe CIDs) to avoid duplicating tags.
> > 
> > Have I missed something?
> 
> Please look at lib/sbitmap.c and block/blk-mq-tag.c and see how many
> hard issues fixed/reported in the past, and how much optimization done
> in this area.
> 
> In theory hw tag allocation can be done in userspace, but just hard to
> do efficiently:
> 
> 1) it has been proved as one hard task for sharing data efficiently in
> SMP, so don't reinvent wheel in userspace, and this work could take
> much more efforts than extending current ublk interface, and just
> fruitless
> 
> 2) two times tag allocation slows down io path much
> 
> 2) even worse for userspace allocation, cause task can be killed and
> no cleanup is done, so tag leak can be caused easily

So then it is not "the former" after all?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-16 11:21                 ` Andreas Hindborg
@ 2023-02-17  2:20                   ` Ming Lei
  2023-02-17 16:39                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-17  2:20 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> 
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> >> 
> >> Hi Ming,
> >> 
> >> Ming Lei <ming.lei@redhat.com> writes:
> >> 
> >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> >> >> > > > > > Hello,
> >> >> > > > > > 
> >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> >> >> > > > > 
> >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> >> >> > > > 
> >> >> > > > Thanks for the thoughts, :-)
> >> >> > > > 
> >> >> > > > > 
> >> >> > > > > > 
> >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> >> >> > > > > > 
> >> >> > > > > > - for fast prototype or performance evaluation
> >> >> > > > > > 
> >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> >> >> > > > > 
> >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> >> >> > > > > What am I missing?
> >> >> > > > 
> >> >> > > > The current ublk can't do that yet, because the interface doesn't
> >> >> > > > support multiple ublk disks sharing single host, which is exactly
> >> >> > > > the case of scsi and nvme.
> >> >> > > 
> >> >> > > Can you give an example that shows exactly where a problem is hit?
> >> >> > > 
> >> >> > > I took a quick look at the ublk source code and didn't spot a place
> >> >> > > where it prevents a single ublk server process from handling multiple
> >> >> > > devices.
> >> >> > > 
> >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> >> >> > > userspace.
> >> >> > > 
> >> >> > > I don't understand yet...
> >> >> > 
> >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> >> >> > device is independent, and can't shard tags.
> >> >> 
> >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> >> >> it just sub-optimal?
> >> >
> >> > It is former, ublk can't support multiple devices which share single host
> >> > because duplicated tag can be seen in host side, then io is failed.
> >> >
> >> 
> >> I have trouble following this discussion. Why can we not handle multiple
> >> block devices in a single ublk user space process?
> >> 
> >> From this conversation it seems that the limiting factor is allocation
> >> of the tag set of the virtual device in the kernel? But as far as I can
> >> tell, the tag sets are allocated per virtual block device in
> >> `ublk_ctrl_add_dev()`?
> >> 
> >> It seems to me that a single ublk user space process shuld be able to
> >> connect to multiple storage devices (for instance nvme-of) and then
> >> create a ublk device for each namespace, all from a single ublk process.
> >> 
> >> Could you elaborate on why this is not possible?
> >
> > If the multiple storages devices are independent, the current ublk can
> > handle them just fine.
> >
> > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > share single host, and use host-wide tagset, the current interface can't
> > work as expected, because tags is shared among all these devices. The
> > current ublk interface needs to be extended for covering this case.
> 
> Thanks for clarifying, that is very helpful.
> 
> Follow up question: What would the implications be if one tried to
> expose (through ublk) each nvme namespace of an nvme-of controller with
> an independent tag set?

https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67

> What are the benefits of sharing a tagset across
> all namespaces of a controller?

The userspace implementation can be simplified a lot since a generic
shared tag allocator isn't needed, while still keeping good performance
(shared tag allocation under SMP is a hard problem).

The extension shouldn't be very hard; some raw ideas follow:

1) interface change

- add a new feature flag UBLK_F_SHARED_HOST, so that multiple ublk
  devices(ublkcXnY) can be attached to one ublk host(ublkhX)

- dev_info.dev_id: in case of UBLK_F_SHARED_HOST, the top 16 bits store
  the host id(X), and the bottom 16 bits store the device id(Y)
  (see the sketch further below)

- add two control commands: UBLK_CMD_ADD_HOST, UBLK_CMD_DEL_HOST

  Both are still sent to /dev/ublk-control

  The ADD_HOST command allocates one host char device with the specified
  (or an allocated) host id, and the tag_set is allocated as a host
  resource. The host device(ublkhX) becomes the parent of all ublkcXn*

  Before sending DEL_HOST, all devices attached to this host have to
  be stopped & removed first, otherwise DEL_HOST won't succeed.

- keep the other interfaces unchanged
  in case of UBLK_F_SHARED_HOST, userspace has to set the correct
  dev_info.dev_id.host_id, so the ublk driver can associate the device
  with the specified host

2) implementation
- the host device(ublkhX) becomes the parent of all ublk char devices
  ublkcXn*

- apart from the tagset, is any other per-host resource abstraction
  needed? It looks unnecessary, since everything else is available in
  userspace

- host-wide error handling: maybe all devices attached to this host
  need to be recovered together, so it should be done in userspace

- a per-host admin queue looks unnecessary, given that host-related
  management/control tasks are done in userspace directly

- others?
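
To make the dev_id encoding and the two new commands a bit more concrete,
here is a minimal, hypothetical uapi sketch; the flag bit and the command
opcodes are placeholders picked for illustration only, not existing ublk
definitions:

/*
 * Hypothetical sketch of the proposed uapi additions.
 */
#include <stdint.h>

#define UBLK_F_SHARED_HOST      (1ULL << 9)     /* assumed free feature bit */

/* new control commands, still sent to /dev/ublk-control */
#define UBLK_CMD_ADD_HOST       0x20    /* create ublkhX + host-wide tag_set */
#define UBLK_CMD_DEL_HOST       0x21    /* only after all ublkcXn* are gone */

/* dev_info.dev_id: top 16 bits = host id(X), bottom 16 bits = device id(Y) */
static inline uint32_t ublk_make_dev_id(uint16_t host_id, uint16_t dev_id)
{
        return ((uint32_t)host_id << 16) | dev_id;
}

static inline uint16_t ublk_dev_id_to_host(uint32_t id)
{
        return id >> 16;
}

static inline uint16_t ublk_dev_id_to_dev(uint32_t id)
{
        return id & 0xffff;
}

With something like the above, userspace fills dev_info.dev_id via
ublk_make_dev_id(X, Y) before ADD_DEV, so the driver can hang the new
ublkcXnY off ublkhX and make it use the host-wide tag_set.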


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-17  2:20                   ` Ming Lei
@ 2023-02-17 16:39                     ` Stefan Hajnoczi
  2023-02-18 11:22                       ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-17 16:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 8424 bytes --]

On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > 
> > Ming Lei <ming.lei@redhat.com> writes:
> > 
> > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > >> 
> > >> Hi Ming,
> > >> 
> > >> Ming Lei <ming.lei@redhat.com> writes:
> > >> 
> > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > >> >> > > > > > Hello,
> > >> >> > > > > > 
> > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > >> >> > > > > 
> > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > >> >> > > > 
> > >> >> > > > Thanks for the thoughts, :-)
> > >> >> > > > 
> > >> >> > > > > 
> > >> >> > > > > > 
> > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > >> >> > > > > > 
> > >> >> > > > > > - for fast prototype or performance evaluation
> > >> >> > > > > > 
> > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > >> >> > > > > 
> > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > >> >> > > > > What am I missing?
> > >> >> > > > 
> > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > >> >> > > > the case of scsi and nvme.
> > >> >> > > 
> > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > >> >> > > 
> > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > >> >> > > where it prevents a single ublk server process from handling multiple
> > >> >> > > devices.
> > >> >> > > 
> > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > >> >> > > userspace.
> > >> >> > > 
> > >> >> > > I don't understand yet...
> > >> >> > 
> > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > >> >> > device is independent, and can't shard tags.
> > >> >> 
> > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > >> >> it just sub-optimal?
> > >> >
> > >> > It is former, ublk can't support multiple devices which share single host
> > >> > because duplicated tag can be seen in host side, then io is failed.
> > >> >
> > >> 
> > >> I have trouble following this discussion. Why can we not handle multiple
> > >> block devices in a single ublk user space process?
> > >> 
> > >> From this conversation it seems that the limiting factor is allocation
> > >> of the tag set of the virtual device in the kernel? But as far as I can
> > >> tell, the tag sets are allocated per virtual block device in
> > >> `ublk_ctrl_add_dev()`?
> > >> 
> > >> It seems to me that a single ublk user space process shuld be able to
> > >> connect to multiple storage devices (for instance nvme-of) and then
> > >> create a ublk device for each namespace, all from a single ublk process.
> > >> 
> > >> Could you elaborate on why this is not possible?
> > >
> > > If the multiple storages devices are independent, the current ublk can
> > > handle them just fine.
> > >
> > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > share single host, and use host-wide tagset, the current interface can't
> > > work as expected, because tags is shared among all these devices. The
> > > current ublk interface needs to be extended for covering this case.
> > 
> > Thanks for clarifying, that is very helpful.
> > 
> > Follow up question: What would the implications be if one tried to
> > expose (through ublk) each nvme namespace of an nvme-of controller with
> > an independent tag set?
> 
> https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> 
> > What are the benefits of sharing a tagset across
> > all namespaces of a controller?
> 
> The userspace implementation can be simplified a lot since generic
> shared tag allocation isn't needed, meantime with good performance
> (shared tags allocation in SMP is one hard problem)

In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
shared tags across multiple SQs in NVMe. So userspace doesn't need an
SMP tag allocator in the first place:
- Each ublk server thread has a separate io_uring context.
- Each ublk server thread has its own NVMe Submission Queue.
- Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
  because there are no SMP concerns.
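
For example, with one SQ per thread, CID allocation can be a private free
stack per thread; a minimal sketch (made-up names, not ublksrv code):

#include <stdint.h>

#define SQ_DEPTH 128

struct cid_pool {
        uint16_t free[SQ_DEPTH];        /* stack of free command identifiers */
        int nr_free;
};

static void cid_pool_init(struct cid_pool *p)
{
        for (int i = 0; i < SQ_DEPTH; i++)
                p->free[i] = (uint16_t)i;
        p->nr_free = SQ_DEPTH;
}

/* -1 means the SQ is full; the caller holds the ublk request back */
static int cid_alloc(struct cid_pool *p)
{
        return p->nr_free ? p->free[--p->nr_free] : -1;
}

static void cid_free(struct cid_pool *p, uint16_t cid)
{
        p->free[p->nr_free++] = cid;
}

No locks or atomics anywhere, because nothing is shared between threads.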

The issue isn't tag allocation, it's the fact that the kernel block
layer submits requests to userspace that don't fit into the NVMe
Submission Queue because multiple devices that appear independent from
the kernel perspective are sharing a single NVMe Submission Queue.
Userspace needs a basic I/O scheduler to ensure fairness across devices.
Round-robin for example. There are no SMP concerns here either.
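
The fairness part can be a trivial round-robin pass over the devices that
share the queue, roughly like this (just a sketch, everything here is made
up for illustration and is not a ublk or NVMe API):

#include <stdbool.h>

#define NR_DEVS 4

struct dev_pending {
        int count;              /* ublk requests waiting for the hw queue */
};

/* take at most one request per device per pass so nobody is starved;
 * returns how many requests were moved into the hw queue */
static int round_robin_submit(struct dev_pending *devs, int room)
{
        int submitted = 0;
        bool progress = true;

        while (progress && submitted < room) {
                progress = false;
                for (int i = 0; i < NR_DEVS && submitted < room; i++) {
                        if (devs[i].count) {
                                devs[i].count--;        /* "submit" one */
                                submitted++;
                                progress = true;
                        }
                }
        }
        return submitted;
}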

So I don't buy the argument that userspace would have to duplicate the
tag allocation code from Linux because that solves a different problem
that the ublk server doesn't have.

If the kernel is aware of tag sharing, then userspace doesn't have to do
(trivial) tag allocation or I/O scheduling. It can simply stuff ublk io
commands into NVMe queues without thinking, which wastes fewer CPU
cycles and is a little simpler.

> The extension shouldn't be very hard, follows some raw ideas:

It is definitely nice for the ublk server to tell the kernel about
shared resources so the Linux block layer has the best information. I
think it's a good idea to add support for that. I just disagree with
some of the statements you've made about why and especially the claim
that ublk doesn't support multiple device servers today.

> 
> 1) interface change
> 
> - add new feature flag of UBLK_F_SHARED_HOST, multiple ublk
>   devices(ublkcXnY) are attached to the ublk host(ublkhX)
> 
> - dev_info.dev_id: in case of UBLK_F_SHARED_HOST, the top 16bit stores
>   host id(X), and the bottom 16bit stores device id(Y)
> 
> - add two control commands: UBLK_CMD_ADD_HOST, UBLK_CMD_DEL_HOST
> 
>   Still sent to /dev/ublk-control
> 
>   ADD_HOST command will allocate one host device(char) with specified host
>   id or allocated host id, tag_set is allocated as host resource. The
>   host device(ublkhX) will become parent of all ublkcXn*
> 
>   Before sending DEL_HOST, all devices attached to this host have to
>   be stopped & removed first, otherwise DEL_HOST won't succeed.
> 
> - keep other interfaces not changed
>   in case of UBLK_F_SHARED_HOST, userspace has to set correct
>   dev_info.dev_id.host_id, so ublk driver can associate device with
>   specified host
> 
> 2) implementation
> - host device(ublkhX) becomes parent of all ublk char devices of
>   ublkcXn*
> 
> - except for tagset, other per-host resource abstraction? Looks not
>   necessary, anything is available in userspace
> 
> - host-wide error handling, maybe all devices attached to this host
>   need to be recovered, so it should be done in userspace 
> 
> - per-host admin queue, looks not necessary, given host related
>   management/control tasks are done in userspace directly
> 
> - others?
> 
> 
> Thanks,
> Ming
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-17 16:39                     ` Stefan Hajnoczi
@ 2023-02-18 11:22                       ` Ming Lei
  2023-02-18 18:38                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-18 11:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > 
> > > Ming Lei <ming.lei@redhat.com> writes:
> > > 
> > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > >> 
> > > >> Hi Ming,
> > > >> 
> > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > >> 
> > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > >> >> > > > > > Hello,
> > > >> >> > > > > > 
> > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > >> >> > > > > 
> > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > >> >> > > > 
> > > >> >> > > > Thanks for the thoughts, :-)
> > > >> >> > > > 
> > > >> >> > > > > 
> > > >> >> > > > > > 
> > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > >> >> > > > > > 
> > > >> >> > > > > > - for fast prototype or performance evaluation
> > > >> >> > > > > > 
> > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > >> >> > > > > 
> > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > >> >> > > > > What am I missing?
> > > >> >> > > > 
> > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > >> >> > > > the case of scsi and nvme.
> > > >> >> > > 
> > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > >> >> > > 
> > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > >> >> > > devices.
> > > >> >> > > 
> > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > >> >> > > userspace.
> > > >> >> > > 
> > > >> >> > > I don't understand yet...
> > > >> >> > 
> > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > >> >> > device is independent, and can't shard tags.
> > > >> >> 
> > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > >> >> it just sub-optimal?
> > > >> >
> > > >> > It is former, ublk can't support multiple devices which share single host
> > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > >> >
> > > >> 
> > > >> I have trouble following this discussion. Why can we not handle multiple
> > > >> block devices in a single ublk user space process?
> > > >> 
> > > >> From this conversation it seems that the limiting factor is allocation
> > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > >> tell, the tag sets are allocated per virtual block device in
> > > >> `ublk_ctrl_add_dev()`?
> > > >> 
> > > >> It seems to me that a single ublk user space process shuld be able to
> > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > >> create a ublk device for each namespace, all from a single ublk process.
> > > >> 
> > > >> Could you elaborate on why this is not possible?
> > > >
> > > > If the multiple storages devices are independent, the current ublk can
> > > > handle them just fine.
> > > >
> > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > share single host, and use host-wide tagset, the current interface can't
> > > > work as expected, because tags is shared among all these devices. The
> > > > current ublk interface needs to be extended for covering this case.
> > > 
> > > Thanks for clarifying, that is very helpful.
> > > 
> > > Follow up question: What would the implications be if one tried to
> > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > an independent tag set?
> > 
> > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > 
> > > What are the benefits of sharing a tagset across
> > > all namespaces of a controller?
> > 
> > The userspace implementation can be simplified a lot since generic
> > shared tag allocation isn't needed, meantime with good performance
> > (shared tags allocation in SMP is one hard problem)
> 
> In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> shared tags across multiple SQs in NVMe. So userspace doesn't need an

In reality the max supported nr_queues of nvme is often much less than
nr_cpu_ids; for example, lots of nvme-pci devices support at most
32 queues, and I remember that Azure nvme supports even fewer (just 8
queues). That is because queues aren't free in either software or
hardware, and implementations often trade off performance against cost.

Not to mention that most scsi devices are single-queue, where tag
allocations from all CPUs go against a single shared tagset.

So there are still per-queue tag allocations from different CPUs aimed
at the same queue.

What we are discussing is supposed to be a generic solution, not
something just for the ideal 1:1 mapping device, which isn't dominant
in reality.

> SMP tag allocator in the first place:
> - Each ublk server thread has a separate io_uring context.
> - Each ublk server thread has its own NVMe Submission Queue.
> - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
>   because there are no SMP concerns.

It isn't even trivial for 1:1 mapping: when any ublk server crashes,
the global tags it held are leaked, and other ublk servers can't use the
leaked tags any more.

Not to mention there are lots of single-queue devices (1:M), or devices
whose nr_queues is much less than nr_cpu_ids (N:M, N < M). It is pretty
easy to see 1:M or N:M mappings for both nvme and scsi.

> 
> The issue isn't tag allocation, it's the fact that the kernel block
> layer submits requests to userspace that don't fit into the NVMe
> Submission Queue because multiple devices that appear independent from
> the kernel perspective are sharing a single NVMe Submission Queue.
> Userspace needs a basic I/O scheduler to ensure fairness across devices.
> Round-robin for example.

We already have an io scheduler for /dev/ublkbN. Also, what I proposed
just aligns the ublk device with the actual device definition, and so far
tags are the only shared resource in the generic io code path.

> There are no SMP concerns here either.

No, see above.

> 
> So I don't buy the argument that userspace would have to duplicate the
> tag allocation code from Linux because that solves a different problem
> that the ublk server doesn't have.
> 
> If the kernel is aware of tag sharing, then userspace doesn't have to do
> (trivial) tag allocation or I/O scheduling. It can simply stuff ublk io

Again, it isn't trivial.

> commands into NVMe queues without thinking, which wastes fewer CPU
> cycles and is a little simpler.

Tag allocation is pretty generic and is supposed to be done in the
kernel, so userspace isn't supposed to duplicate that non-trivial
implementation.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-18 11:22                       ` Ming Lei
@ 2023-02-18 18:38                         ` Stefan Hajnoczi
  2023-02-22 23:17                           ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-18 18:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 8230 bytes --]

On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > 
> > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > 
> > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > >> 
> > > > >> Hi Ming,
> > > > >> 
> > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > >> 
> > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > >> >> > > > > > Hello,
> > > > >> >> > > > > > 
> > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > >> >> > > > > 
> > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > >> >> > > > 
> > > > >> >> > > > Thanks for the thoughts, :-)
> > > > >> >> > > > 
> > > > >> >> > > > > 
> > > > >> >> > > > > > 
> > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > >> >> > > > > > 
> > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > >> >> > > > > > 
> > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > >> >> > > > > 
> > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > >> >> > > > > What am I missing?
> > > > >> >> > > > 
> > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > >> >> > > > the case of scsi and nvme.
> > > > >> >> > > 
> > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > >> >> > > 
> > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > >> >> > > devices.
> > > > >> >> > > 
> > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > >> >> > > userspace.
> > > > >> >> > > 
> > > > >> >> > > I don't understand yet...
> > > > >> >> > 
> > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > >> >> > device is independent, and can't shard tags.
> > > > >> >> 
> > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > >> >> it just sub-optimal?
> > > > >> >
> > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > >> >
> > > > >> 
> > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > >> block devices in a single ublk user space process?
> > > > >> 
> > > > >> From this conversation it seems that the limiting factor is allocation
> > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > >> tell, the tag sets are allocated per virtual block device in
> > > > >> `ublk_ctrl_add_dev()`?
> > > > >> 
> > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > >> 
> > > > >> Could you elaborate on why this is not possible?
> > > > >
> > > > > If the multiple storages devices are independent, the current ublk can
> > > > > handle them just fine.
> > > > >
> > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > work as expected, because tags is shared among all these devices. The
> > > > > current ublk interface needs to be extended for covering this case.
> > > > 
> > > > Thanks for clarifying, that is very helpful.
> > > > 
> > > > Follow up question: What would the implications be if one tried to
> > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > an independent tag set?
> > > 
> > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > 
> > > > What are the benefits of sharing a tagset across
> > > > all namespaces of a controller?
> > > 
> > > The userspace implementation can be simplified a lot since generic
> > > shared tag allocation isn't needed, meantime with good performance
> > > (shared tags allocation in SMP is one hard problem)
> > 
> > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> 
> In reality the max supported nr_queues of nvme is often much less than
> nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> That is because queue isn't free in both software and hardware, which
> implementation is often tradeoff between performance and cost.

I didn't say that the ublk server should have nr_cpu_ids threads. I
thought the idea was the ublk server creates as many threads as it needs
(e.g. max 8 if the Azure NVMe device only has 8 queues).

Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?

> Not mention, most of scsi devices are SQ in which tag allocations from
> all CPUs are against single shared tagset.
> 
> So there is still per-queue tag allocations from different CPUs which aims
> at same queue.
>
> What we discussed are supposed to be generic solution, not something just
> for ideal 1:1 mapping device, which isn't dominant in reality.

The same trivial tag allocation can be used for SCSI: instead of a
private tag namespace (e.g. 0x0-0xffff), give each queue a private
subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has
0x80-0xff, etc).
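
In code that partitioning is tiny; a sketch, assuming the host-wide tag
space divides evenly across the queues:

#include <stdint.h>

struct tag_range {
        uint16_t base;          /* first tag owned by this queue's thread */
        uint16_t count;         /* number of tags in the window */
};

static struct tag_range tag_range_for_queue(unsigned int queue,
                                            unsigned int nr_queues,
                                            unsigned int total_tags)
{
        struct tag_range r;

        r.count = total_tags / nr_queues;
        r.base  = queue * r.count;
        return r;
}

/* e.g. total_tags = 0x100, nr_queues = 2: queue 0 gets 0x00-0x7f and
 * queue 1 gets 0x80-0xff, as in the example above */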

The issue is not whether the tag namespace is shared across queues, but
the threading model of the ublk server. If the threading model requires
queues to be shared, then it becomes more complex and slow.

It's not clear to me why you think ublk servers should choose threading
models that require queues to be shared? They don't have to. Unlike the
kernel, they can choose the number of threads.

> 
> > SMP tag allocator in the first place:
> > - Each ublk server thread has a separate io_uring context.
> > - Each ublk server thread has its own NVMe Submission Queue.
> > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
> >   because there are no SMP concerns.
> 
> It isn't even trivial for 1:1 mapping, when any ublk server crashes
> global tag will be leaked, and other ublk servers can't use the
> leaked tag any more.

I'm not sure what you're describing here, a multi-process ublk server?
Are you saying userspace must not do tag allocation itself because it
won't be able to recover?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-18 18:38                         ` Stefan Hajnoczi
@ 2023-02-22 23:17                           ` Ming Lei
  2023-02-23 20:18                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-22 23:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > 
> > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > 
> > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > >> 
> > > > > >> Hi Ming,
> > > > > >> 
> > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > >> 
> > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > >> >> > > > > > Hello,
> > > > > >> >> > > > > > 
> > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > >> >> > > > > 
> > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > >> >> > > > 
> > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > >> >> > > > 
> > > > > >> >> > > > > 
> > > > > >> >> > > > > > 
> > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > >> >> > > > > > 
> > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > >> >> > > > > > 
> > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > >> >> > > > > 
> > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > >> >> > > > > What am I missing?
> > > > > >> >> > > > 
> > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > >> >> > > > the case of scsi and nvme.
> > > > > >> >> > > 
> > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > >> >> > > 
> > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > >> >> > > devices.
> > > > > >> >> > > 
> > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > >> >> > > userspace.
> > > > > >> >> > > 
> > > > > >> >> > > I don't understand yet...
> > > > > >> >> > 
> > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > >> >> > device is independent, and can't shard tags.
> > > > > >> >> 
> > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > >> >> it just sub-optimal?
> > > > > >> >
> > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > >> >
> > > > > >> 
> > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > >> block devices in a single ublk user space process?
> > > > > >> 
> > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > >> `ublk_ctrl_add_dev()`?
> > > > > >> 
> > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > >> 
> > > > > >> Could you elaborate on why this is not possible?
> > > > > >
> > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > handle them just fine.
> > > > > >
> > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > current ublk interface needs to be extended for covering this case.
> > > > > 
> > > > > Thanks for clarifying, that is very helpful.
> > > > > 
> > > > > Follow up question: What would the implications be if one tried to
> > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > an independent tag set?
> > > > 
> > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > 
> > > > > What are the benefits of sharing a tagset across
> > > > > all namespaces of a controller?
> > > > 
> > > > The userspace implementation can be simplified a lot since generic
> > > > shared tag allocation isn't needed, meantime with good performance
> > > > (shared tags allocation in SMP is one hard problem)
> > > 
> > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > 
> > In reality the max supported nr_queues of nvme is often much less than
> > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > That is because queue isn't free in both software and hardware, which
> > implementation is often tradeoff between performance and cost.
> 
> I didn't say that the ublk server should have nr_cpu_ids threads. I
> thought the idea was the ublk server creates as many threads as it needs
> (e.g. max 8 if the Azure NVMe device only has 8 queues).
> 
> Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?

No.

In the ublksrv project, each pthread maps to one unique hardware queue, so
the total number of pthreads is equal to nr_hw_queues.

> 
> > Not mention, most of scsi devices are SQ in which tag allocations from
> > all CPUs are against single shared tagset.
> > 
> > So there is still per-queue tag allocations from different CPUs which aims
> > at same queue.
> >
> > What we discussed are supposed to be generic solution, not something just
> > for ideal 1:1 mapping device, which isn't dominant in reality.
> 
> The same trivial tag allocation can be used for SCSI: instead of a
> private tag namespace (e.g. 0x0-0xffff), give each queue a private
> subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has
> 0x80-0xff, etc).

Sorry, I may not be getting your point.

Each hw queue has its own tag space. For example, if one scsi adapter has 2
queues with a queue depth of 128, then each hardware queue's tag space is
0 ~ 127. Also, if there are two LUNs attached to this host, the two luns
share the two queues' tag space; that means for any IO issued to queue 0,
no matter whether it comes from lun0 or lun1, the allocated tag has to be
unique within the set of 0~127.
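
To illustrate that sharing: a toy allocator for queue 0's 128 tags, which
both the lun0 and lun1 handlers would have to go through. It is only a
crude stand-in for the kernel's sbitmap, and it ignores fairness, batching
and crash cleanup, which is exactly where the real difficulty is:

#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_DEPTH     128
#define WORDS           (QUEUE_DEPTH / 64)

static _Atomic uint64_t tag_map[WORDS]; /* shared by all luns on this queue */

static int queue_tag_alloc(void)
{
        for (int w = 0; w < WORDS; w++) {
                uint64_t old = atomic_load(&tag_map[w]);

                while (~old) {
                        int bit = __builtin_ctzll(~old);

                        if (atomic_compare_exchange_weak(&tag_map[w], &old,
                                        old | (1ULL << bit)))
                                return w * 64 + bit;
                        /* 'old' was reloaded by the failed cmpxchg; retry */
                }
        }
        return -1;      /* queue full: both luns have to wait */
}

static void queue_tag_free(int tag)
{
        atomic_fetch_and(&tag_map[tag / 64], ~(1ULL << (tag % 64)));
}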

> 
> The issue is not whether the tag namespace is shared across queues, but
> the threading model of the ublk server. If the threading model requires
> queues to be shared, then it becomes more complex and slow.

ublksrv's threading model is simple: each thread handles IOs from one unique
hw queue, so the total thread number is equal to nr_hw_queues.

If nr_hw_queues(nr_pthreads) < nr_cpu_ids, one queue(ublk pthread) has to
handle IO requests from more than one CPU, so there is contention on tag
allocation for this queue(ublk pthread).

> 
> It's not clear to me why you think ublk servers should choose threading
> models that require queues to be shared? They don't have to. Unlike the
> kernel, they can choose the number of threads.

Whether or not queues are shared simply depends on whether nr_hw_queues is
less than nr_cpu_ids. That is simple arithmetic, isn't it?

> 
> > 
> > > SMP tag allocator in the first place:
> > > - Each ublk server thread has a separate io_uring context.
> > > - Each ublk server thread has its own NVMe Submission Queue.
> > > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
> > >   because there are no SMP concerns.
> > 
> > It isn't even trivial for 1:1 mapping, when any ublk server crashes
> > global tag will be leaked, and other ublk servers can't use the
> > leaked tag any more.
> 
> I'm not sure what you're describing here, a multi-process ublk server?
> Are you saying userspace must not do tag allocation itself because it
> won't be able to recover?

It doesn't matter whether the ublk server uses multiple processes or
threads: if tag allocation is implemented in userspace, you have to take
thread/process panic into account, because if one process/pthread panics
without releasing a tag, that tag is no longer visible to the other ublk
servers.

That is because each queue's tag space is shared by all the LUNs/NSs,
which are each supposed to be implemented as a ublk server.

Tag utilization highly affects performance, and recovery could take a
while or never happen at all; during that period the leaked tags aren't
visible to the other LUNs/NSs(ublk servers). Not to mention that to fix
a tag leak during recovery you have to track each tag's user(ublk server
info), which adds cost/complexity to the fast/parallel io path. Is that
trivial to solve?


thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-22 23:17                           ` Ming Lei
@ 2023-02-23 20:18                             ` Stefan Hajnoczi
  2023-03-02  3:22                               ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-02-23 20:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 14209 bytes --]

On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > 
> > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > 
> > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > >> 
> > > > > > >> Hi Ming,
> > > > > > >> 
> > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > >> 
> > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > >> >> > > > > > Hello,
> > > > > > >> >> > > > > > 
> > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > >> >> > > > > 
> > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > >> >> > > > 
> > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > >> >> > > > 
> > > > > > >> >> > > > > 
> > > > > > >> >> > > > > > 
> > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > >> >> > > > > > 
> > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > >> >> > > > > > 
> > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > >> >> > > > > 
> > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > >> >> > > > > What am I missing?
> > > > > > >> >> > > > 
> > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > >> >> > > 
> > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > >> >> > > 
> > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > >> >> > > devices.
> > > > > > >> >> > > 
> > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > >> >> > > userspace.
> > > > > > >> >> > > 
> > > > > > >> >> > > I don't understand yet...
> > > > > > >> >> > 
> > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > >> >> 
> > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > >> >> it just sub-optimal?
> > > > > > >> >
> > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > >> >
> > > > > > >> 
> > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > >> block devices in a single ublk user space process?
> > > > > > >> 
> > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > >> 
> > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > >> 
> > > > > > >> Could you elaborate on why this is not possible?
> > > > > > >
> > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > handle them just fine.
> > > > > > >
> > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > 
> > > > > > Thanks for clarifying, that is very helpful.
> > > > > > 
> > > > > > Follow up question: What would the implications be if one tried to
> > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > an independent tag set?
> > > > > 
> > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > 
> > > > > > What are the benefits of sharing a tagset across
> > > > > > all namespaces of a controller?
> > > > > 
> > > > > The userspace implementation can be simplified a lot since generic
> > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > (shared tags allocation in SMP is one hard problem)
> > > > 
> > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > 
> > > In reality the max supported nr_queues of nvme is often much less than
> > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > That is because queue isn't free in both software and hardware, which
> > > implementation is often tradeoff between performance and cost.
> > 
> > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > thought the idea was the ublk server creates as many threads as it needs
> > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > 
> > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> 
> No.
> 
> In ublksrv project, each pthread maps to one unique hardware queue, so total
> number of pthread is equal to nr_hw_queues.

Good, I think we agree on that part.

Here is a summary of the ublk server model I've been describing:
1. Each pthread has a separate io_uring context.
2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
   command queue, etc).
3. Each pthread has a distinct subrange of the tag space if the tag
   space is shared across hardware submission queues.
4. Each pthread allocates tags from its subrange without coordinating
   with other threads. This is cheap and simple.
5. When the pthread runs out of tags it either suspends processing new
   ublk requests or enqueues them internally. When hardware completes
   requests, the pthread resumes requests that were waiting for tags.

This way multiple ublk_devices can be handled by a single ublk server
without the Linux block layer knowing the exact tag space sharing
relationship between ublk_devices and hardware submission queues (NVMe
SQ, SCSI command queue, etc).
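
To make points 3 and 4 concrete, here is a rough userspace sketch of
per-thread subrange allocation. It is purely illustrative: the struct and
function names are invented for this example and are not part of ublk or
ublksrv.

/*
 * Minimal sketch of points 3 and 4 above.  Each pthread owns a disjoint
 * [base, base + SUBRANGE_DEPTH) slice of the shared tag space, so
 * allocation needs no atomics or locks.  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define SUBRANGE_DEPTH 64              /* per-pthread share of the tag space */

struct tag_subrange {
	uint16_t base;                 /* first tag owned by this pthread */
	uint16_t free[SUBRANGE_DEPTH]; /* stack of free tags */
	int top;                       /* number of free tags left */
};

static void tag_subrange_init(struct tag_subrange *s, uint16_t base)
{
	int i;

	s->base = base;
	s->top = SUBRANGE_DEPTH;
	for (i = 0; i < SUBRANGE_DEPTH; i++)
		s->free[i] = base + i;
}

/* Returns -1 when the subrange is exhausted (see point 5). */
static int tag_alloc(struct tag_subrange *s)
{
	return s->top ? s->free[--s->top] : -1;
}

static void tag_free(struct tag_subrange *s, uint16_t tag)
{
	s->free[s->top++] = tag;
}

int main(void)
{
	struct tag_subrange q0, q1;

	/* queue 0 owns tags 0..63, queue 1 owns tags 64..127 */
	tag_subrange_init(&q0, 0);
	tag_subrange_init(&q1, 64);

	printf("q0 tag %d, q1 tag %d\n", tag_alloc(&q0), tag_alloc(&q1));
	return 0;
}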

When ublk adds support for configuring tagsets, then 3, 4, and 5 can be
eliminated. However, this is purely an optimization. Not that much
userspace code will be eliminated and the performance gain is not huge.

I believe this model works for the major storage protocols like NVMe and
SCSI.

I put forward this model to explain why I disagree that ublk cannot
support ublk servers with multiple devices (e.g. that I/O would fail due
to duplicated tags).

I think we agree on 1 and 2. It's 3, 4, and 5 that I think you are
either saying won't work or are very complex/hard?

> > 
> > > Not mention, most of scsi devices are SQ in which tag allocations from
> > > all CPUs are against single shared tagset.
> > > 
> > > So there is still per-queue tag allocations from different CPUs which aims
> > > at same queue.
> > >
> > > What we discussed are supposed to be generic solution, not something just
> > > for ideal 1:1 mapping device, which isn't dominant in reality.
> > 
> > The same trivial tag allocation can be used for SCSI: instead of a
> > private tag namespace (e.g. 0x0-0xffff), give each queue a private
> > subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has
> > 0x80-0xff, etc).
> 
> Sorry, I may not get your point.
> 
> Each hw queue has its own tag space, for example, one scsi adaptor has 2
> queues, queue depth is 128, then each hardware queue's tag space is
> 0 ~ 127.
>
> Also if there are two LUNs attached to this host, the two luns
> share the two queue's tag space, that means any IO issued to queue 0,
> no matter if it is from lun0 or lun1, the allocated tag has to unique in
> the set of 0~127.

I'm trying to explain why tag allocation in userspace is simple and
cheap thanks to the ublk server's ability to create only as many threads
as hardware queues (e.g. NVMe SQs).

Even in the case where all hardware (NVMe/SCSI/etc) queues and LUNs
share the same tag space (the worst case), ublk server threads can
perform allocation from distinct subranges of the shared tag space.
There are no SMP concerns because there is no overlap in the tag space
between threads.

> > 
> > The issue is not whether the tag namespace is shared across queues, but
> > the threading model of the ublk server. If the threading model requires
> > queues to be shared, then it becomes more complex and slow.
> 
> ublksrv's threading model is simple, each thread handles IOs from one unique
> hw queue, so total thread number is equal to nr_hw_queues.

Here "hw queue" is a Linux block layer hw queue, not a hardware queue
(i.e. NVMe SQ)?

> 
> If nr_hw_queues(nr_pthreads) < nr_cpu_id, one queue(ublk pthread) has to
> handle IO requests from more than one CPUs, then contention on tag allocation
> from this queue(ublk pthread).

Userspace doesn't need to worry about the fact that I/O requests were
submitted by many CPUs. Each pthread processes one ublk_queue with a
known queue depth.

Each pthread has a range of userspace tags available. If no more tags
are available, it either waits for in-flight I/O to complete before
accepting more requests or queues the incoming requests internally.
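
As a rough illustration of that out-of-tags path (purely illustrative
again; the request/tag handling below is a trivial stand-in, not real
ublk or ublksrv code):

#include <stdio.h>

#define TAGS_PER_THREAD 4
#define MAX_PENDING 16

static int free_tags = TAGS_PER_THREAD; /* per-thread, no locking needed */
static int pending[MAX_PENDING];        /* requests parked without a tag */
static int npending;

/* Called when a new ublk request arrives on this thread's queue. */
static void handle_request(int req)
{
	if (free_tags > 0) {
		free_tags--;
		printf("req %d dispatched to hardware\n", req);
	} else {
		/* simple LIFO stack for brevity; a real server would FIFO */
		pending[npending++] = req;
		printf("req %d queued internally\n", req);
	}
}

/* Called when hardware completes a request and its tag is released. */
static void handle_completion(void)
{
	free_tags++;
	if (npending > 0)
		handle_request(pending[--npending]);
}

int main(void)
{
	int req;

	for (req = 0; req < 6; req++)
		handle_request(req);    /* the last two requests get parked */

	handle_completion();
	handle_completion();
	return 0;
}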

> > 
> > It's not clear to me why you think ublk servers should choose threading
> > models that require queues to be shared? They don't have to. Unlike the
> > kernel, they can choose the number of threads.
> 
> queue sharing or not simply depends on if nr_hw_queues is less than
> nr_cpu_id. That is one easy math problem, isn't it?

We're talking about different things. I mean sharing a hardware queue
(i.e. NVMe SQ) across multiple ublk server threads. You seem to define
queue sharing as multiple CPUs submitting I/O via ublk?

Thinking about your scenario: why does it matter if multiple CPUs submit
I/O to a single ublk_queue? I don't see how it makes a difference
whether 1 CPU or multiple CPUs enqueue requests on a single ublk_queue.
Userspace will process that ublk_queue in the same way in either case.

> > 
> > > 
> > > > SMP tag allocator in the first place:
> > > > - Each ublk server thread has a separate io_uring context.
> > > > - Each ublk server thread has its own NVMe Submission Queue.
> > > > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace
> > > >   because there are no SMP concerns.
> > > 
> > > It isn't even trivial for 1:1 mapping, when any ublk server crashes
> > > global tag will be leaked, and other ublk servers can't use the
> > > leaked tag any more.
> > 
> > I'm not sure what you're describing here, a multi-process ublk server?
> > Are you saying userspace must not do tag allocation itself because it
> > won't be able to recover?
> 
> No matter if the ublk server is multi process or threads. If tag
> allocation is implemented in userspace, you have to take thread/process
> panic into account. Because if one process/pthread panics without
> releasing one tag, the tag won't be visible to other ublk server any
> more.
>
> That is because each queue's tag space is shared for all LUNs/NSs which
> are supposed to implemented as ublk server.
>
> Tag utilization highly affects performance, and recover could take a
> bit long or even not recovered, during this period, the leaked tags
> aren't visible for other LUNs/NSs(ublk server), not mention for fixing
> tag leak in recover, you have to track each tag's user(ublk server info),
> which adds cost/complexity to fast/parallel io path, trivial to solve?

In the interest of time, let's defer the recovery discussion until
after the core discussion is finished. I would need to research how ublk
recovery works. I am happy to do that if you think recovery is the
reason why userspace cannot allocate tags, but reaching a conclusion on
the core discussion might be enough to make discussing recovery
unnecessary.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-23 20:18                             ` Stefan Hajnoczi
@ 2023-03-02  3:22                               ` Ming Lei
  2023-03-02 15:09                                 ` Stefan Hajnoczi
  2023-03-16 14:24                                 ` Stefan Hajnoczi
  0 siblings, 2 replies; 34+ messages in thread
From: Ming Lei @ 2023-03-02  3:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > 
> > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > 
> > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > >> 
> > > > > > > >> Hi Ming,
> > > > > > > >> 
> > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > >> 
> > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > >> >> > > > > > Hello,
> > > > > > > >> >> > > > > > 
> > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > >> >> > > > > 
> > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > >> >> > > > 
> > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > >> >> > > > 
> > > > > > > >> >> > > > > 
> > > > > > > >> >> > > > > > 
> > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > >> >> > > > > > 
> > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > >> >> > > > > > 
> > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > >> >> > > > > 
> > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > >> >> > > > > What am I missing?
> > > > > > > >> >> > > > 
> > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > >> >> > > 
> > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > >> >> > > 
> > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > >> >> > > devices.
> > > > > > > >> >> > > 
> > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > >> >> > > userspace.
> > > > > > > >> >> > > 
> > > > > > > >> >> > > I don't understand yet...
> > > > > > > >> >> > 
> > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > >> >> 
> > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > >> >> it just sub-optimal?
> > > > > > > >> >
> > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > >> >
> > > > > > > >> 
> > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > >> block devices in a single ublk user space process?
> > > > > > > >> 
> > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > >> 
> > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > >> 
> > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > >
> > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > handle them just fine.
> > > > > > > >
> > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > 
> > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > 
> > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > an independent tag set?
> > > > > > 
> > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > 
> > > > > > > What are the benefits of sharing a tagset across
> > > > > > > all namespaces of a controller?
> > > > > > 
> > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > 
> > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > 
> > > > In reality the max supported nr_queues of nvme is often much less than
> > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > That is because queue isn't free in both software and hardware, which
> > > > implementation is often tradeoff between performance and cost.
> > > 
> > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > thought the idea was the ublk server creates as many threads as it needs
> > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > 
> > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > 
> > No.
> > 
> > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > number of pthread is equal to nr_hw_queues.
> 
> Good, I think we agree on that part.
> 
> Here is a summary of the ublk server model I've been describing:
> 1. Each pthread has a separate io_uring context.
> 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
>    command queue, etc).
> 3. Each pthread has a distinct subrange of the tag space if the tag
>    space is shared across hardware submission queues.
> 4. Each pthread allocates tags from its subrange without coordinating
>    with other threads. This is cheap and simple.

That is also not doable.

The tag space can be pretty small; for example, the usb-storage queue
depth is just 1, and a USB card reader can support multiple LUNs too.

That is just one extreme example, but there are more low queue depth
scsi devices (sata: 32, ...); the typical nvme/pci queue depth is 1023,
but some implementations use less.

More importantly, subranging could waste lots of tags on idle LUNs/NSs,
while active LUNs/NSs have to suffer from their small tag subrange. The
available tag depth represents the max allowed in-flight block IOs, so
performance is affected a lot by subranging.

If you look at the block layer tag allocation change history, we have
never taken such an approach.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-02  3:22                               ` Ming Lei
@ 2023-03-02 15:09                                 ` Stefan Hajnoczi
  2023-03-17  3:10                                   ` Ming Lei
  2023-03-16 14:24                                 ` Stefan Hajnoczi
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-03-02 15:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 9972 bytes --]

On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > 
> > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > 
> > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > >> 
> > > > > > > > >> Hi Ming,
> > > > > > > > >> 
> > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > >> 
> > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > > > Hello,
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > >> >> > > devices.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > >> >> > > userspace.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > >> >> > 
> > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > >> >> 
> > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > >> >> it just sub-optimal?
> > > > > > > > >> >
> > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > >> >
> > > > > > > > >> 
> > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > >> 
> > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > >> 
> > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > >> 
> > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > >
> > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > handle them just fine.
> > > > > > > > >
> > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > 
> > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > 
> > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > an independent tag set?
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > 
> > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > all namespaces of a controller?
> > > > > > > 
> > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > 
> > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > 
> > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > That is because queue isn't free in both software and hardware, which
> > > > > implementation is often tradeoff between performance and cost.
> > > > 
> > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > thought the idea was the ublk server creates as many threads as it needs
> > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > 
> > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > 
> > > No.
> > > 
> > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > number of pthread is equal to nr_hw_queues.
> > 
> > Good, I think we agree on that part.
> > 
> > Here is a summary of the ublk server model I've been describing:
> > 1. Each pthread has a separate io_uring context.
> > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> >    command queue, etc).
> > 3. Each pthread has a distinct subrange of the tag space if the tag
> >    space is shared across hardware submission queues.
> > 4. Each pthread allocates tags from its subrange without coordinating
> >    with other threads. This is cheap and simple.
> 
> That is also not doable.
> 
> The tag space can be pretty small, such as, usb-storage queue depth
> is just 1, and usb card reader can support multi lun too.

If the tag space is very limited, just create one pthread.

> That is just one extreme example, but there can be more low queue depth
> scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> there could be some implementation with less.

NVMe PCI has per-SQ tags, so subranges aren't needed. Each pthread has
its own independent tag space. That means NVMe devices with low queue
depths work fine in the model I described.
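
One possible simplification under that 1:1 pthread/SQ mapping (my
assumption for illustration, not something ublk requires or implements):
since ublk tags are already unique within a ublk_queue, they could double
as NVMe CIDs, e.g.:

/*
 * Hypothetical helper: reuse the per-queue ublk tag as the NVMe CID.
 * Only valid under the assumptions above (one ublk_queue per SQ and
 * ublk queue depth <= SQ depth); not actual ublk or NVMe driver code,
 * and the struct below is a simplified stand-in for command dword 0.
 */
#include <stdint.h>
#include <stdio.h>

struct nvme_cmd_hdr {
	uint8_t opcode;
	uint8_t flags;
	uint16_t cid;          /* command identifier, unique per SQ */
};

static void fill_cmd_hdr(struct nvme_cmd_hdr *hdr, uint8_t opcode,
			 uint16_t ublk_tag)
{
	hdr->opcode = opcode;
	hdr->flags = 0;
	hdr->cid = ublk_tag;   /* no separate CID allocator needed */
}

int main(void)
{
	struct nvme_cmd_hdr hdr;

	fill_cmd_hdr(&hdr, 0x02 /* NVMe read */, 7 /* ublk tag */);
	printf("cid=%u\n", (unsigned)hdr.cid);
	return 0;
}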

I don't know the exact SCSI/SATA scenario you mentioned, but if there
are only 32 tags globally then just create one pthread. If you mean AHCI
PCI devices, my understanding is that AHCI is multi-LUN but each port
(LUN) has a single Command List (queue) with an independent tag space.
Therefore each port has just one ublk_queue that is handled by one
pthread.

> More importantly subrange could waste lots of tags for idle LUNs/NSs, and
> active LUNs/NSs will have to suffer from the small subrange tags. And available
> tags depth represents the max allowed in-flight block IOs, so performance
> is affected a lot by subrange.

Tag subranges are per pthread, not per-LUN/NS, so these concerns do not
apply to the model I described.

Are there any other reasons why you say this model is not doable?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 17:53 ` Hannes Reinecke
@ 2023-03-08  8:50   ` Hans Holmberg
  2023-03-08 12:27     ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Hans Holmberg @ 2023-03-08  8:50 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, Stefan Hajnoczi,
	ZiyangZhang

This is a great topic, so I'd like to be part of it as well.

It would be great to figure out what latency overhead we could expect
of ublk in the future, clarifying what use cases ublk could cater for.
This will help a lot in making decisions on what to implement
in-kernel vs user space.

Cheers,
Hans

On Mon, Feb 6, 2023 at 6:54 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 2/6/23 16:00, Ming Lei wrote:
> > Hello,
> >
> > So far UBLK is only used for implementing virtual block device from
> > userspace, such as loop, nbd, qcow2, ...[1].
> >
> > It could be useful for UBLK to cover real storage hardware too:
> >
> > - for fast prototype or performance evaluation
> >
> > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > the current UBLK interface doesn't support such devices, since it needs
> > all LUNs/Namespaces to share host resources(such as tag)
> >
> > - SPDK has supported user space driver for real hardware
> >
> > So propose to extend UBLK for supporting real hardware device:
> >
> > 1) extend UBLK ABI interface to support disks attached to host, such
> > as SCSI Luns/NVME Namespaces
> >
> > 2) the followings are related with operating hardware from userspace,
> > so userspace driver has to be trusted, and root is required, and
> > can't support unprivileged UBLK device
> >
> > 3) how to operating hardware memory space
> > - unbind kernel driver and rebind with uio/vfio
> > - map PCI BAR into userspace[2], then userspace can operate hardware
> > with mapped user address via MMIO
> >
> > 4) DMA
> > - DMA requires physical memory address, UBLK driver actually has
> > block request pages, so can we export request SG list(each segment
> > physical address, offset, len) into userspace? If the max_segments
> > limit is not too big(<=64), the needed buffer for holding SG list
> > can be small enough.
> >
> > - small amount of physical memory for using as DMA descriptor can be
> > pre-allocated from userspace, and ask kernel to pin pages, then still
> > return physical address to userspace for programming DMA
> >
> > - this way is still zero copy
> >
> > 5) notification from hardware: interrupt or polling
> > - SPDK applies userspace polling, this way is doable, but
> > eat CPU, so it is only one choice
> >
> > - io_uring command has been proved as very efficient, if io_uring
> > command is applied(similar way with UBLK for forwarding blk io
> > command from kernel to userspace) to uio/vfio for delivering interrupt,
> > which should be efficient too, given batching processes are done after
> > the io_uring command is completed
> >
> > - or it could be flexible by hybrid interrupt & polling, given
> > userspace single pthread/queue implementation can retrieve all
> > kinds of inflight IO info in very cheap way, and maybe it is likely
> > to apply some ML model to learn & predict when IO will be completed
> >
> > 6) others?
> >
> >
> Good idea.
> I'd love to have this discussion.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-08  8:50   ` Hans Holmberg
@ 2023-03-08 12:27     ` Ming Lei
  0 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2023-03-08 12:27 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Hannes Reinecke, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, Stefan Hajnoczi,
	ZiyangZhang, ming.lei

On Wed, Mar 08, 2023 at 09:50:53AM +0100, Hans Holmberg wrote:
> This is a great topic, so I'd like to be part of it as well.
> 
> It would be great to figure out what latency overhead we could expect
> of ublk in the future, clarifying what use cases ublk could cater for.
> This will help a lot in making decisions on what to implement
> in-kernel vs user space.

If the zero copy patchset[1] can be accepted, the main overhead should be
in io_uring command communication.

I just ran one quick test on my laptop comparing ublk/null (2 queues, depth 64,
with zero copy) against null_blk (2 queues, depth 64), running a single-job fio
workload (128 qd, batch 16, libaio, 4k randread). IOPS on ublk is within ~13%
of null_blk.

So the difference doesn't look bad, since IOPS reaches the million level
(1.29M vs. 1.46M). This basically shows the communication overhead. However,
ublk userspace can handle io in a lockless way, and minimize context switches
& maximize io parallelism via coroutines; that is ublk's advantage, and it is
hard or impossible to do in the kernel.

In the ublksrv[2] project, we have implemented loop, nbd & qcow2 so far; in
my previous IOPS test results:

1) kernel loop(dio) vs. ublk/loop: the two are close

2) kernel nbd vs. ublk/nbd: ublk/nbd is a bit better than kernel nbd

3) qemu-nbd based qcow2 vs. ublk/qcow2: ublk/qcow2 is much better

All three just work; no further optimization has been run yet.

Also ublk may perform badly if io isn't handled in batches, for example
with single queue depth io submission.
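
To show what "handled in batch" means on the io_uring side, here is a
generic liburing sketch; NOP requests stand in for the real ublk/target
SQEs, and only the batching pattern (many SQEs, one submit) matters:

/* Build with -luring.  Illustrative only: NOPs instead of real SQEs. */
#include <liburing.h>
#include <stdio.h>

#define BATCH 16

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int i, submitted;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	/* Queue a whole batch of SQEs, then pay for a single syscall. */
	for (i = 0; i < BATCH; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			break;
		io_uring_prep_nop(sqe);
		sqe->user_data = i;
	}
	submitted = io_uring_submit(&ring);  /* one io_uring_enter() */
	printf("submitted %d SQEs in one go\n", submitted);

	/* Reap completions; a real server batches this side too. */
	for (i = 0; i < submitted; i++) {
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}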

But ublk is still very young, and there is room for lots of optimization
in future, such as:

1) applying polling to reduce communication overhead for both the io
command and io handling, which should improve latency for low QD
workloads

2) applying some kind of ML model for predicting IO completion, to
improve io polling while reducing cpu utilization

3) improving the io_uring command path to reduce communication overhead

IMO, ublk is a generic userspace block device approach, especially good at:

1) handling complicated io logic, for example when a btree is used for io
mapping, because userspace has more weapons for this kind of work

2) virtual devices, such as all network based storage, or logical volume
management

3) quick prototype development

4) flexible storage simulation for test purposes

[1] https://lore.kernel.org/linux-block/ZAff9usDuyXxIPt9@ovpn-8-16.pek2.redhat.com/T/#t
[2] https://github.com/ming1/ubdsrv


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-02  3:22                               ` Ming Lei
  2023-03-02 15:09                                 ` Stefan Hajnoczi
@ 2023-03-16 14:24                                 ` Stefan Hajnoczi
  1 sibling, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-03-16 14:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 9387 bytes --]

On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > 
> > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > 
> > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > >> 
> > > > > > > > >> Hi Ming,
> > > > > > > > >> 
> > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > >> 
> > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > > > Hello,
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > >> >> > > devices.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > >> >> > > userspace.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > >> >> > 
> > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > >> >> 
> > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > >> >> it just sub-optimal?
> > > > > > > > >> >
> > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > >> >
> > > > > > > > >> 
> > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > >> 
> > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > >> 
> > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > >> 
> > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > >
> > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > handle them just fine.
> > > > > > > > >
> > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > 
> > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > 
> > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > an independent tag set?
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > 
> > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > all namespaces of a controller?
> > > > > > > 
> > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > 
> > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > 
> > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > That is because queue isn't free in both software and hardware, which
> > > > > implementation is often tradeoff between performance and cost.
> > > > 
> > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > thought the idea was the ublk server creates as many threads as it needs
> > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > 
> > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > 
> > > No.
> > > 
> > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > number of pthread is equal to nr_hw_queues.
> > 
> > Good, I think we agree on that part.
> > 
> > Here is a summary of the ublk server model I've been describing:
> > 1. Each pthread has a separate io_uring context.
> > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> >    command queue, etc).
> > 3. Each pthread has a distinct subrange of the tag space if the tag
> >    space is shared across hardware submission queues.
> > 4. Each pthread allocates tags from its subrange without coordinating
> >    with other threads. This is cheap and simple.
> 
> That is also not doable.
> 
> The tag space can be pretty small, such as, usb-storage queue depth
> is just 1, and usb card reader can support multi lun too.
> 
> That is just one extreme example, but there can be more low queue depth
> scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> there could be some implementation with less.
> 
> More importantly subrange could waste lots of tags for idle LUNs/NSs, and
> active LUNs/NSs will have to suffer from the small subrange tags. And available
> tags depth represents the max allowed in-flight block IOs, so performance
> is affected a lot by subrange.
> 
> If you look at block layer tag allocation change history, we never take
> such way.

Hi Ming,
Any thoughts on my last reply? If my mental model is incorrect I'd like
to learn why.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-02 15:09                                 ` Stefan Hajnoczi
@ 2023-03-17  3:10                                   ` Ming Lei
  2023-03-17 14:41                                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-03-17  3:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > 
> > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > 
> > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > >> 
> > > > > > > > > >> Hi Ming,
> > > > > > > > > >> 
> > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > >> 
> > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > >> >> > > > > > 
> > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > >> >> > > > > 
> > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > >> >> > > > 
> > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > >> >> > > > 
> > > > > > > > > >> >> > > > > 
> > > > > > > > > >> >> > > > > > 
> > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > >> >> > > > > > 
> > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > >> >> > > > > > 
> > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > >> >> > > > > 
> > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > >> >> > > > 
> > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > >> >> > > 
> > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > >> >> > > 
> > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > >> >> > > devices.
> > > > > > > > > >> >> > > 
> > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > >> >> > > userspace.
> > > > > > > > > >> >> > > 
> > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > >> >> > 
> > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > >> >> 
> > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > >> >
> > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > >> >
> > > > > > > > > >> 
> > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > >> 
> > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > >> 
> > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > >> 
> > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > >
> > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > handle them just fine.
> > > > > > > > > >
> > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > 
> > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > 
> > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > an independent tag set?
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > 
> > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > all namespaces of a controller?
> > > > > > > > 
> > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > 
> > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > 
> > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > implementation is often tradeoff between performance and cost.
> > > > > 
> > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > 
> > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > 
> > > > No.
> > > > 
> > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > number of pthread is equal to nr_hw_queues.
> > > 
> > > Good, I think we agree on that part.
> > > 
> > > Here is a summary of the ublk server model I've been describing:
> > > 1. Each pthread has a separate io_uring context.
> > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > >    command queue, etc).
> > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > >    space is shared across hardware submission queues.
> > > 4. Each pthread allocates tags from its subrange without coordinating
> > >    with other threads. This is cheap and simple.
> > 
> > That is also not doable.
> > 
> > The tag space can be pretty small, such as, usb-storage queue depth
> > is just 1, and usb card reader can support multi lun too.
> 
> If the tag space is very limited, just create one pthread.

What I meant is that sub-range isn't doable.

And each pthread is aligned with a queue; that has nothing to do with nr_tags.

> 
> > That is just one extreme example, but there can be more low queue depth
> > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > there could be some implementation with less.
> 
> NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> its own independent tag space. That means NVMe devices with low queue
> depths work fine in the model I described.

NVMe PCI isn't special, and it is covered by the current ublk abstraction,
so one way or another, we should not support both sub-range and
non-sub-range modes, to avoid unnecessary complexity.

"Each pthread has its own independent tag space" may mean two things

1) each LUN/NS is implemented in standalone process space:
- so every queue of each LUN has its own space, but all the queues with
same ID share the whole queue tag space
- that matches with current ublksrv
- also easier to implement

2) all LUNs/NSs are implemented in single process space
- so each pthread handles one queue for all NSs/LUNs

Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
char device has to handle multiple LUNs/NSs(disks), which still need
(big) ublk interface change. Also this way can't scale for single queue
devices.

Another thing is that the io command buffer has to be shared among all
LUNs/NSs, so the interface change has to cover a shared io command buffer.

With zero copy support, io buffer sharing needn't be considered, which
makes things a bit easier.

In short, the sharing of (tag, io command buffer, io buffer) needs to be
considered for shared host ublk disks.

Actually I prefer 1), which matches the current design; we can just add
a host concept into ublk, and the implementation could be easier.
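
Just to make that direction concrete, below is a purely hypothetical
sketch of what a host concept could look like on the control side. None
of these command or structure names exist in the current ublk ABI; they
are only meant to illustrate the shape of the extension:

/*
 * Purely hypothetical: no such commands or fields exist in the ublk
 * UAPI today.  The idea is only that devices created with the same
 * host id would share the host-wide tag space in the kernel.
 */
#include <stdint.h>

/* hypothetical new control commands */
#define UBLK_CMD_ADD_HOST	0x20
#define UBLK_CMD_DEL_HOST	0x21

/* hypothetical host descriptor passed with UBLK_CMD_ADD_HOST */
struct ublksrv_ctrl_host_info {
	uint32_t host_id;       /* chosen by the ublk server */
	uint16_t nr_hw_queues;  /* shared by all devices on this host */
	uint16_t queue_depth;   /* host-wide per-queue tag space */
};

/*
 * A device belonging to a host would then reference it at ADD_DEV time,
 * e.g. via a new (hypothetical) field in the existing device info
 * structure: dev_info.host_id = 1, with 0 meaning "no shared host",
 * i.e. current behaviour.
 */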

BTW, ublk has been used to implement an iscsi-alternative disk[1] for
Longhorn[2], and the performance improvement is pretty nice, so I think
supporting "shared host" ublk disks to cover multi-lun or multi-ns is a
reasonable requirement.

[1] https://github.com/ming1/ubdsrv/issues/49
[2] https://github.com/longhorn/longhorn

Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-17  3:10                                   ` Ming Lei
@ 2023-03-17 14:41                                     ` Stefan Hajnoczi
  2023-03-18  0:30                                       ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-03-17 14:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 14963 bytes --]

On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote:
> On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > > 
> > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > 
> > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > > >> 
> > > > > > > > > > >> Hi Ming,
> > > > > > > > > > >> 
> > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > >> 
> > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > > >> >> > > > > 
> > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > > >> >> > > > 
> > > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > > >> >> > > > 
> > > > > > > > > > >> >> > > > > 
> > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > > >> >> > > > > 
> > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > > >> >> > > > 
> > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > > >> >> > > 
> > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > > >> >> > > 
> > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > > >> >> > > devices.
> > > > > > > > > > >> >> > > 
> > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > > >> >> > > userspace.
> > > > > > > > > > >> >> > > 
> > > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > > >> >> > 
> > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > > >> >> 
> > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > > >> >
> > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > > >> >
> > > > > > > > > > >> 
> > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > > >> 
> > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > > >> 
> > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > > >> 
> > > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > > >
> > > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > > handle them just fine.
> > > > > > > > > > >
> > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > > 
> > > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > > 
> > > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > > an independent tag set?
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > > 
> > > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > > all namespaces of a controller?
> > > > > > > > > 
> > > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > > 
> > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > > 
> > > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > > implementation is often tradeoff between performance and cost.
> > > > > > 
> > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > > 
> > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > > 
> > > > > No.
> > > > > 
> > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > > number of pthread is equal to nr_hw_queues.
> > > > 
> > > > Good, I think we agree on that part.
> > > > 
> > > > Here is a summary of the ublk server model I've been describing:
> > > > 1. Each pthread has a separate io_uring context.
> > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > > >    command queue, etc).
> > > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > > >    space is shared across hardware submission queues.
> > > > 4. Each pthread allocates tags from its subrange without coordinating
> > > >    with other threads. This is cheap and simple.
> > > 
> > > That is also not doable.
> > > 
> > > The tag space can be pretty small, such as, usb-storage queue depth
> > > is just 1, and usb card reader can support multi lun too.
> > 
> > If the tag space is very limited, just create one pthread.
> 
> What I meant is that sub-range isn't doable.
> 
> And pthread is aligned with queue, that is nothing to do with nr_tags.
> 
> > 
> > > That is just one extreme example, but there can be more low queue depth
> > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > > there could be some implementation with less.
> > 
> > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> > its own independent tag space. That means NVMe devices with low queue
> > depths work fine in the model I described.
> 
> NVMe PCI isn't special, and it is covered by current ublk abstract, so one way
> or another, we should not support both sub-range or non-sub-range for
> avoiding unnecessary complexity.
> 
> "Each pthread has its own independent tag space" may mean two things
> 
> 1) each LUN/NS is implemented in standalone process space:
> - so every queue of each LUN has its own space, but all the queues with
> same ID share the whole queue tag space
> - that matches with current ublksrv
> - also easier to implement
> 
> 2) all LUNs/NSs are implemented in single process space
> - so each pthread handles one queue for all NSs/LUNs
> 
> Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
> char device has to handle multiple LUNs/NSs(disks), which still need
> (big) ublk interface change. Also this way can't scale for single queue
> devices.

The model I described is neither 1) nor 2). It's similar to 2), but I'm
not sure why you say the ublk interface needs to be changed. I'm afraid
I haven't explained it well, sorry. I'll try to describe it again with
an NVMe PCI adapter being handled by userspace.

There is a single ublk server process with an NVMe PCI device opened
using VFIO.

There are N pthreads and each pthread has 1 io_uring context and 1 NVMe
PCI SQ/CQ pair. The size of the SQ and CQ rings is QD.

The NVMe PCI device has M Namespaces. The ublk server creates M
ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD.

The Linux block layer sees M block devices with N nr_hw_queues and QD
queue_depth. The actual NVMe PCI device resources are less than what the
Linux block layer sees because each SQ/CQ pair is used for M
ublk_devices. In other words, Linux thinks there can be M * N * QD
requests in flight but in reality the NVMe PCI adapter only supports N *
QD requests.
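
To put made-up numbers on it: with M = 4 Namespaces, N = 8 SQ/CQ pairs and
QD = 128, the block layer accounts for 4 * 8 * 128 = 4096 in-flight
requests, while the adapter can only have 8 * 128 = 1024 commands
outstanding.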

Now I'll describe how userspace can take care of the mismatch between
the Linux block layer and the NVMe PCI device without doing much work:

Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for
each of the M Namespaces.

When userspace receives a request from ublk, it cannot simply copy the
struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier
(CID) field. There would be collisions between the tags used across the
M ublk_queues that the pthread services.

Userspace selects a free tag (e.g. from a bitmap with QD elements) and
uses that as the NVMe Command Identifier. This is trivial because each
pthread has its own bitmap and NVMe Command Identifiers are per-SQ.

If there are no free tags then the request is placed in the pthread's
per Namespace overflow list. Whenever an NVMe command completes, the
overflow lists are scanned. One pending request is submitted to the NVMe
PCI adapter in a round-robin fashion until the lists are empty or there
are no more free tags.

That's it. No ublk API changes are necessary. The userspace code is not
slow or complex (just a bitmap and overflow list).
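
Roughly, in C, the per-pthread state could look like this (an illustrative
sketch only; the names below are made up and none of this is existing
ublksrv code):

/*
 * Illustrative sketch: per-pthread NVMe CID allocation plus an overflow
 * list for the case where all QD CIDs of this pthread's SQ are in flight.
 */
#include <stdint.h>
#include <sys/queue.h>

struct pending_req {
        uint16_t ublk_q_id;             /* which of the M ublk_queues */
        uint16_t ublk_tag;              /* tag from the ublk io command */
        STAILQ_ENTRY(pending_req) link;
};

struct pthread_ctx {
        unsigned int qd;                /* SQ/CQ ring size of this pthread */
        uint64_t *cid_bitmap;           /* qd bits, 1 = CID in use */
        /* scanned whenever an NVMe completion frees a CID */
        STAILQ_HEAD(, pending_req) overflow;
};

/*
 * No locking: the SQ, and therefore the CID space, belongs to this
 * pthread only.
 */
static int pt_alloc_cid(struct pthread_ctx *ctx)
{
        for (unsigned int i = 0; i < ctx->qd; i++) {
                if (!(ctx->cid_bitmap[i / 64] & (1ULL << (i % 64)))) {
                        ctx->cid_bitmap[i / 64] |= 1ULL << (i % 64);
                        return i;
                }
        }
        return -1;      /* caller queues the request on ctx->overflow */
}

static void pt_free_cid(struct pthread_ctx *ctx, unsigned int cid)
{
        ctx->cid_bitmap[cid / 64] &= ~(1ULL << (cid % 64));
}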

The approach also works for SCSI or devices that only support 1 request
in flight at a time, with small tweaks.

Going back to the beginning of the discussion: I think it's possible to
write a ublk server that handles multiple LUNs/NS today.

> Another thing is that io command buffer has to be shared among all LUNs/
> NSs. So interface change has to cover shared io command buffer.

I think the main advantage of extending the ublk API to share io command
buffers between ublk_devices is to reduce userspace memory consumption?

It eliminates the need to over-provision I/O buffers for write requests
(or use the slower UBLK_IO_NEED_GET_DATA approach).

> With zero copy support, io buffer sharing needn't to be considered, that
> can be a bit easier.
> 
> In short, the sharing of (tag, io command buffer, io buffer) needs to be
> considered for shared host ublk disks.
> 
> Actually I prefer to 1), which matches with current design, and we can
> just add host concept into ublk, and implementation could be easier.
> 
> BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2],
> and the performance improvement is pretty nice, so I think it is one reasonable
> requirement to support "shared host" ublk disks for covering multi-lun or multi-ns.
> 
> [1] https://github.com/ming1/ubdsrv/issues/49
> [2] https://github.com/longhorn/longhorn

Nice performance improvement!

I agree with you that the ublk API should have a way to declare the
resource constraints for multi-LUN/NS servers (i.e. share the tag_set). I
guess the simplest way to do that is by passing a reference to an
existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing
else about the ublk API needs to change, at least for tags.

Solving I/O buffer over-provisioning sounds similar to io_uring's
provided buffer mechanism :).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-17 14:41                                     ` Stefan Hajnoczi
@ 2023-03-18  0:30                                       ` Ming Lei
  2023-03-20 12:34                                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-03-18  0:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote:
> On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote:
> > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > 
> > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Hi Ming,
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > >> 
> > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > > > >> >> > > 
> > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > > > >> >> > > 
> > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > > > >> >> > > devices.
> > > > > > > > > > > >> >> > > 
> > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > > > >> >> > > userspace.
> > > > > > > > > > > >> >> > > 
> > > > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > > > >> >> > 
> > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > > > >> >> 
> > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> 
> > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > > > >> 
> > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > > > >> 
> > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > > > >> 
> > > > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > > > >
> > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > > > handle them just fine.
> > > > > > > > > > > >
> > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > > > 
> > > > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > > > 
> > > > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > > > an independent tag set?
> > > > > > > > > > 
> > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > > > 
> > > > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > > > all namespaces of a controller?
> > > > > > > > > > 
> > > > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > > > 
> > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > > > 
> > > > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > > > implementation is often tradeoff between performance and cost.
> > > > > > > 
> > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > > > 
> > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > > > 
> > > > > > No.
> > > > > > 
> > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > > > number of pthread is equal to nr_hw_queues.
> > > > > 
> > > > > Good, I think we agree on that part.
> > > > > 
> > > > > Here is a summary of the ublk server model I've been describing:
> > > > > 1. Each pthread has a separate io_uring context.
> > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > > > >    command queue, etc).
> > > > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > > > >    space is shared across hardware submission queues.
> > > > > 4. Each pthread allocates tags from its subrange without coordinating
> > > > >    with other threads. This is cheap and simple.
> > > > 
> > > > That is also not doable.
> > > > 
> > > > The tag space can be pretty small, such as, usb-storage queue depth
> > > > is just 1, and usb card reader can support multi lun too.
> > > 
> > > If the tag space is very limited, just create one pthread.
> > 
> > What I meant is that sub-range isn't doable.
> > 
> > And pthread is aligned with queue, that is nothing to do with nr_tags.
> > 
> > > 
> > > > That is just one extreme example, but there can be more low queue depth
> > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > > > there could be some implementation with less.
> > > 
> > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> > > its own independent tag space. That means NVMe devices with low queue
> > > depths work fine in the model I described.
> > 
> > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way
> > or another, we should not support both sub-range or non-sub-range for
> > avoiding unnecessary complexity.
> > 
> > "Each pthread has its own independent tag space" may mean two things
> > 
> > 1) each LUN/NS is implemented in standalone process space:
> > - so every queue of each LUN has its own space, but all the queues with
> > same ID share the whole queue tag space
> > - that matches with current ublksrv
> > - also easier to implement
> > 
> > 2) all LUNs/NSs are implemented in single process space
> > - so each pthread handles one queue for all NSs/LUNs
> > 
> > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
> > char device has to handle multiple LUNs/NSs(disks), which still need
> > (big) ublk interface change. Also this way can't scale for single queue
> > devices.
> 
> The model I described is neither 1) or 2). It's similar to 2) but I'm
> not sure why you say the ublk interface needs to be changed. I'm afraid
> I haven't explained it well, sorry. I'll try to describe it again with
> an NVMe PCI adapter being handled by userspace.
> 
> There is a single ublk server process with an NVMe PCI device opened
> using VFIO.
> 
> There are N pthreads and each pthread has 1 io_uring context and 1 NVMe
> PCI SQ/CQ pair. The size of the SQ and CQ rings is QD.
> 
> The NVMe PCI device has M Namespaces. The ublk server creates M
> ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD.
> 
> The Linux block layer sees M block devices with N nr_hw_queues and QD
> queue_depth. The actual NVMe PCI device resources are less than what the
> Linux block layer sees because the each SQ/CQ pair is used for M
> ublk_devices. In other words, Linux thinks there can be M * N * QD
> requests in flight but in reality the NVMe PCI adapter only supports N *
> QD requests.

Yeah, but it is really bad.

Now QD is the host hardware queue depth, which can be very big, possibly
more than a thousand.

The ublk driver doesn't understand this kind of sharing (tag, io command
buffer, io buffers), so M * N * QD requests are submitted to the ublk
server, and CPUs/memory are wasted a lot.

Every device has to allocate command buffers for holding QD io commands,
and the command buffer is supposed to be per-host, instead of per-disk.
The same goes for io buffer pre-allocation on the userspace side.

Userspace has to re-tag the requests to avoid duplicated tags, and
requests have to be throttled on the ublk server side. If you implement tag
allocation on the userspace side, it is still a typical shared-data issue
in SMP: M pthreads contend for a single queue's tags from multiple CPUs.

> 
> Now I'll describe how userspace can take care of the mismatch between
> the Linux block layer and the NVMe PCI device without doing much work:
> 
> Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for
> each of the M Namespaces.
> 
> When userspace receives a request from ublk, it cannot simply copy the
> struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier
> (CID) field. There would be collisions between the tags used across the
> M ublk_queues that the pthread services.
> 
> Userspace selects a free tag (e.g. from a bitmap with QD elements) and
> uses that as the NVMe Command Identifier. This is trivial because each
> pthread has its own bitmap and NVMe Command Identifiers are per-SQ.

I believe I have explained that, in reality, the number of NVMe SQ/CQ
pairs can be less (or much less) than nr_cpu_ids, so the per-queue tags can
be allocated & freed among (nr_cpu_ids / nr_hw_queues) CPUs.

Not to mention that userspace is capable of overriding the pthread cpu
affinity, so it isn't trivial & cheap: M pthreads could run from more than
(nr_cpu_ids / nr_hw_queues) CPUs and contend on a single hw queue's tags.

> 
> If there are no free tags then the request is placed in the pthread's
> per Namespace overflow list. Whenever an NVMe command completes, the
> overflow lists are scanned. One pending request is submitted to the NVMe
> PCI adapter in a round-robin fashion until the lists are empty or there
> are no more free tags.
> 
> That's it. No ublk API changes are necessary. The userspace code is not
> slow or complex (just a bitmap and overflow list).

Fine, but I am not sure we need to support such a messy & poor
implementation.

> 
> The approach also works for SCSI or devices that only support 1 request
> in flight at a time, with small tweaks.
> 
> Going back to the beginning of the discussion: I think it's possible to
> write a ublk server that handles multiple LUNs/NS today.

It is possible, but it is poor in both performance and resource
utilization, and it requires a complicated ublk server implementation.

> 
> > Another thing is that io command buffer has to be shared among all LUNs/
> > NSs. So interface change has to cover shared io command buffer.
> 
> I think the main advantage of extending the ublk API to share io command
> buffers between ublk_devices is to reduce userspace memory consumption?
> 
> It eliminates the need to over-provision I/O buffers for write requests
> (or use the slower UBLK_IO_NEED_GET_DATA approach).

It not only avoids memory and cpu waste, but also simplifies the ublk
server.

> 
> > With zero copy support, io buffer sharing needn't to be considered, that
> > can be a bit easier.
> > 
> > In short, the sharing of (tag, io command buffer, io buffer) needs to be
> > considered for shared host ublk disks.
> > 
> > Actually I prefer to 1), which matches with current design, and we can
> > just add host concept into ublk, and implementation could be easier.
> > 
> > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2],
> > and the performance improvement is pretty nice, so I think it is one reasonable
> > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns.
> > 
> > [1] https://github.com/ming1/ubdsrv/issues/49
> > [2] https://github.com/longhorn/longhorn
> 
> Nice performance improvement!
> 
> I agree with you that the ublk API should have a way to declare the
> resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I
> guess the simplest way to do that is by passing a reference to an
> existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing
> else about the ublk API needs to change, at least for tags.

Basically (tags, io command buffer, io buffers) need to move from
per-disk scope to host/hw_queue-wide scope, so it is not that simple, but
it won't be too complicated.

> 
> Solving I/O buffer over-provisioning sounds similar to io_uring's
> provided buffer mechanism :).

blk-mq has built-in host/hw_queue-wide tag allocation, which can provide
unique tags for the ublk server from the ublk driver side, so everything
can be simplified a lot if we move (tag, io command buffer, io buffers) to
host/hw_queue-wide scope by telling ublk_driver that we are
BLK_MQ_F_TAG_QUEUE_SHARED.
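
Just to make the direction concrete, here is a rough kernel-side sketch
(struct ublk_host and the helpers below are assumptions for illustration,
not an actual ublk_drv change): one tag_set per host, reused by every disk
of that host, like scsi/nvme do.

#include <linux/blk-mq.h>

/* hypothetical per-host container, not in today's ublk_drv.c */
struct ublk_host {
        struct blk_mq_tag_set tag_set;  /* shared by all disks of this host */
};

static int ublk_host_init_tags(struct ublk_host *h,
                               const struct blk_mq_ops *ops,
                               unsigned int nr_hw_queues,
                               unsigned int queue_depth)
{
        h->tag_set.ops = ops;
        h->tag_set.nr_hw_queues = nr_hw_queues;
        h->tag_set.queue_depth = queue_depth;
        h->tag_set.numa_node = NUMA_NO_NODE;
        return blk_mq_alloc_tag_set(&h->tag_set);
}

static struct gendisk *ublk_host_add_disk(struct ublk_host *h, void *priv)
{
        /*
         * Every disk of the host reuses the same tag_set; once a second
         * request queue attaches, blk-mq marks the set
         * BLK_MQ_F_TAG_QUEUE_SHARED and arbitrates tags fairly, so the
         * ublk server sees host-wide unique tags.
         */
        return blk_mq_alloc_disk(&h->tag_set, priv);
}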

Not sure if io_uring's provided buffer is good here, because we need to
discard io buffers after the queue becomes idle. But it won't be a big
deal if zero copy can be supported.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-18  0:30                                       ` Ming Lei
@ 2023-03-20 12:34                                         ` Stefan Hajnoczi
  2023-03-20 15:30                                           ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-03-20 12:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

[-- Attachment #1: Type: text/plain, Size: 20002 bytes --]

On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote:
> On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote:
> > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote:
> > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > 
> > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> Hi Ming,
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > > > > >> >> > > devices.
> > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > > > > >> >> > > userspace.
> > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > > > > >> >> > 
> > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > > > > >> >> 
> > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > > > > >> 
> > > > > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > > > > >
> > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > > > > handle them just fine.
> > > > > > > > > > > > >
> > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > > > > 
> > > > > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > > > > an independent tag set?
> > > > > > > > > > > 
> > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > > > > 
> > > > > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > > > > all namespaces of a controller?
> > > > > > > > > > > 
> > > > > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > > > > 
> > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > > > > 
> > > > > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > > > > implementation is often tradeoff between performance and cost.
> > > > > > > > 
> > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > > > > 
> > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > > > > 
> > > > > > > No.
> > > > > > > 
> > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > > > > number of pthread is equal to nr_hw_queues.
> > > > > > 
> > > > > > Good, I think we agree on that part.
> > > > > > 
> > > > > > Here is a summary of the ublk server model I've been describing:
> > > > > > 1. Each pthread has a separate io_uring context.
> > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > > > > >    command queue, etc).
> > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > > > > >    space is shared across hardware submission queues.
> > > > > > 4. Each pthread allocates tags from its subrange without coordinating
> > > > > >    with other threads. This is cheap and simple.
> > > > > 
> > > > > That is also not doable.
> > > > > 
> > > > > The tag space can be pretty small, such as, usb-storage queue depth
> > > > > is just 1, and usb card reader can support multi lun too.
> > > > 
> > > > If the tag space is very limited, just create one pthread.
> > > 
> > > What I meant is that sub-range isn't doable.
> > > 
> > > And pthread is aligned with queue, that is nothing to do with nr_tags.
> > > 
> > > > 
> > > > > That is just one extreme example, but there can be more low queue depth
> > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > > > > there could be some implementation with less.
> > > > 
> > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> > > > its own independent tag space. That means NVMe devices with low queue
> > > > depths work fine in the model I described.
> > > 
> > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way
> > > or another, we should not support both sub-range or non-sub-range for
> > > avoiding unnecessary complexity.
> > > 
> > > "Each pthread has its own independent tag space" may mean two things
> > > 
> > > 1) each LUN/NS is implemented in standalone process space:
> > > - so every queue of each LUN has its own space, but all the queues with
> > > same ID share the whole queue tag space
> > > - that matches with current ublksrv
> > > - also easier to implement
> > > 
> > > 2) all LUNs/NSs are implemented in single process space
> > > - so each pthread handles one queue for all NSs/LUNs
> > > 
> > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
> > > char device has to handle multiple LUNs/NSs(disks), which still need
> > > (big) ublk interface change. Also this way can't scale for single queue
> > > devices.
> > 
> > The model I described is neither 1) or 2). It's similar to 2) but I'm
> > not sure why you say the ublk interface needs to be changed. I'm afraid
> > I haven't explained it well, sorry. I'll try to describe it again with
> > an NVMe PCI adapter being handled by userspace.
> > 
> > There is a single ublk server process with an NVMe PCI device opened
> > using VFIO.
> > 
> > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe
> > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD.
> > 
> > The NVMe PCI device has M Namespaces. The ublk server creates M
> > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD.
> > 
> > The Linux block layer sees M block devices with N nr_hw_queues and QD
> > queue_depth. The actual NVMe PCI device resources are less than what the
> > Linux block layer sees because the each SQ/CQ pair is used for M
> > ublk_devices. In other words, Linux thinks there can be M * N * QD
> > requests in flight but in reality the NVMe PCI adapter only supports N *
> > QD requests.
> 
> Yeah, but it is really bad.
> 
> Now QD is the host hard queue depth, which can be very big, and could be
> more than thousands.
> 
> ublk driver doesn't understand this kind of sharing(tag, io command buffer, io
> buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory
> are wasted a lot.
> 
> Every device has to allocate command buffers for holding QD io commands, and
> command buffer is supposed to be per-host, instead of per-disk. Same with io
> buffer pre-allocation in userspace side.

I agree with you that in cases with lots of LUNs (large M), block layer
and ublk driver per-request memory is allocated that cannot all be used
simultaneously.

> Userspace has to re-tag the requests for avoiding duplicated tag, and
> requests have to be throttled in ublk server side. If you implement tag allocation
> in userspace side, it is still one typical shared data issue in SMP, M pthreads
> contends on single tags from multiple CPUs.

Here I still disagree. There is no SMP contention with NVMe because tags
are per SQ. For SCSI the tag namespace is shared but each pthread can
trivially work with a sub-range to avoid SMP contention. If the tag
namespace is too small for sub-ranges, then there should be fewer
pthreads.
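
For the SCSI case, the sub-range split could be as simple as the following
sketch (illustrative only, names made up; the per-slice free/used tracking
can reuse the same private bitmap idea as the NVMe CID sketch earlier in
the thread):

/*
 * Illustrative only: carve a shared host tag space [0, nr_tags) into
 * per-pthread sub-ranges, so each pthread allocates from its own slice
 * without any cross-CPU coordination.
 */
struct tag_subrange {
        unsigned int base;      /* first host tag owned by this pthread */
        unsigned int nr;        /* number of tags in the sub-range */
};

static void tag_subrange_init(struct tag_subrange *sr, unsigned int nr_tags,
                              unsigned int nr_pthreads, unsigned int idx)
{
        unsigned int per = nr_tags / nr_pthreads;

        sr->base = idx * per;
        /* the last pthread also takes the remainder */
        sr->nr = (idx == nr_pthreads - 1) ? nr_tags - sr->base : per;
}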

> > 
> > Now I'll describe how userspace can take care of the mismatch between
> > the Linux block layer and the NVMe PCI device without doing much work:
> > 
> > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for
> > each of the M Namespaces.
> > 
> > When userspace receives a request from ublk, it cannot simply copy the
> > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier
> > (CID) field. There would be collisions between the tags used across the
> > M ublk_queues that the pthread services.
> > 
> > Userspace selects a free tag (e.g. from a bitmap with QD elements) and
> > uses that as the NVMe Command Identifier. This is trivial because each
> > pthread has its own bitmap and NVMe Command Identifiers are per-SQ.
> 
> I believe I have explained, in reality, NVME SQ/CQ pair can be less(
> or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed
> among CPUs of (nr_cpu_ids / nr_hw_queues).
> 
> Not mention userspace is capable of overriding the pthread cpu affinity,
> so it isn't trivial & cheap, M pthreads could be run from
> more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags.

I don't understand your nr_cpu_ids concerns. In the model I have
described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs)
and the SQ/CQ pairs are per pthread. There is no sharing of SQ/CQ pairs
across pthreads.

On a limited NVMe controller with nr_cpu_ids=128 and max_sq_cq_pairs=8,
there are only 8 pthreads. Each pthread has its own io_uring context
through which it handles M ublk_queues. Even if a pthread runs from more
than 1 CPU, its SQ Command Identifiers (tags) are only used by that
pthread and there is no SMP contention.

Can you explain where you see SMP contention for NVMe SQ Command
Identifiers?

> > 
> > If there are no free tags then the request is placed in the pthread's
> > per Namespace overflow list. Whenever an NVMe command completes, the
> > overflow lists are scanned. One pending request is submitted to the NVMe
> > PCI adapter in a round-robin fashion until the lists are empty or there
> > are no more free tags.
> > 
> > That's it. No ublk API changes are necessary. The userspace code is not
> > slow or complex (just a bitmap and overflow list).
> 
> Fine, but I am not sure we need to support such mess & pool implementation.
> 
> > 
> > The approach also works for SCSI or devices that only support 1 request
> > in flight at a time, with small tweaks.
> > 
> > Going back to the beginning of the discussion: I think it's possible to
> > write a ublk server that handles multiple LUNs/NS today.
> 
> It is possible, but it is poor in both performance and resource
> utilization, meantime with complicated ublk server implementation.

Okay. I wanted to make sure I wasn't missing a reason why it's
fundamentally impossible. Performance, resource utilization, or
complexity is debatable and I think I understand your position. I think
you're looking for a general solution that works well even with a high
number of LUNs, where the model I proposed wastes resources.

> 
> > 
> > > Another thing is that io command buffer has to be shared among all LUNs/
> > > NSs. So interface change has to cover shared io command buffer.
> > 
> > I think the main advantage of extending the ublk API to share io command
> > buffers between ublk_devices is to reduce userspace memory consumption?
> > 
> > It eliminates the need to over-provision I/O buffers for write requests
> > (or use the slower UBLK_IO_NEED_GET_DATA approach).
> 
> Not only avoiding memory and cpu waste, but also simplifying ublk
> server.
> 
> > 
> > > With zero copy support, io buffer sharing needn't to be considered, that
> > > can be a bit easier.
> > > 
> > > In short, the sharing of (tag, io command buffer, io buffer) needs to be
> > > considered for shared host ublk disks.
> > > 
> > > Actually I prefer to 1), which matches with current design, and we can
> > > just add host concept into ublk, and implementation could be easier.
> > > 
> > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2],
> > > and the performance improvement is pretty nice, so I think it is one reasonable
> > > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns.
> > > 
> > > [1] https://github.com/ming1/ubdsrv/issues/49
> > > [2] https://github.com/longhorn/longhorn
> > 
> > Nice performance improvement!
> > 
> > I agree with you that the ublk API should have a way to declare the
> > resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I
> > guess the simplest way to do that is by passing a reference to an
> > existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing
> > else about the ublk API needs to change, at least for tags.
> 
> Basically (tags, io command buffer, io buffers) need to move into
> host/hw_queue wide from disk wide, so not so simple, but won't
> be too complicated.
> 
> > 
> > Solving I/O buffer over-provisioning sounds similar to io_uring's
> > provided buffer mechanism :).
> 
> blk-mq has built-in host/hw_queue wide tag allocation, which can provide
> unique tag for ublk server from ublk driver side, so everything can be
> simplified a lot if we move (tag, io command buffer, io buffers) into
> host/hw_queue wide by telling ublk_driver that we are
> BLK_MQ_F_TAG_QUEUE_SHARED.
> 
> Not sure if io_uring's provided buffer is good here, cause we need to
> discard io buffers after queue become idle. But it won't be one big
> deal if zero copy can be supported.

If the per-request ublk resources are shared like tags as you described,
then that's a nice solution that also solves I/O buffer
over-provisioning.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-20 12:34                                         ` Stefan Hajnoczi
@ 2023-03-20 15:30                                           ` Ming Lei
  2023-03-21 11:25                                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-03-20 15:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang, ming.lei

On Mon, Mar 20, 2023 at 08:34:17AM -0400, Stefan Hajnoczi wrote:
> On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote:
> > On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote:
> > > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote:
> > > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> > > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> Hi Ming,
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > > > > > >> >> > > devices.
> > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > > > > > >> >> > > userspace.
> > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > > > > > >> >> > 
> > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > > > > > >> >> 
> > > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > > > > > >> 
> > > > > > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > > > > > handle them just fine.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > > > > > an independent tag set?
> > > > > > > > > > > > 
> > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > > > > > 
> > > > > > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > > > > > all namespaces of a controller?
> > > > > > > > > > > > 
> > > > > > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > > > > > 
> > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > > > > > 
> > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > > > > > implementation is often tradeoff between performance and cost.
> > > > > > > > > 
> > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > > > > > 
> > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > > > > > 
> > > > > > > > No.
> > > > > > > > 
> > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > > > > > number of pthread is equal to nr_hw_queues.
> > > > > > > 
> > > > > > > Good, I think we agree on that part.
> > > > > > > 
> > > > > > > Here is a summary of the ublk server model I've been describing:
> > > > > > > 1. Each pthread has a separate io_uring context.
> > > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > > > > > >    command queue, etc).
> > > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > > > > > >    space is shared across hardware submission queues.
> > > > > > > 4. Each pthread allocates tags from its subrange without coordinating
> > > > > > >    with other threads. This is cheap and simple.
> > > > > > 
> > > > > > That is also not doable.
> > > > > > 
> > > > > > The tag space can be pretty small, such as, usb-storage queue depth
> > > > > > is just 1, and usb card reader can support multi lun too.
> > > > > 
> > > > > If the tag space is very limited, just create one pthread.
> > > > 
> > > > What I meant is that sub-range isn't doable.
> > > > 
> > > > And pthread is aligned with queue, that is nothing to do with nr_tags.
> > > > 
> > > > > 
> > > > > > That is just one extreme example, but there can be more low queue depth
> > > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > > > > > there could be some implementation with less.
> > > > > 
> > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> > > > > its own independent tag space. That means NVMe devices with low queue
> > > > > depths work fine in the model I described.
> > > > 
> > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way
> > > > or another, we should not support both sub-range or non-sub-range for
> > > > avoiding unnecessary complexity.
> > > > 
> > > > "Each pthread has its own independent tag space" may mean two things
> > > > 
> > > > 1) each LUN/NS is implemented in standalone process space:
> > > > - so every queue of each LUN has its own space, but all the queues with
> > > > same ID share the whole queue tag space
> > > > - that matches with current ublksrv
> > > > - also easier to implement
> > > > 
> > > > 2) all LUNs/NSs are implemented in single process space
> > > > - so each pthread handles one queue for all NSs/LUNs
> > > > 
> > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
> > > > char device has to handle multiple LUNs/NSs(disks), which still need
> > > > (big) ublk interface change. Also this way can't scale for single queue
> > > > devices.
> > > 
> > > The model I described is neither 1) or 2). It's similar to 2) but I'm
> > > not sure why you say the ublk interface needs to be changed. I'm afraid
> > > I haven't explained it well, sorry. I'll try to describe it again with
> > > an NVMe PCI adapter being handled by userspace.
> > > 
> > > There is a single ublk server process with an NVMe PCI device opened
> > > using VFIO.
> > > 
> > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe
> > > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD.
> > > 
> > > The NVMe PCI device has M Namespaces. The ublk server creates M
> > > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD.
> > > 
> > > The Linux block layer sees M block devices with N nr_hw_queues and QD
> > > queue_depth. The actual NVMe PCI device resources are less than what the
> > > Linux block layer sees because the each SQ/CQ pair is used for M
> > > ublk_devices. In other words, Linux thinks there can be M * N * QD
> > > requests in flight but in reality the NVMe PCI adapter only supports N *
> > > QD requests.
> > 
> > Yeah, but it is really bad.
> > 
> > Now QD is the host hard queue depth, which can be very big, and could be
> > more than thousands.
> > 
> > ublk driver doesn't understand this kind of sharing(tag, io command buffer, io
> > buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory
> > are wasted a lot.
> > 
> > Every device has to allocate command buffers for holding QD io commands, and
> > command buffer is supposed to be per-host, instead of per-disk. Same with io
> > buffer pre-allocation in userspace side.
> 
> I agree with you in cases with lots of LUNs (large M), block layer and
> ublk driver per-request memory is allocated that cannot be used
> simultaneously.
> 
> > Userspace has to re-tag the requests for avoiding duplicated tag, and
> > requests have to be throttled in ublk server side. If you implement tag allocation
> > in userspace side, it is still one typical shared data issue in SMP, M pthreads
> > contends on single tags from multiple CPUs.
> 
> Here I still disagree. There is no SMP contention with NVMe because tags
> are per SQ. For SCSI the tag namespace is shared but each pthread can
> trivially work with a sub-range to avoid SMP contention. If the tag
> namespace is too small for sub-ranges, then there should be fewer
> pthreads.
> 
> > > 
> > > Now I'll describe how userspace can take care of the mismatch between
> > > the Linux block layer and the NVMe PCI device without doing much work:
> > > 
> > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for
> > > each of the M Namespaces.
> > > 
> > > When userspace receives a request from ublk, it cannot simply copy the
> > > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier
> > > (CID) field. There would be collisions between the tags used across the
> > > M ublk_queues that the pthread services.
> > > 
> > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and
> > > uses that as the NVMe Command Identifier. This is trivial because each
> > > pthread has its own bitmap and NVMe Command Identifiers are per-SQ.
> > 
> > I believe I have explained, in reality, NVME SQ/CQ pair can be less(
> > or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed
> > among CPUs of (nr_cpu_ids / nr_hw_queues).
> > 
> > Not mention userspace is capable of overriding the pthread cpu affinity,
> > so it isn't trivial & cheap, M pthreads could be run from
> > more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags.
> 
> I don't understand your nr_cpu_ids concerns. In the model I have
> described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs)
> and the SQ/CQ pairs are per pthread. There is no sharing of SQ/CQ pairs
> across pthreads.
> 
> On a limited NVMe controller nr_cpu_ids=128 and max_sq_cq_pairs=8, so
> there are only 8 pthreads. Each pthread has its own io_uring context
> through which it handles M ublk_queues. Even if a pthread runs from more
> than 1 CPU, its SQ Command Identifiers (tags) are only used by that
> pthread and there is no SMP contention.
> 
> Can you explain where you see SMP contention for NVMe SQ Command
> Identifiers?

The ublk server queue pthread is aligned with the hw queue in the ublk driver, and its
affinity is retrieved from the ublk blk-mq hw queue's affinity.

So if nr_hw_queues is 8 and nr_cpu_ids is 128, there will be 16 cpus mapped
to each hw queue. For example, hw queue 0's cpu affinity is cpu 0 ~ 15,
and the affinity of the pthread handling hw queue 0 is cpu 0 ~ 15 too.

Now if we have M ublk devices, pthread 0 (hw queue 0) of these M devices
shares the same hw queue tags. The M pthreads can be scheduled among cpu 0 ~ 15,
and tags are allocated by M pthreads across cpu 0 ~ 15, so there is contention.

That is why I mentioned that if all devices are implemented in the same process, and
each pthread handles the host hardware queue for all M devices, the contention
can be avoided. However, the ublk server still needs lots of changes.
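
To make the contention above concrete, here is a rough sketch (hypothetical C,
not from ublksrv or the ublk driver; all names are made up) of M per-device
queue pthreads allocating from one shared hw queue tag bitmap:

#include <stdatomic.h>

/* one instance per host hw queue, shared by pthread 0 of all M devices */
struct hw_queue_tags {
	unsigned int nr_tags;
	atomic_ulong bitmap[16];	/* 1 bit per tag, up to 1024 tags, assumes 64-bit long */
};

static int alloc_tag(struct hw_queue_tags *t)
{
	for (unsigned int i = 0; i < t->nr_tags; i++) {
		unsigned long mask = 1UL << (i % 64);
		unsigned long old = atomic_fetch_or(&t->bitmap[i / 64], mask);

		if (!(old & mask))
			return i;	/* won this bit */
		/* lost the race: the shared cacheline bounces between pthreads */
	}
	return -1;	/* no free tag, the request has to be throttled */
}

Every allocation is an atomic RMW on cachelines shared by pthreads running
across cpu 0 ~ 15, which is exactly the SMP contention described above.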

More importantly, we need one generic design that covers both SQ and MQ devices.

> 
> > > 
> > > If there are no free tags then the request is placed in the pthread's
> > > per Namespace overflow list. Whenever an NVMe command completes, the
> > > overflow lists are scanned. One pending request is submitted to the NVMe
> > > PCI adapter in a round-robin fashion until the lists are empty or there
> > > are no more free tags.
> > > 
> > > That's it. No ublk API changes are necessary. The userspace code is not
> > > slow or complex (just a bitmap and overflow list).
> > 
> > Fine, but I am not sure we need to support such mess & pool implementation.
> > 
> > > 
> > > The approach also works for SCSI or devices that only support 1 request
> > > in flight at a time, with small tweaks.
> > > 
> > > Going back to the beginning of the discussion: I think it's possible to
> > > write a ublk server that handles multiple LUNs/NS today.
> > 
> > It is possible, but it is poor in both performance and resource
> > utilization, meantime with complicated ublk server implementation.
> 
> Okay. I wanted to make sure I wasn't missing a reason why it's
> fundamentally impossible. Performance, resource utilization, or
> complexity is debatable and I think I understand your position. I think
> you're looking for a general solution that works well even with a high
> number of LUNs, where the model I proposed wastes resources.

As I mentioned, we need one generic design that handles both SQ and MQ,
and we won't take a hybrid approach that mixes tag sub-ranges with the MQ scheme.

> 
> > 
> > > 
> > > > Another thing is that io command buffer has to be shared among all LUNs/
> > > > NSs. So interface change has to cover shared io command buffer.
> > > 
> > > I think the main advantage of extending the ublk API to share io command
> > > buffers between ublk_devices is to reduce userspace memory consumption?
> > > 
> > > It eliminates the need to over-provision I/O buffers for write requests
> > > (or use the slower UBLK_IO_NEED_GET_DATA approach).
> > 
> > Not only avoiding memory and cpu waste, but also simplifying ublk
> > server.
> > 
> > > 
> > > > With zero copy support, io buffer sharing needn't to be considered, that
> > > > can be a bit easier.
> > > > 
> > > > In short, the sharing of (tag, io command buffer, io buffer) needs to be
> > > > considered for shared host ublk disks.
> > > > 
> > > > Actually I prefer to 1), which matches with current design, and we can
> > > > just add host concept into ublk, and implementation could be easier.
> > > > 
> > > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2],
> > > > and the performance improvement is pretty nice, so I think it is one reasonable
> > > > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns.
> > > > 
> > > > [1] https://github.com/ming1/ubdsrv/issues/49
> > > > [2] https://github.com/longhorn/longhorn
> > > 
> > > Nice performance improvement!
> > > 
> > > I agree with you that the ublk API should have a way to declare the
> > > resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I
> > > guess the simplest way to do that is by passing a reference to an
> > > existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing
> > > else about the ublk API needs to change, at least for tags.
> > 
> > Basically (tags, io command buffer, io buffers) need to move into
> > host/hw_queue wide from disk wide, so not so simple, but won't
> > be too complicated.
> > 
> > > 
> > > Solving I/O buffer over-provisioning sounds similar to io_uring's
> > > provided buffer mechanism :).
> > 
> > blk-mq has built-in host/hw_queue wide tag allocation, which can provide
> > unique tag for ublk server from ublk driver side, so everything can be
> > simplified a lot if we move (tag, io command buffer, io buffers) into
> > host/hw_queue wide by telling ublk_driver that we are
> > BLK_MQ_F_TAG_QUEUE_SHARED.
> > 
> > Not sure if io_uring's provided buffer is good here, cause we need to
> > discard io buffers after queue become idle. But it won't be one big
> > deal if zero copy can be supported.
> 
> If the per-request ublk resources are shared like tags as you described,
> then that's a nice solution that also solves I/O buffer
> over-provisioning.

BTW, io_uring's provided buffers can't work here, since we use per-queue/pthread
io_uring at the device level, but the buffers actually belong to the host's
hardware queue.
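
For illustration, a rough liburing sketch (hypothetical; error handling trimmed
and the helper name is made up) of why provided buffers don't fit: they are
registered against a single ring, while the pool would have to be shared by the
rings of all M devices mapped to the same host hw queue:

#include <liburing.h>

static int provide_pool(struct io_uring *ring, void *pool, int buf_sz, int nr)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	/* buffer group 0 becomes visible to this ring only, not to the
	 * rings of the other devices sharing the same host hw queue */
	io_uring_prep_provide_buffers(sqe, pool, buf_sz, nr, 0, 0);
	return io_uring_submit(ring);
}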


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-20 15:30                                           ` Ming Lei
@ 2023-03-21 11:25                                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2023-03-21 11:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch, ZiyangZhang

On Mon, Mar 20, 2023 at 11:30:00PM +0800, Ming Lei wrote:
> On Mon, Mar 20, 2023 at 08:34:17AM -0400, Stefan Hajnoczi wrote:
> > On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote:
> > > On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote:
> > > > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote:
> > > > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote:
> > > > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> > > > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> Hi Ming,
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > > > > > > > >> >> > > > > > Hello,
> > > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > > > > > > > >> >> > > > > > 
> > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > > > > > > > > >> >> > > > > 
> > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > > > > > > > >> >> > > > 
> > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > > > > > > > > > >> >> > > > the case of scsi and nvme.
> > > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > > > > > > > >> >> > > devices.
> > > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > > > > > > > >> >> > > userspace.
> > > > > > > > > > > > > > >> >> > > 
> > > > > > > > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > > > > > > > >> >> > 
> > > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk
> > > > > > > > > > > > > > >> >> > device is independent, and can't shard tags.
> > > > > > > > > > > > > > >> >> 
> > > > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > > > > > > > >> >> it just sub-optimal?
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host
> > > > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to
> > > > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > > > > > > > >> 
> > > > > > > > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can
> > > > > > > > > > > > > > > handle them just fine.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't
> > > > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The
> > > > > > > > > > > > > > > current ublk interface needs to be extended for covering this case.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > > > > > > > an independent tag set?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > > > > > > > all namespaces of a controller?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance
> > > > > > > > > > > > > (shared tags allocation in SMP is one hard problem)
> > > > > > > > > > > > 
> > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > > > > > > > 
> > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most
> > > > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues).
> > > > > > > > > > > That is because queue isn't free in both software and hardware, which
> > > > > > > > > > > implementation is often tradeoff between performance and cost.
> > > > > > > > > > 
> > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > > > > > > > thought the idea was the ublk server creates as many threads as it needs
> > > > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > > > > > > > 
> > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > > > > > > > 
> > > > > > > > > No.
> > > > > > > > > 
> > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total
> > > > > > > > > number of pthread is equal to nr_hw_queues.
> > > > > > > > 
> > > > > > > > Good, I think we agree on that part.
> > > > > > > > 
> > > > > > > > Here is a summary of the ublk server model I've been describing:
> > > > > > > > 1. Each pthread has a separate io_uring context.
> > > > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> > > > > > > >    command queue, etc).
> > > > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag
> > > > > > > >    space is shared across hardware submission queues.
> > > > > > > > 4. Each pthread allocates tags from its subrange without coordinating
> > > > > > > >    with other threads. This is cheap and simple.
> > > > > > > 
> > > > > > > That is also not doable.
> > > > > > > 
> > > > > > > The tag space can be pretty small, such as, usb-storage queue depth
> > > > > > > is just 1, and usb card reader can support multi lun too.
> > > > > > 
> > > > > > If the tag space is very limited, just create one pthread.
> > > > > 
> > > > > What I meant is that sub-range isn't doable.
> > > > > 
> > > > > And pthread is aligned with queue, that is nothing to do with nr_tags.
> > > > > 
> > > > > > 
> > > > > > > That is just one extreme example, but there can be more low queue depth
> > > > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but
> > > > > > > there could be some implementation with less.
> > > > > > 
> > > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has
> > > > > > its own independent tag space. That means NVMe devices with low queue
> > > > > > depths work fine in the model I described.
> > > > > 
> > > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way
> > > > > or another, we should not support both sub-range or non-sub-range for
> > > > > avoiding unnecessary complexity.
> > > > > 
> > > > > "Each pthread has its own independent tag space" may mean two things
> > > > > 
> > > > > 1) each LUN/NS is implemented in standalone process space:
> > > > > - so every queue of each LUN has its own space, but all the queues with
> > > > > same ID share the whole queue tag space
> > > > > - that matches with current ublksrv
> > > > > - also easier to implement
> > > > > 
> > > > > 2) all LUNs/NSs are implemented in single process space
> > > > > - so each pthread handles one queue for all NSs/LUNs
> > > > > 
> > > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk
> > > > > char device has to handle multiple LUNs/NSs(disks), which still need
> > > > > (big) ublk interface change. Also this way can't scale for single queue
> > > > > devices.
> > > > 
> > > > The model I described is neither 1) or 2). It's similar to 2) but I'm
> > > > not sure why you say the ublk interface needs to be changed. I'm afraid
> > > > I haven't explained it well, sorry. I'll try to describe it again with
> > > > an NVMe PCI adapter being handled by userspace.
> > > > 
> > > > There is a single ublk server process with an NVMe PCI device opened
> > > > using VFIO.
> > > > 
> > > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe
> > > > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD.
> > > > 
> > > > The NVMe PCI device has M Namespaces. The ublk server creates M
> > > > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD.
> > > > 
> > > > The Linux block layer sees M block devices with N nr_hw_queues and QD
> > > > queue_depth. The actual NVMe PCI device resources are less than what the
> > > > Linux block layer sees because the each SQ/CQ pair is used for M
> > > > ublk_devices. In other words, Linux thinks there can be M * N * QD
> > > > requests in flight but in reality the NVMe PCI adapter only supports N *
> > > > QD requests.
> > > 
> > > Yeah, but it is really bad.
> > > 
> > > Now QD is the host hard queue depth, which can be very big, and could be
> > > more than thousands.
> > > 
> > > ublk driver doesn't understand this kind of sharing(tag, io command buffer, io
> > > buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory
> > > are wasted a lot.
> > > 
> > > Every device has to allocate command buffers for holding QD io commands, and
> > > command buffer is supposed to be per-host, instead of per-disk. Same with io
> > > buffer pre-allocation in userspace side.
> > 
> > I agree with you in cases with lots of LUNs (large M), block layer and
> > ublk driver per-request memory is allocated that cannot be used
> > simultaneously.
> > 
> > > Userspace has to re-tag the requests for avoiding duplicated tag, and
> > > requests have to be throttled in ublk server side. If you implement tag allocation
> > > in userspace side, it is still one typical shared data issue in SMP, M pthreads
> > > contends on single tags from multiple CPUs.
> > 
> > Here I still disagree. There is no SMP contention with NVMe because tags
> > are per SQ. For SCSI the tag namespace is shared but each pthread can
> > trivially work with a sub-range to avoid SMP contention. If the tag
> > namespace is too small for sub-ranges, then there should be fewer
> > pthreads.
> > 
> > > > 
> > > > Now I'll describe how userspace can take care of the mismatch between
> > > > the Linux block layer and the NVMe PCI device without doing much work:
> > > > 
> > > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for
> > > > each of the M Namespaces.
> > > > 
> > > > When userspace receives a request from ublk, it cannot simply copy the
> > > > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier
> > > > (CID) field. There would be collisions between the tags used across the
> > > > M ublk_queues that the pthread services.
> > > > 
> > > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and
> > > > uses that as the NVMe Command Identifier. This is trivial because each
> > > > pthread has its own bitmap and NVMe Command Identifiers are per-SQ.
> > > 
> > > I believe I have explained, in reality, NVME SQ/CQ pair can be less(
> > > or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed
> > > among CPUs of (nr_cpu_ids / nr_hw_queues).
> > > 
> > > Not mention userspace is capable of overriding the pthread cpu affinity,
> > > so it isn't trivial & cheap, M pthreads could be run from
> > > more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags.
> > 
> > I don't understand your nr_cpu_ids concerns. In the model I have
> > described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs)
> > and the SQ/CQ pairs are per pthread. There is no sharing of SQ/CQ pairs
> > across pthreads.
> > 
> > On a limited NVMe controller nr_cpu_ids=128 and max_sq_cq_pairs=8, so
> > there are only 8 pthreads. Each pthread has its own io_uring context
> > through which it handles M ublk_queues. Even if a pthread runs from more
> > than 1 CPU, its SQ Command Identifiers (tags) are only used by that
> > pthread and there is no SMP contention.
> > 
> > Can you explain where you see SMP contention for NVMe SQ Command
> > Identifiers?
> 
> The ublk server queue pthread is aligned with the hw queue in the ublk driver, and its
> affinity is retrieved from the ublk blk-mq hw queue's affinity.
> 
> So if nr_hw_queues is 8 and nr_cpu_ids is 128, there will be 16 cpus mapped
> to each hw queue. For example, hw queue 0's cpu affinity is cpu 0 ~ 15,
> and the affinity of the pthread handling hw queue 0 is cpu 0 ~ 15 too.
> 
> Now if we have M ublk devices, pthread 0 (hw queue 0) of these M devices
> shares the same hw queue tags. The M pthreads can be scheduled among cpu 0 ~ 15,
> and tags are allocated by M pthreads across cpu 0 ~ 15, so there is contention.
> 
> That is why I mentioned that if all devices are implemented in the same process, and
> each pthread handles the host hardware queue for all M devices, the contention
> can be avoided. However, the ublk server still needs lots of changes.

I see. In the model I described, each pthread services all M devices, so
the contention is avoided.
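
A minimal sketch of that per-pthread model (hypothetical C, names made up;
one private SQ/CQ pair and Command Identifier bitmap per pthread):

#include <stdint.h>

#define QD 1024		/* SQ/CQ ring size of this pthread's queue pair */

struct sq_thread_ctx {
	/* owned by exactly one pthread, so no atomics or locks are needed */
	uint64_t used_cids[QD / 64];	/* 1 bit per NVMe Command Identifier */
	/* this pthread's io_uring context and NVMe SQ/CQ pair live here too */
};

/* allocate a CID for a ublk request from any of the M devices it serves */
static int alloc_cid(struct sq_thread_ctx *ctx)
{
	for (unsigned int w = 0; w < QD / 64; w++) {
		if (ctx->used_cids[w] != ~0ULL) {
			unsigned int b = __builtin_ctzll(~ctx->used_cids[w]);

			ctx->used_cids[w] |= 1ULL << b;
			return w * 64 + b;	/* unique within this SQ only */
		}
	}
	return -1;	/* SQ full: park the request on an overflow list */
}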

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-03-21 11:26 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
2023-02-06 17:53 ` Hannes Reinecke
2023-03-08  8:50   ` Hans Holmberg
2023-03-08 12:27     ` Ming Lei
2023-02-06 18:26 ` Bart Van Assche
2023-02-08  1:38   ` Ming Lei
2023-02-08 18:02     ` Bart Van Assche
2023-02-06 20:27 ` Stefan Hajnoczi
2023-02-08  2:12   ` Ming Lei
2023-02-08 12:17     ` Stefan Hajnoczi
2023-02-13  3:47       ` Ming Lei
2023-02-13 19:13         ` Stefan Hajnoczi
2023-02-15  0:51           ` Ming Lei
2023-02-15 15:27             ` Stefan Hajnoczi
2023-02-16  0:46               ` Ming Lei
2023-02-16 15:28                 ` Stefan Hajnoczi
2023-02-16  9:44             ` Andreas Hindborg
2023-02-16 10:45               ` Ming Lei
2023-02-16 11:21                 ` Andreas Hindborg
2023-02-17  2:20                   ` Ming Lei
2023-02-17 16:39                     ` Stefan Hajnoczi
2023-02-18 11:22                       ` Ming Lei
2023-02-18 18:38                         ` Stefan Hajnoczi
2023-02-22 23:17                           ` Ming Lei
2023-02-23 20:18                             ` Stefan Hajnoczi
2023-03-02  3:22                               ` Ming Lei
2023-03-02 15:09                                 ` Stefan Hajnoczi
2023-03-17  3:10                                   ` Ming Lei
2023-03-17 14:41                                     ` Stefan Hajnoczi
2023-03-18  0:30                                       ` Ming Lei
2023-03-20 12:34                                         ` Stefan Hajnoczi
2023-03-20 15:30                                           ` Ming Lei
2023-03-21 11:25                                             ` Stefan Hajnoczi
2023-03-16 14:24                                 ` Stefan Hajnoczi
