* [SPDK] Questions about vhost memory registration
@ 2018-11-08  0:49 Nikos Dragazis
From: Nikos Dragazis @ 2018-11-08  0:49 UTC (permalink / raw)
  To: spdk


Hi all,

I would like to raise a couple of questions about the vhost target.

My first question is:

During vhost-user negotiation, the master sends its memory regions to
the slave. The slave maps each region into its own address space. The mmap
addresses are page aligned (that is, 4KB aligned) but not necessarily 2MB
aligned. When vhost registers the memory regions in
spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB here:

https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534

The aligned addresses may not have a valid page table entry. So, in the case
of uio, it is possible that during vtophys translation the aligned
addresses are touched here:

https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287

and this could lead to a segfault. Is this a possible scenario?
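
For reference, the kind of rounding involved looks roughly like the sketch
below (illustrative only: the macro names and the example address are made
up, not the actual SPDK code). The point is that the rounded-down start and
rounded-up end can fall outside the real mapping:

#include <stdint.h>
#include <stdio.h>

#define SIZE_2MB (2ULL * 1024 * 1024)

/* Round down/up to a 2MB boundary (illustrative macros, not the SPDK ones). */
#define ALIGN_DOWN_2MB(x) ((uintptr_t)(x) & ~(uintptr_t)(SIZE_2MB - 1))
#define ALIGN_UP_2MB(x)   (((uintptr_t)(x) + SIZE_2MB - 1) & ~(uintptr_t)(SIZE_2MB - 1))

int main(void)
{
    /* Hypothetical mmap() result: 4KB aligned, but not 2MB aligned. */
    uintptr_t mmap_addr = 0x7f0000201000UL;
    size_t    mmap_size = 4 * 1024 * 1024;

    uintptr_t start = ALIGN_DOWN_2MB(mmap_addr);
    uintptr_t end   = ALIGN_UP_2MB(mmap_addr + mmap_size);

    /*
     * [start, mmap_addr) and [mmap_addr + mmap_size, end) lie outside the
     * actual mapping, so touching them during translation can fault if no
     * other mapping happens to cover those pages.
     */
    printf("registering [%#lx, %#lx) for a mapping at [%#lx, %#lx)\n",
           (unsigned long)start, (unsigned long)end,
           (unsigned long)mmap_addr, (unsigned long)(mmap_addr + mmap_size));
    return 0;
}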

My second question is:

The commit message here:

https://review.gerrithub.io/c/spdk/spdk/+/410071

says:

“We've had cases (especially with vhost) in the past where we have
a valid vaddr but the backing page was not assigned yet.”.

This refers to the vhost target, where shared memory is allocated by the
QEMU process and the SPDK process maps this memory.

Let’s consider this case. After mapping the vhost-user memory regions, they
are registered in the vtophys map. If vfio is disabled,
vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
These addresses must refer to pinned memory because vfio is not there to
do the pinning. Therefore, the VM’s memory has to be backed by hugepages.
Hugepages are allocated by the QEMU process, way before vhost memory
registration. After their allocation, hugepages will always have a
backing page because they never get swapped out. So, I do not see any
case where the backing page is not assigned yet, and thus I do not see
any need to touch the mapped page.
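
For context, the pagemap-based lookup works roughly like the following
simplified sketch (helper name made up, error handling trimmed; not the
actual vtophys code). The initial read of the page is the "touch" in
question:

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096ULL

/* Simplified vaddr -> paddr lookup via /proc/self/pagemap (not the SPDK code). */
static uint64_t pagemap_vtophys(void *vaddr)
{
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return UINT64_MAX;

    /*
     * Touch the page first so the kernel populates the page table entry;
     * this read is exactly what can segfault if vaddr falls outside any
     * valid mapping (e.g. a rounded-down registration address).
     */
    volatile uint8_t touch = *(volatile uint8_t *)vaddr;
    (void)touch;

    if (pread(fd, &entry, sizeof(entry),
              (off_t)((uintptr_t)vaddr / PAGE_SZ * sizeof(entry))) != sizeof(entry) ||
        !(entry & (1ULL << 63))) {              /* bit 63: page present */
        close(fd);
        return UINT64_MAX;
    }
    close(fd);

    /* Bits 0-54 hold the PFN; add the offset within the page. */
    return (entry & ((1ULL << 55) - 1)) * PAGE_SZ +
           ((uintptr_t)vaddr & (PAGE_SZ - 1));
}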

This is my current understanding in brief and I'd welcome any feedback
you may have:

1. address alignment in spdk_vhost_dev_mem_register() is buggy because
   the aligned address may not have a valid page table entry, thus
   triggering a segfault when it is touched in
   vtophys_get_paddr_pagemap() -> rte_atomic64_read().
2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
   because the VM’s memory has to be backed by hugepages, and hugepages
   are not handled by demand paging and never get swapped out.

I am looking forward to your feedback.

Thanks,
Nikos



* Re: [SPDK] Questions about vhost memory registration
@ 2018-12-03  8:19 Wodkowski, PawelX
From: Wodkowski, PawelX @ 2018-12-03  8:19 UTC (permalink / raw)
  To: spdk




> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Friday, November 30, 2018 7:01 PM
> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> Subject: Re: [SPDK] Questions about vhost memory registration
> 
> On 29/11/18 11:22 a.m., Wodkowski, PawelX wrote:
> 
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> Dragazis
> >> Sent: Thursday, November 29, 2018 12:24 AM
> >> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> >> Subject: Re: [SPDK] Questions about vhost memory registration
> >>
> >> Let me add one more question:
> >>
> >> Why do the virtio-scsi and virtio-blk bdev modules not support the
> >> VIRTIO_F_IOMMU_PLATFORM feature? Have you tested these two bdevs
> >> with the presence of a vIOMMU in QEMU?
> >>
> >> Here is the problematic scenario I have in mind:
> >>
> >> Let’s say we have a VM with a vIOMMU and a virtio-scsi HBA with a couple
> >> of SCSI disks which we want to use as storage backends for an SPDK
> >> target app. The SPDK virtio-scsi bdev driver does not support the
> >> VIRTIO_F_IOMMU_PLATFORM feature. This means that the device will always
> >> bypass the vIOMMU for the DMA operations. So, in this case, physical
> >> addresses must still be provided to the device by the SPDK virtio
> >> driver, even though an IOMMU appears to be present. The problem is that
> >> the virtio driver passes IOVAs instead of physical addresses. This is
> >> done here:
> >> https://github.com/spdk/spdk/blob/master/lib/virtio/virtio.c#L538
> >> (Actually, it passes the address kept in vtophys map table. The vtophys
> >> map keeps physical addresses in case vfio is disabled and IOVAs in case
> >> vfio is enabled.)
> >>
> > We don't support this in vhost because we are using a modified copy
> > of the DPDK rte_vhost implementation (see spdk/lib/vhost/rte_vhost). The DPDK
> > version we forked from was quite old and VIRTIO_F_IOMMU_PLATFORM
> > was not supported. We tried upstreaming our changes to the DPDK community
> > so we could switch to DPDK rte_vhost but got strong resistance from them.
> > The current vhost implementation in DPDK does support VIRTIO_F_IOMMU_PLATFORM,
> > but we just have no free resources to backport it to the SPDK rte_vhost copy.
> 
> OK. I see your point. So, you are saying that there is no need for the
> SPDK virtio-scsi bdev driver to support the VIRTIO_F_IOMMU_PLATFORM
> feature because we know it is not supported by the SPDK vhost-scsi
> target. How about adding this feature to the virtio-scsi bdev driver to
> enable supporting device backends other than vhost-user-scsi? Like, say,
> a QEMU user space SCSI target. Why is this driver restricted to vhost-user
> SCSI targets?
> 

If the vhost target supports VIRTIO_F_IOMMU_PLATFORM, then this
should be transparent to the SPDK virtio-scsi bdev running inside the guest system.

Pawel



* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-30 18:00 Nikos Dragazis
From: Nikos Dragazis @ 2018-11-30 18:00 UTC (permalink / raw)
  To: spdk


On 29/11/18 11:22 a.m., Wodkowski, PawelX wrote:

>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
>> Sent: Thursday, November 29, 2018 12:24 AM
>> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
>> Subject: Re: [SPDK] Questions about vhost memory registration
>>
>> Let me add one more question:
>>
>> Why do the virtio-scsi and virtio-blk bdev modules not support the
>> VIRTIO_F_IOMMU_PLATFORM feature? Have you tested these two bdevs
>> with the presence of a vIOMMU in QEMU?
>>
>> Here is the problematic scenario I have in mind:
>>
>> Let’s say we have a VM with a vIOMMU and a virtio-scsi HBA with a couple
>> of SCSI disks which we want to use as storage backends for an SPDK
>> target app. The SPDK virtio-scsi bdev driver does not support the
>> VIRTIO_F_IOMMU_PLATFORM feature. This means that the device will
>> always
>> bypass the vIOMMU for the DMA operations. So, in this case, physical
>> addresses must still be provided to the device by the SPDK virtio
>> driver, even though an IOMMU appears to be present. The problem is that
>> the virtio driver passes IOVAs instead of physical addresses. This is
>> done here:
>> https://github.com/spdk/spdk/blob/master/lib/virtio/virtio.c#L538
>> (Actually, it passes the address kept in vtophys map table. The vtophys
>> map keeps physical addresses in case vfio is disabled and IOVAs in case
>> vfio is enabled.)
>>
> We don't support this in vhost because we are using a modified copy
> of the DPDK rte_vhost implementation (see spdk/lib/vhost/rte_vhost). The DPDK
> version we forked from was quite old and VIRTIO_F_IOMMU_PLATFORM
> was not supported. We tried upstreaming our changes to the DPDK community
> so we could switch to DPDK rte_vhost but got strong resistance from them.
> The current vhost implementation in DPDK does support VIRTIO_F_IOMMU_PLATFORM,
> but we just have no free resources to backport it to the SPDK rte_vhost copy.

OK. I see your point. So, you are saying that there is no need for the
SPDK virtio-scsi bdev driver to support the VIRTIO_F_IOMMU_PLATFORM
feature because we know it is not supported by the SPDK vhost-scsi
target. How about adding this feature to the virtio-scsi bdev driver to
enable supporting device backends other than vhost-user-scsi? Like, say,
a QEMU user space SCSI target. Why is this driver restricted to vhost-user
SCSI targets?




* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-29  9:22 Wodkowski, PawelX
From: Wodkowski, PawelX @ 2018-11-29  9:22 UTC (permalink / raw)
  To: spdk




> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Thursday, November 29, 2018 12:24 AM
> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> Subject: Re: [SPDK] Questions about vhost memory registration
> 
> On 23/11/18 10:27 a.m., Wodkowski, PawelX wrote:
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> Dragazis
> >> Sent: Thursday, November 22, 2018 7:52 PM
> >> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> >> Subject: Re: [SPDK] Questions about vhost memory registration
> >>
> >>
> >> On 12/11/18 1:48 μ.μ., Wodkowski, PawelX wrote:
> >>>> -----Original Message-----
> >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> >> Dragazis
> >>>> Sent: Saturday, November 10, 2018 3:37 AM
> >>>> To: spdk(a)lists.01.org
> >>>> Subject: Re: [SPDK] Questions about vhost memory registration
> >>>>
> >>>> On 8/11/18 10:45 π.μ., Wodkowski, PawelX wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> >>>> Dragazis
> >>>>>> Sent: Thursday, November 8, 2018 1:49 AM
> >>>>>> To: spdk(a)lists.01.org
> >>>>>> Subject: [SPDK] Questions about vhost memory registration
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I would like to raise a couple of questions about vhost target.
> >>>>>>
> >>>>>> My first question is:
> >>>>>>
> >>>>>> During vhost-user negotiation, the master sends its memory regions
> to
> >>>>>> the slave. Slave maps each region in its own address space. The
> mmap
> >>>>>> addresses are page aligned (that is 4KB aligned) but not necessarily
> >> 2MB
> >>>>>> aligned. When vhost registers the memory regions in
> >>>>>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to
> >> 2MB
> >>>>>> here:
> >>>>> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require
> >> that
> >>>> initiator
> >>>>> pass memory backed by huge pages >= 2MB in size. On x86 MMU this
> >> imply
> >>>>> that page alignment is the same as page size which is >= 2MB (99%
> sure -
> >>>>> can someone confirm this to get this +1% ;) ).
> >>>> Yes, you are probably right. I didn’t know how the kernel achieves
> >>>> having a single page table entry for a contiguous 2MB virtual address
> >>>> range. If I get this right, in case of x86_64, the answer is using a
> >>>> page middle directory (PMD) entry pointing directly to a 2MB physical
> >>>> page rather than to a lower-level page table. And since the PMDs are
> 2MB
> >>>> aligned by definition, the resulting virtual address will be 2MB
> >>>> aligned.
> >>>>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
> >>>>>>
> >>>>>> The aligned addresses may not have a valid page table entry. So, in
> case
> >>>>>> of uio, it is possible that during vtophys translation, the aligned
> >>>>>> addresses are touched here:
> >>>>>>
> >>>>>>
> >>
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>>>>> and this could lead to a segfault. Is this a possible scenario?
> >>>>>>
> >>>>>> My second question is:
> >>>>>>
> >>>>>> The commit message here:
> >>>>>>
> >>>>>> https://review.gerrithub.io/c/spdk/spdk/+/410071
> >>>>>>
> >>>>>> says:
> >>>>>>
> >>>>>> “We've had cases (especially with vhost) in the past where we have
> >>>>>> a valid vaddr but the backing page was not assigned yet.”.
> >>>>>>
> >>>>>> This refers to the vhost target, where shared memory is allocated by
> >> the
> >>>>>> QEMU process and the SPDK process maps this memory.
> >>>>>>
> >>>>>> Let’s consider this case. After mapping vhost-user memory regions,
> >> they
> >>>>>> are registered to the vtophys map. In case vfio is disabled,
> >>>>>> vtophys_get_paddr_pagemap() finds the corresponding physical
> >>>> addresses.
> >>>>>> These addresses must refer to pinned memory because vfio is not
> >> there
> >>>> to
> >>>>>> do the pinning. Therefore, VM’s memory has to be backed by
> >> hugepages.
> >>>>>> Hugepages are allocated by the QEMU process, way before vhost
> >>>> memory
> >>>>>> registration. After their allocation, hugepages will always have a
> >>>>>> backing page because they never get swapped out. So, I do not see
> any
> >>>>>> such case where backing page is not assigned yet and thus I do not
> see
> >>>>>> any need to touch the mapped page.
> >>>>>>
> >>>>>> This is my current understanding in brief and I'd welcome any
> feedback
> >>>>>> you may have:
> >>>>>>
> >>>>>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy
> >>>> because
> >>>>>>    the aligned address may not have a valid page table entry thus
> >>>>>>    triggering a segfault when being touched in
> >>>>>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
> >>>>>> 2. touching the page in vtophys_get_paddr_pagemap() is
> unnecessary
> >>>>>>    because VM’s memory has to be backed by hugepages and
> >> hugepages
> >>>> are
> >>>>>>    not handled by demand paging strategy and they are never
> swapped
> >>>> out.
> >>>>>> I am looking forward to your feedback.
> >>>>>>
> >>>>> Current start/end calculation in spdk_vhost_dev_mem_register()
> might
> >> be
> >>>> a actually
> >>>>> NOP for memory backed by hugepages.
> >>>> It seems so. However, there are other platforms that support
> hugepage
> >>>> sizes less than 2MB. I do not know if SPDK supports such platforms.
> >>> I think that currently only >=2MB HP are supported.
> >>>
> >>>>> I think that we can try to validate alignmet of the memory in
> >>>> spdk_vhost_dev_mem_register()
> >>>>> and fail if it is not 2MB aligned.
> >>>> This sounds reasonable to me. However, I believe it would be better if
> >>>> we could support registering non-2MB aligned virtual addresses. Is this
> >>>> a WIP? I have found this commit:
> >>>>
> >>>> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
> >>>>
> >>>> It is not clear to me why the community has chosen 2MB granularity for
> >>>> the SPDK map tables.
> >>> SPKD vhost was created some time after iSCSI and NVMf targets and it
> >>> needs to obey existing limitations. To be honest, vhost don't really need
> >>> to use huge pages, as this is the limitation of:
> >>>
> >>> 1. DMA - memory passed to DMA need to be:
> >>>    - pinned memory - can't be swapped, physical address can't change
> >>>    - contiguous (VFIO complicate this case)
> >>>   - virtual address must have assigned huge page so SPKD can discover
> >>>      its physical address
> >>>
> >>> 2. env_dpdk/memory
> >>> this was implemented for NVMe drivers that have limitations that single
> >>> transaction can't span 2MB address boundary - PRP have this limitation
> >>> I don't know if SGLs overcome this. This also required from us to
> implement
> >>> this in vhost:
> >>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
> >>>
> >>> This is why 2MB granularity was chosen.
> >> So, you are saying that vhost doesn’t really need to use huge pages. Are
> > It is possible for SPDK vhost backend to be modified in a way that it won't
> > require hugepages. But again when passing payload descriptors down to
> > physical devices the memory must be "good" for them. So if you use
> bdev_malloc
> > (without IOAT acceleration!) or bdev_aio as backing device the hugepages
> > backed memory requirement disappear as host kernel will handle all page
> > faults for you. This is not true for other bdevs that use DMA like nvme.
> 
> Agreed.
> 
> >> you referring to SPDK’s memory? This would make sense. And I think, this
> >> is also true for the nvme and virtio-scsi bdev modules, which I am
> >> currently using. In these cases, the storage backend performs zero-copy
> >> DMA directly from VM’s huge page backed memory. Is this correct?
> > For virtio-scsi bdev it is (might be) correct but not for nvme (bdev_nvme?).
> 
> Basically, I was referring to a local NVMe drive and I had the SPDK NVMe
> PCIe driver in mind. I guess you say “no” for NVMe because the NVMe bdev
> module handles both locally attached and remote NVMe drives. So, in case
> of a locally attached NVMe drive, is the DMA operation zero-copy?
> 

Yes, in most cases it is zero-copy (copy mode might be used if the guest
gives misaligned memory, but that is not the case we are talking about here).

> >> As far as VM’s memory is concerned, is it true that huge page backed
> >> memory is just a limitation of uio? Is it necessary to use huge page
> >> backed memory for the VM in case of vfio?
> >>
> > This is the question that VFIO kernel module developers could have
> answare
> > for. But I bet $5 that it is NOT true. Let me write this again: memory
> > for DMA need to be:
> >
> > 1. Pinned
> > 2. vtophys(addr) translation need to possible during memory registration
> > 3. vtophys(addr) must always return the same result for the same 'add'
> >
> > Kernel can do all above for any pages at any time but in userspace, only
> > hugepages guarantee all these so we are using them.
> 
> I think this is not true. I think that the vfio kernel module can do the
> job. In case of x86 architecture with an IOMMU, the vfio kernel module
> exposes an ioctl type called “VFIO_IOMMU_MAP_DMA”. This is used by
> SPDK
> to register the user space memory that will be used for DMA. The vfio
> serves this ioctl by basically doing two things:
> 
> - pin the registered user space memory. This means that this memory will
>   never get swapped out or moved to another physical address. This is
>   done here:
>   https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_typ
> e1.c#L1046
> 
> - program the IOMMU. The kernel IOMMU driver will insert the appropriate
>   entries in the device IOVA domain in a way that the device will be
>   seeing this memory as contiguous. This means that the registered
>   memory, although it might be physically scattered, it will be mapped
>   to a contiguous IOVA segment. This is done here:
>   https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_typ
> e1.c#L1055
> 
> So, I believe that the vfio kernel module serves the DMA memory
> limitations you ‘ve already mentioned, but I will post a relevant
> question in the vfio-users mailing list to get more feedback on this.
> 

Asking on the vfio-users mailing list is a good idea, but if this is true you are good
to go here.

> > There is interesting article here https://lwn.net/Articles/600502/ about
> DMA
> > and memory. Maybe it an describe it better than me :)
> 
> Let me add one more question:
> 
> Why do the virtio-scsi and virtio-blk bdev modules not support the
> VIRTIO_F_IOMMU_PLATFORM feature? Have you tested these two bdevs
> with the presence of a vIOMMU in QEMU?
> 
> Here is the problematic scenario I have in mind:
> 
> Let’s say we have a VM with a vIOMMU and a virtio-scsi HBA with a couple
> of SCSI disks which we want to use as storage backends for an SPDK
> target app. The SPDK virtio-scsi bdev driver does not support the
> VIRTIO_F_IOMMU_PLATFORM feature. This means that the device will always
> bypass the vIOMMU for the DMA operations. So, in this case, physical
> addresses must still be provided to the device by the SPDK virtio
> driver, even though an IOMMU appears to be present. The problem is that
> the virtio driver passes IOVAs instead of physical addresses. This is
> done here:
> https://github.com/spdk/spdk/blob/master/lib/virtio/virtio.c#L538
> (Actually, it passes the address kept in vtophys map table. The vtophys
> map keeps physical addresses in case vfio is disabled and IOVAs in case
> vfio is enabled.)
> 

We don't support this in vhost because we are using a modified copy
of the DPDK rte_vhost implementation (see spdk/lib/vhost/rte_vhost). The DPDK
version we forked from was quite old and VIRTIO_F_IOMMU_PLATFORM
was not supported. We tried upstreaming our changes to the DPDK community
so we could switch to DPDK rte_vhost but got strong resistance from them.
The current vhost implementation in DPDK does support VIRTIO_F_IOMMU_PLATFORM,
but we just have no free resources to backport it to the SPDK rte_vhost copy.

> >
> >>>>> Have you hit any segfault there?
> >>>> Yes. I will give you a brief description.
> >>>>
> >>>> As I have already announced here:
> >>>>
> >>>> https://lists.01.org/pipermail/spdk/2018-October/002528.html
> >>>>
> >>>> I am currently working on an alternative vhost-user transport. I am
> >>>> shipping the SPDK vhost target into a dedicated storage appliance VM.
> >>>> Inspired by this post:
> >>>>
> >>>> https://wiki.qemu.org/Features/VirtioVhostUser
> >>>>
> >>>> I am using a dedicated virtio device called “virtio-vhost-user” to
> >>>> extend the vhost-user control plane. This device intercepts the
> >>>> vhost-user protocol messages from the unix domain socket on the host
> >> and
> >>>> inserts them into a virtqueue. In case a SET_MEM_TABLE message
> arrives
> >>>> from the unix socket, it maps the memory regions set by the master
> and
> >>>> exposes them to the slave guest as an MMIO PCI memory region.
> >>>>
> >>>> So, instead of mapping hugepage backed memory regions, the vhost
> >> target,
> >>>> running in slave guest user space, maps segments of an MMIO BAR of
> the
> >>>> virtio-vhost-user device.
> >>>>
> >>>> Thus, in my case, the mapped addresses are not necessarily 2MB
> aligned.
> >>>> The segfault is happening in a specific test case. That is when I do
> >>>> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
> >>>> “construct_vhost_scsi_controller”.
> >>>> In my code, this implies calling “spdk_pci_device_attach” ->
> >>>> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
> >>>> which in turn implies “rte_pci_map_device” ->
> “rte_pci_unmap_device” -
> >>>> “rte_pci_map_device”.
> >>>>
> >>>> During the first map, the MMIO BAR is always mapped to a 2MB aligned
> >>>> address (btw I can’t explain this, it can’t be a coincidence).
> >>>> However, this is not the case after the second map. The result is that I
> >>>> get a segfault when I register this non-2MB aligned address.
> >>>>
> >>>> So, I am seeking for a solution. I think the best would be to support
> >>>> registering non-2MB aligned addresses. This would be useful in
> general,
> >>>> when you want to register an MMIO BAR, which is necessary in cases of
> >>>> peer-to-peer DMA. I know that there is a use case for peer-to-peer
> DMA
> >>>> between NVMe SSDs in SPDK. I wonder how you manage the 2MB
> >> alignment
> >>>> restriction in that case.
> >>> Anything that you don't pass to DMA don't need to be 2MB aligned. If
> you
> >>> read/write this using CPU it don't need to be HP backed either.
> >>>
> >>> For DMA I think you will have to obey memory limitation I wrote above.
> >>>
> >>> Adding Darek, he can have some more (up to date) knowledge.
> >> OK, let me get this a little bit more clear. The dataplane is unchanged.
> >> The vhost target passes all the received descriptor addresses to the
> >> underlying storage backend for DMA (after address translation and iovec
> >> splitting). What I did was just to change the way the vhost target
> >> accesses the VM’s memory.
> >>
> >> The previous case was that the vhost target was running on the host and
> >> it mapped the master vhost memory regions sent over the unix socket.
> >> These memory regions relied on huge pages on the host physical
> memory.
> >>
> >> The current case is that the vhost target is running inside a VM and
> >> needs to have access to the other VM’s memory lying on host hugetlbfs.
> >> Therefore, I use a special device called virtio-vhost-user, which maps
> >> the master vhost memory regions and exposes them to guest user space
> as
> >> an MMIO BAR. That’s how the vhost target has access to host hugetlbfs
> >> from guest user space.
> >>
> >> So, the current case is that the storage backend (say an emulated NVMe
> >> controller) performs peer-to-peer DMA from this MMIO BAR. This
> requires
> >> that the vhost target has registered this BAR to the vtophys map. And
> >> here is the problem because spdk_mem_register() requires the address
> to
> >> be 2MB aligned but the MMIO BAR is not necessarily mapped to a 2MB
> >> aligned virtual address.
> >>
> >> Currently, I am using a temporary solution. I am mapping all PCI BARs
> >> from all PCI devices to 2MB aligned virtual addresses. I think this is
> >> not going to trigger any implications, is it? The other solution, is to
> > Should be fine for 2MB huge pages. The mmap() might fail for hugepages
> >2MB.
> >
> >> modify the env_dpdk library in order to allow registering non-2MB
> >> aligned addresses. Darek, in case you are reading this, I would
> >> appreciate any feedback at this point. I think you are working on this.
> >>
> >>>> Last but not least, in case you may know, I would appreciate if you
> >>>> could give me a situation where page touching in
> >>>> vtophys_get_paddr_pagemap() here:
> >>>>
> >>>>
> >>
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>>> is necessary. Is this related to vhost exclusively? In case of vhost,
> >>>> the memory regions are backed by hugepages and these are not
> allocated
> >>>> on demand by the kernel. What am I missing?
> >>> When you mmap() huge page you are getting virtual address but actual
> >>> physical hugepage might not be assigned yet. We are touching each
> page
> >>> to force kernel to assign the huge page to virtual addrsss so we can
> >> discover
> >>> vtophys mmaping.
> >>>
> >>>>>> Thanks,
> >>>>>> Nikos
> >>>>>>


* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-28 23:24 Nikos Dragazis
From: Nikos Dragazis @ 2018-11-28 23:24 UTC (permalink / raw)
  To: spdk


On 23/11/18 10:27 a.m., Wodkowski, PawelX wrote:
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
>> Sent: Thursday, November 22, 2018 7:52 PM
>> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
>> Subject: Re: [SPDK] Questions about vhost memory registration
>>
>>
>> On 12/11/18 1:48 μ.μ., Wodkowski, PawelX wrote:
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
>> Dragazis
>>>> Sent: Saturday, November 10, 2018 3:37 AM
>>>> To: spdk(a)lists.01.org
>>>> Subject: Re: [SPDK] Questions about vhost memory registration
>>>>
>>>> On 8/11/18 10:45 π.μ., Wodkowski, PawelX wrote:
>>>>>> -----Original Message-----
>>>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
>>>> Dragazis
>>>>>> Sent: Thursday, November 8, 2018 1:49 AM
>>>>>> To: spdk(a)lists.01.org
>>>>>> Subject: [SPDK] Questions about vhost memory registration
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to raise a couple of questions about vhost target.
>>>>>>
>>>>>> My first question is:
>>>>>>
>>>>>> During vhost-user negotiation, the master sends its memory regions to
>>>>>> the slave. Slave maps each region in its own address space. The mmap
>>>>>> addresses are page aligned (that is 4KB aligned) but not necessarily
>> 2MB
>>>>>> aligned. When vhost registers the memory regions in
>>>>>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to
>> 2MB
>>>>>> here:
>>>>> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require
>> that
>>>> initiator
>>>>> pass memory backed by huge pages >= 2MB in size. On x86 MMU this
>> imply
>>>>> that page alignment is the same as page size which is >= 2MB (99% sure -
>>>>> can someone confirm this to get this +1% ;) ).
>>>> Yes, you are probably right. I didn’t know how the kernel achieves
>>>> having a single page table entry for a contiguous 2MB virtual address
>>>> range. If I get this right, in case of x86_64, the answer is using a
>>>> page middle directory (PMD) entry pointing directly to a 2MB physical
>>>> page rather than to a lower-level page table. And since the PMDs are 2MB
>>>> aligned by definition, the resulting virtual address will be 2MB
>>>> aligned.
>>>>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
>>>>>>
>>>>>> The aligned addresses may not have a valid page table entry. So, in case
>>>>>> of uio, it is possible that during vtophys translation, the aligned
>>>>>> addresses are touched here:
>>>>>>
>>>>>>
>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>>>>> and this could lead to a segfault. Is this a possible scenario?
>>>>>>
>>>>>> My second question is:
>>>>>>
>>>>>> The commit message here:
>>>>>>
>>>>>> https://review.gerrithub.io/c/spdk/spdk/+/410071
>>>>>>
>>>>>> says:
>>>>>>
>>>>>> “We've had cases (especially with vhost) in the past where we have
>>>>>> a valid vaddr but the backing page was not assigned yet.”.
>>>>>>
>>>>>> This refers to the vhost target, where shared memory is allocated by
>> the
>>>>>> QEMU process and the SPDK process maps this memory.
>>>>>>
>>>>>> Let’s consider this case. After mapping vhost-user memory regions,
>> they
>>>>>> are registered to the vtophys map. In case vfio is disabled,
>>>>>> vtophys_get_paddr_pagemap() finds the corresponding physical
>>>> addresses.
>>>>>> These addresses must refer to pinned memory because vfio is not
>> there
>>>> to
>>>>>> do the pinning. Therefore, VM’s memory has to be backed by
>> hugepages.
>>>>>> Hugepages are allocated by the QEMU process, way before vhost
>>>> memory
>>>>>> registration. After their allocation, hugepages will always have a
>>>>>> backing page because they never get swapped out. So, I do not see any
>>>>>> such case where backing page is not assigned yet and thus I do not see
>>>>>> any need to touch the mapped page.
>>>>>>
>>>>>> This is my current understanding in brief and I'd welcome any feedback
>>>>>> you may have:
>>>>>>
>>>>>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy
>>>> because
>>>>>>    the aligned address may not have a valid page table entry thus
>>>>>>    triggering a segfault when being touched in
>>>>>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
>>>>>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
>>>>>>    because VM’s memory has to be backed by hugepages and
>> hugepages
>>>> are
>>>>>>    not handled by demand paging strategy and they are never swapped
>>>> out.
>>>>>> I am looking forward to your feedback.
>>>>>>
>>>>> Current start/end calculation in spdk_vhost_dev_mem_register() might
>> be
>>>> a actually
>>>>> NOP for memory backed by hugepages.
>>>> It seems so. However, there are other platforms that support hugepage
>>>> sizes less than 2MB. I do not know if SPDK supports such platforms.
>>> I think that currently only >=2MB HP are supported.
>>>
>>>>> I think that we can try to validate alignmet of the memory in
>>>> spdk_vhost_dev_mem_register()
>>>>> and fail if it is not 2MB aligned.
>>>> This sounds reasonable to me. However, I believe it would be better if
>>>> we could support registering non-2MB aligned virtual addresses. Is this
>>>> a WIP? I have found this commit:
>>>>
>>>> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
>>>>
>>>> It is not clear to me why the community has chosen 2MB granularity for
>>>> the SPDK map tables.
>>> SPKD vhost was created some time after iSCSI and NVMf targets and it
>>> needs to obey existing limitations. To be honest, vhost don't really need
>>> to use huge pages, as this is the limitation of:
>>>
>>> 1. DMA - memory passed to DMA need to be:
>>>    - pinned memory - can't be swapped, physical address can't change
>>>    - contiguous (VFIO complicate this case)
>>>   - virtual address must have assigned huge page so SPKD can discover
>>>      its physical address
>>>
>>> 2. env_dpdk/memory
>>> this was implemented for NVMe drivers that have limitations that single
>>> transaction can't span 2MB address boundary - PRP have this limitation
>>> I don't know if SGLs overcome this. This also required from us to implement
>>> this in vhost:
>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
>>>
>>> This is why 2MB granularity was chosen.
>> So, you are saying that vhost doesn’t really need to use huge pages. Are
> It is possible for SPDK vhost backend to be modified in a way that it won't
> require hugepages. But again when passing payload descriptors down to
> physical devices the memory must be "good" for them. So if you use bdev_malloc
> (without IOAT acceleration!) or bdev_aio as backing device the hugepages
> backed memory requirement disappear as host kernel will handle all page
> faults for you. This is not true for other bdevs that use DMA like nvme.

Agreed.

>> you referring to SPDK’s memory? This would make sense. And I think, this
>> is also true for the nvme and virtio-scsi bdev modules, which I am
>> currently using. In these cases, the storage backend performs zero-copy
>> DMA directly from VM’s huge page backed memory. Is this correct?
> For virtio-scsi bdev it is (might be) correct but not for nvme (bdev_nvme?).

Basically, I was referring to a local NVMe drive and I had the SPDK NVMe
PCIe driver in mind. I guess you say “no” for NVMe because the NVMe bdev
module handles both locally attached and remote NVMe drives. So, in case
of a locally attached NVMe drive, is the DMA operation zero-copy?

>> As far as VM’s memory is concerned, is it true that huge page backed
>> memory is just a limitation of uio? Is it necessary to use huge page
>> backed memory for the VM in case of vfio?
>>
> This is a question that the VFIO kernel module developers could have an answer
> for. But I bet $5 that it is NOT true. Let me write this again: memory
> for DMA needs to be:
>
> 1. Pinned
> 2. vtophys(addr) translation needs to be possible during memory registration
> 3. vtophys(addr) must always return the same result for the same 'addr'
>
> The kernel can do all of the above for any pages at any time, but in userspace only
> hugepages guarantee all of this, so we are using them.

I think this is not true. I think that the vfio kernel module can do the
job. In the case of the x86 architecture with an IOMMU, the vfio kernel module
exposes an ioctl called “VFIO_IOMMU_MAP_DMA”. This is used by SPDK
to register the user space memory that will be used for DMA. vfio
serves this ioctl by basically doing two things:

- pin the registered user space memory. This means that this memory will
  never get swapped out or moved to another physical address. This is
  done here:
  https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1046

- program the IOMMU. The kernel IOMMU driver will insert the appropriate
  entries in the device's IOVA domain so that the device will see
  this memory as contiguous. This means that the registered
  memory, although it might be physically scattered, will be mapped
  to a contiguous IOVA segment. This is done here:
  https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1055

So, I believe that the vfio kernel module satisfies the DMA memory
requirements you’ve already mentioned, but I will post a relevant
question to the vfio-users mailing list to get more feedback on this.
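
For illustration, registering a user space buffer through that ioctl looks
roughly like the sketch below (it assumes an already-configured VFIO
container fd; the helper name is made up and this is not SPDK's actual
registration path):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Map a user buffer into the device's IOVA space through the VFIO type1
 * IOMMU driver: the kernel pins the pages and programs the IOMMU so the
 * device sees [iova, iova + size) as contiguous, whatever the physical layout.
 */
static int vfio_map_dma(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map dma_map;

    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    dma_map.vaddr = (uint64_t)(uintptr_t)vaddr;
    dma_map.iova  = iova;
    dma_map.size  = size;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}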

> There is an interesting article here https://lwn.net/Articles/600502/ about DMA
> and memory. Maybe it can describe it better than me :)

Let me add one more question:

Why do the virtio-scsi and virtio-blk bdev modules not support the
VIRTIO_F_IOMMU_PLATFORM feature? Have you tested these two bdevs with
the presence of a vIOMMU in QEMU?

Here is the problematic scenario I have in mind:

Let’s say we have a VM with a vIOMMU and a virtio-scsi HBA with a couple
of SCSI disks which we want to use as storage backends for an SPDK
target app. The SPDK virtio-scsi bdev driver does not support the
VIRTIO_F_IOMMU_PLATFORM feature. This means that the device will always
bypass the vIOMMU for the DMA operations. So, in this case, physical
addresses must still be provided to the device by the SPDK virtio
driver, even though an IOMMU appears to be present. The problem is that
the virtio driver passes IOVAs instead of physical addresses. This is
done here:
https://github.com/spdk/spdk/blob/master/lib/virtio/virtio.c#L538
(Actually, it passes the address kept in the vtophys map table. The vtophys
map keeps physical addresses in case vfio is disabled and IOVAs in case
vfio is enabled.)
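
To make the concern concrete, the driver-side choice would look something
like the sketch below (all names here are hypothetical placeholders, not the
actual SPDK virtio code): a device that negotiated VIRTIO_F_IOMMU_PLATFORM
should be handed IOVAs, while one that did not must be handed physical
addresses.

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_IOMMU_PLATFORM 33   /* feature bit number from the virtio spec */

/* Hypothetical driver state: the feature bits negotiated with the device. */
struct vdev {
    uint64_t negotiated_features;
};

/* Placeholder address-lookup helpers, assumed to exist elsewhere. */
extern uint64_t lookup_phys_addr(void *buf);   /* vaddr -> physical address */
extern uint64_t lookup_iova(void *buf);        /* vaddr -> IOVA */

static bool has_feature(const struct vdev *dev, unsigned int bit)
{
    return (dev->negotiated_features & (1ULL << bit)) != 0;
}

/*
 * Address to put into a descriptor: only if the device actually negotiated
 * VIRTIO_F_IOMMU_PLATFORM do its DMAs go through the (v)IOMMU and expect
 * IOVAs; otherwise it bypasses the vIOMMU and needs physical addresses.
 */
static uint64_t desc_addr(const struct vdev *dev, void *buf)
{
    return has_feature(dev, VIRTIO_F_IOMMU_PLATFORM) ?
           lookup_iova(buf) : lookup_phys_addr(buf);
}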

>
>>>>> Have you hit any segfault there?
>>>> Yes. I will give you a brief description.
>>>>
>>>> As I have already announced here:
>>>>
>>>> https://lists.01.org/pipermail/spdk/2018-October/002528.html
>>>>
>>>> I am currently working on an alternative vhost-user transport. I am
>>>> shipping the SPDK vhost target into a dedicated storage appliance VM.
>>>> Inspired by this post:
>>>>
>>>> https://wiki.qemu.org/Features/VirtioVhostUser
>>>>
>>>> I am using a dedicated virtio device called “virtio-vhost-user” to
>>>> extend the vhost-user control plane. This device intercepts the
>>>> vhost-user protocol messages from the unix domain socket on the host
>> and
>>>> inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
>>>> from the unix socket, it maps the memory regions set by the master and
>>>> exposes them to the slave guest as an MMIO PCI memory region.
>>>>
>>>> So, instead of mapping hugepage backed memory regions, the vhost
>> target,
>>>> running in slave guest user space, maps segments of an MMIO BAR of the
>>>> virtio-vhost-user device.
>>>>
>>>> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
>>>> The segfault is happening in a specific test case. That is when I do
>>>> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
>>>> “construct_vhost_scsi_controller”.
>>>> In my code, this implies calling “spdk_pci_device_attach” ->
>>>> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
>>>> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” -
>>>> “rte_pci_map_device”.
>>>>
>>>> During the first map, the MMIO BAR is always mapped to a 2MB aligned
>>>> address (btw I can’t explain this, it can’t be a coincidence).
>>>> However, this is not the case after the second map. The result is that I
>>>> get a segfault when I register this non-2MB aligned address.
>>>>
>>>> So, I am seeking for a solution. I think the best would be to support
>>>> registering non-2MB aligned addresses. This would be useful in general,
>>>> when you want to register an MMIO BAR, which is necessary in cases of
>>>> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
>>>> between NVMe SSDs in SPDK. I wonder how you manage the 2MB
>> alignment
>>>> restriction in that case.
>>> Anything that you don't pass to DMA don't need to be 2MB aligned. If you
>>> read/write this using CPU it don't need to be HP backed either.
>>>
>>> For DMA I think you will have to obey memory limitation I wrote above.
>>>
>>> Adding Darek, he can have some more (up to date) knowledge.
>> OK, let me get this a little bit more clear. The dataplane is unchanged.
>> The vhost target passes all the received descriptor addresses to the
>> underlying storage backend for DMA (after address translation and iovec
>> splitting). What I did was just to change the way the vhost target
>> accesses the VM’s memory.
>>
>> The previous case was that the vhost target was running on the host and
>> it mapped the master vhost memory regions sent over the unix socket.
>> These memory regions relied on huge pages on the host physical memory.
>>
>> The current case is that the vhost target is running inside a VM and
>> needs to have access to the other VM’s memory lying on host hugetlbfs.
>> Therefore, I use a special device called virtio-vhost-user, which maps
>> the master vhost memory regions and exposes them to guest user space as
>> an MMIO BAR. That’s how the vhost target has access to host hugetlbfs
>> from guest user space.
>>
>> So, the current case is that the storage backend (say an emulated NVMe
>> controller) performs peer-to-peer DMA from this MMIO BAR. This requires
>> that the vhost target has registered this BAR to the vtophys map. And
>> here is the problem because spdk_mem_register() requires the address to
>> be 2MB aligned but the MMIO BAR is not necessarily mapped to a 2MB
>> aligned virtual address.
>>
>> Currently, I am using a temporary solution. I am mapping all PCI BARs
>> from all PCI devices to 2MB aligned virtual addresses. I think this is
>> not going to trigger any implications, is it? The other solution, is to
> Should be fine for 2MB huge pages. The mmap() might fail for hugepages >2MB.
>
>> modify the env_dpdk library in order to allow registering non-2MB
>> aligned addresses. Darek, in case you are reading this, I would
>> appreciate any feedback at this point. I think you are working on this.
>>
>>>> Last but not least, in case you may know, I would appreciate if you
>>>> could give me a situation where page touching in
>>>> vtophys_get_paddr_pagemap() here:
>>>>
>>>>
>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>>> is necessary. Is this related to vhost exclusively? In case of vhost,
>>>> the memory regions are backed by hugepages and these are not allocated
>>>> on demand by the kernel. What am I missing?
>>> When you mmap() a huge page you are getting a virtual address, but the actual
>>> physical hugepage might not be assigned yet. We are touching each page
>>> to force the kernel to assign the huge page to the virtual address so we can
>>> discover the vtophys mapping.
>>>
>>>>>> Thanks,
>>>>>> Nikos
>>>>>>



* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-23  8:27 Wodkowski, PawelX
From: Wodkowski, PawelX @ 2018-11-23  8:27 UTC (permalink / raw)
  To: spdk


> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Thursday, November 22, 2018 7:52 PM
> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> Subject: Re: [SPDK] Questions about vhost memory registration
> 
> 
> On 12/11/18 1:48 p.m., Wodkowski, PawelX wrote:
> >
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> Dragazis
> >> Sent: Saturday, November 10, 2018 3:37 AM
> >> To: spdk(a)lists.01.org
> >> Subject: Re: [SPDK] Questions about vhost memory registration
> >>
> >> On 8/11/18 10:45 π.μ., Wodkowski, PawelX wrote:
> >>>> -----Original Message-----
> >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> >> Dragazis
> >>>> Sent: Thursday, November 8, 2018 1:49 AM
> >>>> To: spdk(a)lists.01.org
> >>>> Subject: [SPDK] Questions about vhost memory registration
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I would like to raise a couple of questions about vhost target.
> >>>>
> >>>> My first question is:
> >>>>
> >>>> During vhost-user negotiation, the master sends its memory regions to
> >>>> the slave. Slave maps each region in its own address space. The mmap
> >>>> addresses are page aligned (that is 4KB aligned) but not necessarily
> 2MB
> >>>> aligned. When vhost registers the memory regions in
> >>>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to
> 2MB
> >>>> here:
> >>> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require
> that
> >> initiator
> >>> pass memory backed by huge pages >= 2MB in size. On x86 MMU this
> imply
> >>> that page alignment is the same as page size which is >= 2MB (99% sure -
> >>> can someone confirm this to get this +1% ;) ).
> >> Yes, you are probably right. I didn’t know how the kernel achieves
> >> having a single page table entry for a contiguous 2MB virtual address
> >> range. If I get this right, in case of x86_64, the answer is using a
> >> page middle directory (PMD) entry pointing directly to a 2MB physical
> >> page rather than to a lower-level page table. And since the PMDs are 2MB
> >> aligned by definition, the resulting virtual address will be 2MB
> >> aligned.
> >>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
> >>>>
> >>>> The aligned addresses may not have a valid page table entry. So, in case
> >>>> of uio, it is possible that during vtophys translation, the aligned
> >>>> addresses are touched here:
> >>>>
> >>>>
> >>
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>>> and this could lead to a segfault. Is this a possible scenario?
> >>>>
> >>>> My second question is:
> >>>>
> >>>> The commit message here:
> >>>>
> >>>> https://review.gerrithub.io/c/spdk/spdk/+/410071
> >>>>
> >>>> says:
> >>>>
> >>>> “We've had cases (especially with vhost) in the past where we have
> >>>> a valid vaddr but the backing page was not assigned yet.”.
> >>>>
> >>>> This refers to the vhost target, where shared memory is allocated by
> the
> >>>> QEMU process and the SPDK process maps this memory.
> >>>>
> >>>> Let’s consider this case. After mapping vhost-user memory regions,
> they
> >>>> are registered to the vtophys map. In case vfio is disabled,
> >>>> vtophys_get_paddr_pagemap() finds the corresponding physical
> >> addresses.
> >>>> These addresses must refer to pinned memory because vfio is not
> there
> >> to
> >>>> do the pinning. Therefore, VM’s memory has to be backed by
> hugepages.
> >>>> Hugepages are allocated by the QEMU process, way before vhost
> >> memory
> >>>> registration. After their allocation, hugepages will always have a
> >>>> backing page because they never get swapped out. So, I do not see any
> >>>> such case where backing page is not assigned yet and thus I do not see
> >>>> any need to touch the mapped page.
> >>>>
> >>>> This is my current understanding in brief and I'd welcome any feedback
> >>>> you may have:
> >>>>
> >>>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy
> >> because
> >>>>    the aligned address may not have a valid page table entry thus
> >>>>    triggering a segfault when being touched in
> >>>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
> >>>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
> >>>>    because VM’s memory has to be backed by hugepages and
> hugepages
> >> are
> >>>>    not handled by demand paging strategy and they are never swapped
> >> out.
> >>>> I am looking forward to your feedback.
> >>>>
> >>> Current start/end calculation in spdk_vhost_dev_mem_register() might
> >>> actually be a NOP for memory backed by hugepages.
> >> It seems so. However, there are other platforms that support hugepage
> >> sizes less than 2MB. I do not know if SPDK supports such platforms.
> > I think that currently only >=2MB HP are supported.
> >
> >>> I think that we can try to validate the alignment of the memory in
> >>> spdk_vhost_dev_mem_register()
> >>> and fail if it is not 2MB aligned.
> >> This sounds reasonable to me. However, I believe it would be better if
> >> we could support registering non-2MB aligned virtual addresses. Is this
> >> a WIP? I have found this commit:
> >>
> >> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
> >>
> >> It is not clear to me why the community has chosen 2MB granularity for
> >> the SPDK map tables.
> > SPDK vhost was created some time after the iSCSI and NVMf targets and it
> > needs to obey existing limitations. To be honest, vhost doesn't really need
> > to use huge pages, as this is the limitation of:
> >
> > 1. DMA - memory passed to DMA need to be:
> >    - pinned memory - can't be swapped, physical address can't change
> >    - contiguous (VFIO complicate this case)
> >   - virtual address must have an assigned huge page so SPDK can discover
> >      its physical address
> >
> > 2. env_dpdk/memory
> > this was implemented for NVMe drivers that have limitations that single
> > transaction can't span 2MB address boundary - PRP have this limitation
> > I don't know if SGLs overcome this. This also required from us to implement
> > this in vhost:
> > https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
> >
> > This is why 2MB granularity was chosen.
> 
> So, you are saying that vhost doesn’t really need to use huge pages. Are

It is possible for the SPDK vhost backend to be modified in a way that it won't
require hugepages. But again, when passing payload descriptors down to
physical devices the memory must be "good" for them. So if you use bdev_malloc
(without IOAT acceleration!) or bdev_aio as the backing device, the hugepage-
backed memory requirement disappears, as the host kernel will handle all page
faults for you. This is not true for other bdevs that use DMA, like nvme.

> you referring to SPDK’s memory? This would make sense. And I think, this
> is also true for the nvme and virtio-scsi bdev modules, which I am
> currently using. In these cases, the storage backend performs zero-copy
> DMA directly from VM’s huge page backed memory. Is this correct?

For virtio-scsi bdev it is (might be) correct but not for nvme (bdev_nvme?).

> 
> As far as VM’s memory is concerned, is it true that huge page backed
> memory is just a limitation of uio? Is it necessary to use huge page
> backed memory for the VM in case of vfio?
> 

This is a question that the VFIO kernel module developers could have an answer
for. But I bet $5 that it is NOT true. Let me write this again: memory
for DMA needs to be:

1. Pinned
2. vtophys(addr) translation needs to be possible during memory registration
3. vtophys(addr) must always return the same result for the same 'addr'

The kernel can do all of the above for any pages at any time, but in userspace only
hugepages guarantee all of this, so we are using them.

There is an interesting article here https://lwn.net/Articles/600502/ about DMA
and memory. Maybe it can describe it better than me :)
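
As a rough illustration of why user space leans on hugepages for this, a
process would typically map them explicitly and fault them in up front,
e.g. (a minimal sketch assuming anonymous 2MB x86 hugepages):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define HUGEPAGE_SZ (2UL * 1024 * 1024)

/*
 * Allocate one anonymous 2MB hugepage and fault it in immediately: a physical
 * hugepage gets assigned up front, and hugetlb pages are not swapped out, so
 * the vaddr -> paddr relationship stays stable afterwards.
 */
static void *alloc_hugepage(void)
{
    void *buf = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (buf == MAP_FAILED)
        return NULL;

    /* Touch the whole range so the kernel backs it right away. */
    memset(buf, 0, HUGEPAGE_SZ);
    return buf;
}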

> >
> >>> Have you hit any segfault there?
> >> Yes. I will give you a brief description.
> >>
> >> As I have already announced here:
> >>
> >> https://lists.01.org/pipermail/spdk/2018-October/002528.html
> >>
> >> I am currently working on an alternative vhost-user transport. I am
> >> shipping the SPDK vhost target into a dedicated storage appliance VM.
> >> Inspired by this post:
> >>
> >> https://wiki.qemu.org/Features/VirtioVhostUser
> >>
> >> I am using a dedicated virtio device called “virtio-vhost-user” to
> >> extend the vhost-user control plane. This device intercepts the
> >> vhost-user protocol messages from the unix domain socket on the host
> and
> >> inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
> >> from the unix socket, it maps the memory regions set by the master and
> >> exposes them to the slave guest as an MMIO PCI memory region.
> >>
> >> So, instead of mapping hugepage backed memory regions, the vhost
> target,
> >> running in slave guest user space, maps segments of an MMIO BAR of the
> >> virtio-vhost-user device.
> >>
> >> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
> >> The segfault is happening in a specific test case. That is when I do
> >> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
> >> “construct_vhost_scsi_controller”.
> >> In my code, this implies calling “spdk_pci_device_attach” ->
> >> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
> >> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” -
> >
> >> “rte_pci_map_device”.
> >>
> >> During the first map, the MMIO BAR is always mapped to a 2MB aligned
> >> address (btw I can’t explain this, it can’t be a coincidence).
> >> However, this is not the case after the second map. The result is that I
> >> get a segfault when I register this non-2MB aligned address.
> >>
> >> So, I am seeking for a solution. I think the best would be to support
> >> registering non-2MB aligned addresses. This would be useful in general,
> >> when you want to register an MMIO BAR, which is necessary in cases of
> >> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
> >> between NVMe SSDs in SPDK. I wonder how you manage the 2MB
> alignment
> >> restriction in that case.
> > Anything that you don't pass to DMA don't need to be 2MB aligned. If you
> > read/write this using CPU it don't need to be HP backed either.
> >
> > For DMA I think you will have to obey memory limitation I wrote above.
> >
> > Adding Darek, he can have some more (up to date) knowledge.
> 
> OK, let me get this a little bit more clear. The dataplane is unchanged.
> The vhost target passes all the received descriptor addresses to the
> underlying storage backend for DMA (after address translation and iovec
> splitting). What I did was just to change the way the vhost target
> accesses the VM’s memory.
> 
> The previous case was that the vhost target was running on the host and
> it mapped the master vhost memory regions sent over the unix socket.
> These memory regions relied on huge pages on the host physical memory.
> 
> The current case is that the vhost target is running inside a VM and
> needs to have access to the other VM’s memory lying on host hugetlbfs.
> Therefore, I use a special device called virtio-vhost-user, which maps
> the master vhost memory regions and exposes them to guest user space as
> an MMIO BAR. That’s how the vhost target has access to host hugetlbfs
> from guest user space.
> 
> So, the current case is that the storage backend (say an emulated NVMe
> controller) performs peer-to-peer DMA from this MMIO BAR. This requires
> that the vhost target has registered this BAR to the vtophys map. And
> here is the problem because spdk_mem_register() requires the address to
> be 2MB aligned but the MMIO BAR is not necessarily mapped to a 2MB
> aligned virtual address.
> 
> Currently, I am using a temporary solution. I am mapping all PCI BARs
> from all PCI devices to 2MB aligned virtual addresses. I think this is
> not going to trigger any implications, is it? The other solution, is to

This should be fine for 2MB huge pages. The mmap() might fail for hugepages larger than 2MB, though.

> modify the env_dpdk library in order to allow registering non-2MB
> aligned addresses. Darek, in case you are reading this, I would
> appreciate any feedback at this point. I think you are working on this.
> 
> >
> >> Last but not least, in case you may know, I would appreciate if you
> >> could give me a situation where page touching in
> >> vtophys_get_paddr_pagemap() here:
> >>
> >>
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>
> >> is necessary. Is this related to vhost exclusively? In case of vhost,
> >> the memory regions are backed by hugepages and these are not allocated
> >> on demand by the kernel. What am I missing?
> > When you mmap() huge page you are getting virtual address but actual
> > physical hugepage might not be assigned yet. We are touching each page
> > to force kernel to assign the huge page to virtual addrsss so we can
> discover
> > vtophys mmaping.
> >
> >>>> Thanks,
> >>>> Nikos
> >>>>
> >>>> _______________________________________________
> >>>> SPDK mailing list
> >>>> SPDK(a)lists.01.org
> >>>> https://lists.01.org/mailman/listinfo/spdk
> >>> _______________________________________________
> >>> SPDK mailing list
> >>> SPDK(a)lists.01.org
> >>> https://lists.01.org/mailman/listinfo/spdk
> >> _______________________________________________
> >> SPDK mailing list
> >> SPDK(a)lists.01.org
> >> https://lists.01.org/mailman/listinfo/spdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-22 18:52 Nikos Dragazis
  0 siblings, 0 replies; 10+ messages in thread
From: Nikos Dragazis @ 2018-11-22 18:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12008 bytes --]


On 12/11/18 1:48 p.m., Wodkowski, PawelX wrote:
>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
>> Sent: Saturday, November 10, 2018 3:37 AM
>> To: spdk(a)lists.01.org
>> Subject: Re: [SPDK] Questions about vhost memory registration
>>
>> On 8/11/18 10:45 π.μ., Wodkowski, PawelX wrote:
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
>> Dragazis
>>>> Sent: Thursday, November 8, 2018 1:49 AM
>>>> To: spdk(a)lists.01.org
>>>> Subject: [SPDK] Questions about vhost memory registration
>>>>
>>>> Hi all,
>>>>
>>>> I would like to raise a couple of questions about vhost target.
>>>>
>>>> My first question is:
>>>>
>>>> During vhost-user negotiation, the master sends its memory regions to
>>>> the slave. Slave maps each region in its own address space. The mmap
>>>> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
>>>> aligned. When vhost registers the memory regions in
>>>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
>>>> here:
>>> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require that
>> initiator
>>> pass memory backed by huge pages >= 2MB in size. On x86 MMU this imply
>>> that page alignment is the same as page size which is >= 2MB (99% sure -
>>> can someone confirm this to get this +1% ;) ).
>> Yes, you are probably right. I didn’t know how the kernel achieves
>> having a single page table entry for a contiguous 2MB virtual address
>> range. If I get this right, in case of x86_64, the answer is using a
>> page middle directory (PMD) entry pointing directly to a 2MB physical
>> page rather than to a lower-level page table. And since the PMDs are 2MB
>> aligned by definition, the resulting virtual address will be 2MB
>> aligned.
>>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
>>>>
>>>> The aligned addresses may not have a valid page table entry. So, in case
>>>> of uio, it is possible that during vtophys translation, the aligned
>>>> addresses are touched here:
>>>>
>>>>
>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>>> and this could lead to a segfault. Is this a possible scenario?
>>>>
>>>> My second question is:
>>>>
>>>> The commit message here:
>>>>
>>>> https://review.gerrithub.io/c/spdk/spdk/+/410071
>>>>
>>>> says:
>>>>
>>>> “We've had cases (especially with vhost) in the past where we have
>>>> a valid vaddr but the backing page was not assigned yet.”.
>>>>
>>>> This refers to the vhost target, where shared memory is allocated by the
>>>> QEMU process and the SPDK process maps this memory.
>>>>
>>>> Let’s consider this case. After mapping vhost-user memory regions, they
>>>> are registered to the vtophys map. In case vfio is disabled,
>>>> vtophys_get_paddr_pagemap() finds the corresponding physical
>> addresses.
>>>> These addresses must refer to pinned memory because vfio is not there
>> to
>>>> do the pinning. Therefore, VM’s memory has to be backed by hugepages.
>>>> Hugepages are allocated by the QEMU process, way before vhost
>> memory
>>>> registration. After their allocation, hugepages will always have a
>>>> backing page because they never get swapped out. So, I do not see any
>>>> such case where backing page is not assigned yet and thus I do not see
>>>> any need to touch the mapped page.
>>>>
>>>> This is my current understanding in brief and I'd welcome any feedback
>>>> you may have:
>>>>
>>>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy
>> because
>>>>    the aligned address may not have a valid page table entry thus
>>>>    triggering a segfault when being touched in
>>>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
>>>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
>>>>    because VM’s memory has to be backed by hugepages and hugepages
>> are
>>>>    not handled by demand paging strategy and they are never swapped
>> out.
>>>> I am looking forward to your feedback.
>>>>
>>> Current start/end calculation in spdk_vhost_dev_mem_register() might be
>> a actually
>>> NOP for memory backed by hugepages.
>> It seems so. However, there are other platforms that support hugepage
>> sizes less than 2MB. I do not know if SPDK supports such platforms.
> I think that currently only >=2MB HP are supported.
>
>>> I think that we can try to validate alignmet of the memory in
>> spdk_vhost_dev_mem_register()
>>> and fail if it is not 2MB aligned.
>> This sounds reasonable to me. However, I believe it would be better if
>> we could support registering non-2MB aligned virtual addresses. Is this
>> a WIP? I have found this commit:
>>
>> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
>>
>> It is not clear to me why the community has chosen 2MB granularity for
>> the SPDK map tables.
> SPKD vhost was created some time after iSCSI and NVMf targets and it
> needs to obey existing limitations. To be honest, vhost don't really need
> to use huge pages, as this is the limitation of:
>
> 1. DMA - memory passed to DMA need to be:
>    - pinned memory - can't be swapped, physical address can't change
>    - contiguous (VFIO complicate this case)
>   - virtual address must have assigned huge page so SPKD can discover
>      its physical address
>
> 2. env_dpdk/memory 
> this was implemented for NVMe drivers that have limitations that single
> transaction can't span 2MB address boundary - PRP have this limitation
> I don't know if SGLs overcome this. This also required from us to implement
> this in vhost:
> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
>
> This is why 2MB granularity was chosen.

So, you are saying that vhost doesn’t really need to use huge pages. Are
you referring to SPDK’s memory? This would make sense. And I think this
is also true for the nvme and virtio-scsi bdev modules, which I am
currently using. In these cases, the storage backend performs zero-copy
DMA directly from the VM’s huge page backed memory. Is this correct?

As far as the VM’s memory is concerned, is it true that huge page backed
memory is just a limitation of uio? Is it necessary to use huge page
backed memory for the VM in the case of vfio?

>
>>> Have you hit any segfault there?
>> Yes. I will give you a brief description.
>>
>> As I have already announced here:
>>
>> https://lists.01.org/pipermail/spdk/2018-October/002528.html
>>
>> I am currently working on an alternative vhost-user transport. I am
>> shipping the SPDK vhost target into a dedicated storage appliance VM.
>> Inspired by this post:
>>
>> https://wiki.qemu.org/Features/VirtioVhostUser
>>
>> I am using a dedicated virtio device called “virtio-vhost-user” to
>> extend the vhost-user control plane. This device intercepts the
>> vhost-user protocol messages from the unix domain socket on the host and
>> inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
>> from the unix socket, it maps the memory regions set by the master and
>> exposes them to the slave guest as an MMIO PCI memory region.
>>
>> So, instead of mapping hugepage backed memory regions, the vhost target,
>> running in slave guest user space, maps segments of an MMIO BAR of the
>> virtio-vhost-user device.
>>
>> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
>> The segfault is happening in a specific test case. That is when I do
>> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
>> “construct_vhost_scsi_controller”.
>> In my code, this implies calling “spdk_pci_device_attach” ->
>> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
>> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
>> “rte_pci_map_device”.
>>
>> During the first map, the MMIO BAR is always mapped to a 2MB aligned
>> address (btw I can’t explain this, it can’t be a coincidence).
>> However, this is not the case after the second map. The result is that I
>> get a segfault when I register this non-2MB aligned address.
>>
>> So, I am seeking for a solution. I think the best would be to support
>> registering non-2MB aligned addresses. This would be useful in general,
>> when you want to register an MMIO BAR, which is necessary in cases of
>> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
>> between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
>> restriction in that case.
> Anything that you don't pass to DMA don't need to be 2MB aligned. If you
> read/write this using CPU it don't need to be HP backed either.
>
> For DMA I think you will have to obey memory limitation I wrote above.
>
> Adding Darek, he can have some more (up to date) knowledge.

OK, let me get this a little bit more clear. The dataplane is unchanged.
The vhost target passes all the received descriptor addresses to the
underlying storage backend for DMA (after address translation and iovec
splitting). What I did was just to change the way the vhost target
accesses the VM’s memory.

The previous case was that the vhost target was running on the host and
it mapped the master vhost memory regions sent over the unix socket.
These memory regions relied on huge pages on the host physical memory.

The current case is that the vhost target is running inside a VM and
needs to have access to the other VM’s memory lying on host hugetlbfs.
Therefore, I use a special device called virtio-vhost-user, which maps
the master vhost memory regions and exposes them to guest user space as
an MMIO BAR. That’s how the vhost target has access to host hugetlbfs
from guest user space.

So, the current case is that the storage backend (say an emulated NVMe
controller) performs peer-to-peer DMA from this MMIO BAR. This requires
that the vhost target has registered this BAR to the vtophys map. And
here is the problem because spdk_mem_register() requires the address to
be 2MB aligned but the MMIO BAR is not necessarily mapped to a 2MB
aligned virtual address.

Currently, I am using a temporary workaround: I am mapping all PCI BARs
of all PCI devices to 2MB aligned virtual addresses (see the sketch
below). I don't think this will cause any problems, will it? The other
solution is to modify the env_dpdk library so that it allows registering
non-2MB aligned addresses. Darek, in case you are reading this, I would
appreciate any feedback at this point. I think you are working on this.
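
For reference, the workaround looks roughly like the sketch below. This
is only a simplified illustration of the idea, not my actual patch;
map_bar_2mb_aligned() and the bar_fd/bar_size parameters are hypothetical
names I'm using here for clarity:

#define _GNU_SOURCE
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define SIZE_2MB (2ULL * 1024 * 1024)

/* Over-reserve an anonymous range so a 2MB-aligned address is guaranteed
 * to exist inside it, then map the BAR there with MAP_FIXED. The unused
 * head/tail of the reservation is left mapped PROT_NONE for simplicity. */
static void *
map_bar_2mb_aligned(int bar_fd, size_t bar_size)
{
	size_t reserve = bar_size + SIZE_2MB;
	uint8_t *base;
	uintptr_t aligned;
	void *bar;

	base = mmap(NULL, reserve, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		return NULL;
	}

	/* Round the reservation start up to the next 2MB boundary. */
	aligned = ((uintptr_t)base + (SIZE_2MB - 1)) & ~(uintptr_t)(SIZE_2MB - 1);

	/* Map the BAR on top of the aligned part of the reservation. */
	bar = mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_FIXED, bar_fd, 0);
	return bar == MAP_FAILED ? NULL : bar;
}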

>
>> Last but not least, in case you may know, I would appreciate if you
>> could give me a situation where page touching in
>> vtophys_get_paddr_pagemap() here:
>>
>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>
>> is necessary. Is this related to vhost exclusively? In case of vhost,
>> the memory regions are backed by hugepages and these are not allocated
>> on demand by the kernel. What am I missing?
> When you mmap() huge page you are getting virtual address but actual
> physical hugepage might not be assigned yet. We are touching each page
> to force kernel to assign the huge page to virtual addrsss so we can discover
> vtophys mmaping.
>
>>>> Thanks,
>>>> Nikos
>>>>
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-12 11:48 Wodkowski, PawelX
  0 siblings, 0 replies; 10+ messages in thread
From: Wodkowski, PawelX @ 2018-11-12 11:48 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 9424 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Saturday, November 10, 2018 3:37 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] Questions about vhost memory registration
> 
> On 8/11/18 10:45 π.μ., Wodkowski, PawelX wrote:
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos
> Dragazis
> >> Sent: Thursday, November 8, 2018 1:49 AM
> >> To: spdk(a)lists.01.org
> >> Subject: [SPDK] Questions about vhost memory registration
> >>
> >> Hi all,
> >>
> >> I would like to raise a couple of questions about vhost target.
> >>
> >> My first question is:
> >>
> >> During vhost-user negotiation, the master sends its memory regions to
> >> the slave. Slave maps each region in its own address space. The mmap
> >> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
> >> aligned. When vhost registers the memory regions in
> >> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
> >> here:
> > Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require that
> initiator
> > pass memory backed by huge pages >= 2MB in size. On x86 MMU this imply
> > that page alignment is the same as page size which is >= 2MB (99% sure -
> > can someone confirm this to get this +1% ;) ).
> Yes, you are probably right. I didn’t know how the kernel achieves
> having a single page table entry for a contiguous 2MB virtual address
> range. If I get this right, in case of x86_64, the answer is using a
> page middle directory (PMD) entry pointing directly to a 2MB physical
> page rather than to a lower-level page table. And since the PMDs are 2MB
> aligned by definition, the resulting virtual address will be 2MB
> aligned.
> >> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
> >>
> >> The aligned addresses may not have a valid page table entry. So, in case
> >> of uio, it is possible that during vtophys translation, the aligned
> >> addresses are touched here:
> >>
> >>
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>
> >> and this could lead to a segfault. Is this a possible scenario?
> >>
> >> My second question is:
> >>
> >> The commit message here:
> >>
> >> https://review.gerrithub.io/c/spdk/spdk/+/410071
> >>
> >> says:
> >>
> >> “We've had cases (especially with vhost) in the past where we have
> >> a valid vaddr but the backing page was not assigned yet.”.
> >>
> >> This refers to the vhost target, where shared memory is allocated by the
> >> QEMU process and the SPDK process maps this memory.
> >>
> >> Let’s consider this case. After mapping vhost-user memory regions, they
> >> are registered to the vtophys map. In case vfio is disabled,
> >> vtophys_get_paddr_pagemap() finds the corresponding physical
> addresses.
> >> These addresses must refer to pinned memory because vfio is not there
> to
> >> do the pinning. Therefore, VM’s memory has to be backed by hugepages.
> >> Hugepages are allocated by the QEMU process, way before vhost
> memory
> >> registration. After their allocation, hugepages will always have a
> >> backing page because they never get swapped out. So, I do not see any
> >> such case where backing page is not assigned yet and thus I do not see
> >> any need to touch the mapped page.
> >>
> >> This is my current understanding in brief and I'd welcome any feedback
> >> you may have:
> >>
> >> 1. address alignment in spdk_vhost_dev_mem_register() is buggy
> because
> >>    the aligned address may not have a valid page table entry thus
> >>    triggering a segfault when being touched in
> >>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
> >> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
> >>    because VM’s memory has to be backed by hugepages and hugepages
> are
> >>    not handled by demand paging strategy and they are never swapped
> out.
> >>
> >> I am looking forward to your feedback.
> >>
> > Current start/end calculation in spdk_vhost_dev_mem_register() might be
> a actually
> > NOP for memory backed by hugepages.
> It seems so. However, there are other platforms that support hugepage
> sizes less than 2MB. I do not know if SPDK supports such platforms.

I think that currently only >=2MB huge pages are supported.

> > I think that we can try to validate alignmet of the memory in
> spdk_vhost_dev_mem_register()
> > and fail if it is not 2MB aligned.
> This sounds reasonable to me. However, I believe it would be better if
> we could support registering non-2MB aligned virtual addresses. Is this
> a WIP? I have found this commit:
> 
> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
> 
> It is not clear to me why the community has chosen 2MB granularity for
> the SPDK map tables.

SPDK vhost was created some time after the iSCSI and NVMf targets and it
needs to obey the existing limitations. To be honest, vhost doesn't really
need to use huge pages; the requirement comes from:

1. DMA - memory passed to DMA needs to be:
   - pinned - it can't be swapped out and its physical address can't change
   - contiguous (VFIO complicates this case)
   - backed by an assigned huge page at its virtual address, so that SPDK
     can discover its physical address

2. env_dpdk/memory
This was implemented for the NVMe drivers, which have the limitation that
a single transaction can't span a 2MB address boundary - PRPs have this
limitation; I don't know if SGLs overcome it. This is also why we had to
implement the boundary splitting in vhost (see the sketch below):
https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462

This is why 2MB granularity was chosen.
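
Just to illustrate the boundary issue - this is only a sketch of the
idea, not the actual vhost code - a guest buffer that crosses a 2MB
boundary has to be split into separate iovec entries, because each 2MB
chunk can translate to a different physical region:

#include <stdint.h>
#include <stddef.h>
#include <sys/uio.h>

#define SIZE_2MB (2ULL * 1024 * 1024)
#define MASK_2MB (SIZE_2MB - 1)

/* Split [vva, vva + len) at 2MB boundaries so that no iovec entry spans
 * more than one 2MB translation entry. Returns the number of entries
 * used, or -1 if max_iov is too small. */
static int
split_at_2mb(void *vva, size_t len, struct iovec *iov, int max_iov)
{
	uint8_t *p = vva;
	int n = 0;

	while (len > 0) {
		/* Bytes left until the next 2MB boundary. */
		size_t chunk = SIZE_2MB - ((uintptr_t)p & MASK_2MB);

		if (n == max_iov) {
			return -1;
		}
		if (chunk > len) {
			chunk = len;
		}
		iov[n].iov_base = p;
		iov[n].iov_len = chunk;
		n++;
		p += chunk;
		len -= chunk;
	}
	return n;
}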

> > Have you hit any segfault there?
> Yes. I will give you a brief description.
> 
> As I have already announced here:
> 
> https://lists.01.org/pipermail/spdk/2018-October/002528.html
> 
> I am currently working on an alternative vhost-user transport. I am
> shipping the SPDK vhost target into a dedicated storage appliance VM.
> Inspired by this post:
> 
> https://wiki.qemu.org/Features/VirtioVhostUser
> 
> I am using a dedicated virtio device called “virtio-vhost-user” to
> extend the vhost-user control plane. This device intercepts the
> vhost-user protocol messages from the unix domain socket on the host and
> inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
> from the unix socket, it maps the memory regions set by the master and
> exposes them to the slave guest as an MMIO PCI memory region.
> 
> So, instead of mapping hugepage backed memory regions, the vhost target,
> running in slave guest user space, maps segments of an MMIO BAR of the
> virtio-vhost-user device.
> 
> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
> The segfault is happening in a specific test case. That is when I do
> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
> “construct_vhost_scsi_controller”.
> In my code, this implies calling “spdk_pci_device_attach” ->
> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
> “rte_pci_map_device”.
> 
> During the first map, the MMIO BAR is always mapped to a 2MB aligned
> address (btw I can’t explain this, it can’t be a coincidence).
> However, this is not the case after the second map. The result is that I
> get a segfault when I register this non-2MB aligned address.
> 
> So, I am seeking for a solution. I think the best would be to support
> registering non-2MB aligned addresses. This would be useful in general,
> when you want to register an MMIO BAR, which is necessary in cases of
> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
> between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
> restriction in that case.

Anything that you don't pass to DMA doesn't need to be 2MB aligned. If you
only read/write it with the CPU, it doesn't need to be hugepage backed
either.

For DMA I think you will have to obey the memory limitations I wrote above.

Adding Darek, he may have some more (up to date) knowledge.

> 
> Last but not least, in case you may know, I would appreciate if you
> could give me a situation where page touching in
> vtophys_get_paddr_pagemap() here:
> 
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> 
> is necessary. Is this related to vhost exclusively? In case of vhost,
> the memory regions are backed by hugepages and these are not allocated
> on demand by the kernel. What am I missing?

When you mmap() a huge page you get a virtual address, but the actual
physical huge page might not be assigned yet. We touch each page to force
the kernel to assign the huge page to the virtual address, so that we can
discover the vtophys mapping.
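
Roughly, the idea is something like the sketch below - a simplified
illustration, not the exact vtophys.c code, and with error handling
omitted: touch the page so the kernel populates it, then read the
physical frame number from /proc/self/pagemap.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t
vaddr_to_paddr(void *vaddr)
{
	uintptr_t va = (uintptr_t)vaddr;
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	int fd;

	/* Touch the page so the kernel actually assigns a backing page. */
	volatile uint64_t tmp = *(volatile uint64_t *)va;
	(void)tmp;

	/* Each pagemap entry is 8 bytes; bit 63 = present, bits 0-54 = PFN. */
	fd = open("/proc/self/pagemap", O_RDONLY);
	pread(fd, &entry, sizeof(entry), (va / page_size) * sizeof(entry));
	close(fd);

	if (!(entry & (1ULL << 63))) {
		return UINT64_MAX; /* page not present */
	}
	return (entry & ((1ULL << 55) - 1)) * page_size + (va & (page_size - 1));
}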

> >> Thanks,
> >> Nikos
> >>
> >> _______________________________________________
> >> SPDK mailing list
> >> SPDK(a)lists.01.org
> >> https://lists.01.org/mailman/listinfo/spdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-10  2:36 Nikos Dragazis
  0 siblings, 0 replies; 10+ messages in thread
From: Nikos Dragazis @ 2018-11-10  2:36 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7274 bytes --]

On 8/11/18 10:45 a.m., Wodkowski, PawelX wrote:
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
>> Sent: Thursday, November 8, 2018 1:49 AM
>> To: spdk(a)lists.01.org
>> Subject: [SPDK] Questions about vhost memory registration
>>
>> Hi all,
>>
>> I would like to raise a couple of questions about vhost target.
>>
>> My first question is:
>>
>> During vhost-user negotiation, the master sends its memory regions to
>> the slave. Slave maps each region in its own address space. The mmap
>> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
>> aligned. When vhost registers the memory regions in
>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
>> here:
> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost require that initiator
> pass memory backed by huge pages >= 2MB in size. On x86 MMU this imply
> that page alignment is the same as page size which is >= 2MB (99% sure -
> can someone confirm this to get this +1% ;) ).
Yes, you are probably right. I didn’t know how the kernel achieves
having a single page table entry for a contiguous 2MB virtual address
range. If I get this right, in the case of x86_64, the answer is that a
page middle directory (PMD) entry points directly to a 2MB physical
page rather than to a lower-level page table. And since the regions
mapped by PMD entries are 2MB aligned by definition, the resulting
virtual address will be 2MB aligned.
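
As a sanity check, a toy snippet along these lines (nothing
SPDK-specific; it assumes 2MB hugepages are preallocated on the system)
seems to confirm that a 2MB-hugepage mmap() always comes back 2MB
aligned:

#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

#define SIZE_2MB (2UL * 1024 * 1024)

int main(void)
{
	/* Map a single anonymous 2MB huge page and check the alignment
	 * of the returned virtual address. */
	void *p = mmap(NULL, SIZE_2MB, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED) {
		assert(((uintptr_t)p & (SIZE_2MB - 1)) == 0);
		munmap(p, SIZE_2MB);
	}
	return 0;
}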
>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
>>
>> The aligned addresses may not have a valid page table entry. So, in case
>> of uio, it is possible that during vtophys translation, the aligned
>> addresses are touched here:
>>
>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>
>> and this could lead to a segfault. Is this a possible scenario?
>>
>> My second question is:
>>
>> The commit message here:
>>
>> https://review.gerrithub.io/c/spdk/spdk/+/410071
>>
>> says:
>>
>> “We've had cases (especially with vhost) in the past where we have
>> a valid vaddr but the backing page was not assigned yet.”.
>>
>> This refers to the vhost target, where shared memory is allocated by the
>> QEMU process and the SPDK process maps this memory.
>>
>> Let’s consider this case. After mapping vhost-user memory regions, they
>> are registered to the vtophys map. In case vfio is disabled,
>> vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
>> These addresses must refer to pinned memory because vfio is not there to
>> do the pinning. Therefore, VM’s memory has to be backed by hugepages.
>> Hugepages are allocated by the QEMU process, way before vhost memory
>> registration. After their allocation, hugepages will always have a
>> backing page because they never get swapped out. So, I do not see any
>> such case where backing page is not assigned yet and thus I do not see
>> any need to touch the mapped page.
>>
>> This is my current understanding in brief and I'd welcome any feedback
>> you may have:
>>
>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy because
>>    the aligned address may not have a valid page table entry thus
>>    triggering a segfault when being touched in
>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
>>    because VM’s memory has to be backed by hugepages and hugepages are
>>    not handled by demand paging strategy and they are never swapped out.
>>
>> I am looking forward to your feedback.
>>
> Current start/end calculation in spdk_vhost_dev_mem_register() might be a actually
> NOP for memory backed by hugepages.
It seems so. However, there are other platforms that support hugepage
sizes less than 2MB. I do not know if SPDK supports such platforms.
> I think that we can try to validate alignmet of the memory in spdk_vhost_dev_mem_register()
> and fail if it is not 2MB aligned.
This sounds reasonable to me. However, I believe it would be better if
we could support registering non-2MB aligned virtual addresses. Is this
a WIP? I have found this commit:

https://review.gerrithub.io/c/spdk/spdk/+/427816/1

It is not clear to me why the community has chosen 2MB granularity for
the SPDK map tables.
> Have you hit any segfault there?
Yes. I will give you a brief description.

As I have already announced here:

https://lists.01.org/pipermail/spdk/2018-October/002528.html

I am currently working on an alternative vhost-user transport. I am
shipping the SPDK vhost target into a dedicated storage appliance VM.
Inspired by this post:

https://wiki.qemu.org/Features/VirtioVhostUser

I am using a dedicated virtio device called “virtio-vhost-user” to
extend the vhost-user control plane. This device intercepts the
vhost-user protocol messages from the unix domain socket on the host and
inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
from the unix socket, it maps the memory regions set by the master and
exposes them to the slave guest as an MMIO PCI memory region.

So, instead of mapping hugepage backed memory regions, the vhost target,
running in slave guest user space, maps segments of an MMIO BAR of the
virtio-vhost-user device.

Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
The segfault is happening in a specific test case. That is when I do
“construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
“construct_vhost_scsi_controller”.
In my code, this implies calling “spdk_pci_device_attach” ->
“spdk_pci_device_detach” -> “spdk_pci_device_attach”
which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
“rte_pci_map_device”.

During the first map, the MMIO BAR is always mapped to a 2MB aligned
address (btw I can’t explain this, it can’t be a coincidence).
However, this is not the case after the second map. The result is that I
get a segfault when I register this non-2MB aligned address.

So, I am looking for a solution. I think the best one would be to support
registering non-2MB aligned addresses. This would be useful in general
when you want to register an MMIO BAR, which is necessary for
peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
restriction in that case.

Last but not least, if you happen to know, I would appreciate it if you
could give me a situation where the page touching in
vtophys_get_paddr_pagemap() here:

https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287

is necessary. Is this related to vhost exclusively? In case of vhost,
the memory regions are backed by hugepages and these are not allocated
on demand by the kernel. What am I missing?
>> Thanks,
>> Nikos
>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [SPDK] Questions about vhost memory registration
@ 2018-11-08  8:45 Wodkowski, PawelX
  0 siblings, 0 replies; 10+ messages in thread
From: Wodkowski, PawelX @ 2018-11-08  8:45 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3596 bytes --]



> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Thursday, November 8, 2018 1:49 AM
> To: spdk(a)lists.01.org
> Subject: [SPDK] Questions about vhost memory registration
> 
> Hi all,
> 
> I would like to raise a couple of questions about vhost target.
> 
> My first question is:
> 
> During vhost-user negotiation, the master sends its memory regions to
> the slave. Slave maps each region in its own address space. The mmap
> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
> aligned. When vhost registers the memory regions in
> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
> here:

Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost requires that
the initiator pass memory backed by huge pages >= 2MB in size. On the x86
MMU this implies that the page alignment is the same as the page size, which
is >= 2MB (99% sure - can someone confirm this to get the remaining +1% ;) ).

> 
> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
> 
> The aligned addresses may not have a valid page table entry. So, in case
> of uio, it is possible that during vtophys translation, the aligned
> addresses are touched here:
> 
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> 
> and this could lead to a segfault. Is this a possible scenario?
> 
> My second question is:
> 
> The commit message here:
> 
> https://review.gerrithub.io/c/spdk/spdk/+/410071
> 
> says:
> 
> “We've had cases (especially with vhost) in the past where we have
> a valid vaddr but the backing page was not assigned yet.”.
> 
> This refers to the vhost target, where shared memory is allocated by the
> QEMU process and the SPDK process maps this memory.
> 
> Let’s consider this case. After mapping vhost-user memory regions, they
> are registered to the vtophys map. In case vfio is disabled,
> vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
> These addresses must refer to pinned memory because vfio is not there to
> do the pinning. Therefore, VM’s memory has to be backed by hugepages.
> Hugepages are allocated by the QEMU process, way before vhost memory
> registration. After their allocation, hugepages will always have a
> backing page because they never get swapped out. So, I do not see any
> such case where backing page is not assigned yet and thus I do not see
> any need to touch the mapped page.
> 
> This is my current understanding in brief and I'd welcome any feedback
> you may have:
> 
> 1. address alignment in spdk_vhost_dev_mem_register() is buggy because
>    the aligned address may not have a valid page table entry thus
>    triggering a segfault when being touched in
>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
>    because VM’s memory has to be backed by hugepages and hugepages are
>    not handled by demand paging strategy and they are never swapped out.
> 
> I am looking forward to your feedback.
> 

The current start/end calculation in spdk_vhost_dev_mem_register() might
actually be a NOP for memory backed by hugepages.

I think that we can try to validate the alignment of the memory in
spdk_vhost_dev_mem_register() and fail if it is not 2MB aligned.
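
Something along these lines (just a sketch of what I have in mind, not a
tested patch; the function name is made up) should be enough:

#include <stdint.h>

#define SIZE_2MB (2ULL * 1024 * 1024)
#define MASK_2MB (SIZE_2MB - 1)

/* Reject regions whose mapped address or size is not 2MB aligned instead
 * of silently rounding them in spdk_vhost_dev_mem_register(). */
static int
vhost_validate_mem_region(uint64_t mmap_addr, uint64_t mmap_size)
{
	if ((mmap_addr & MASK_2MB) != 0 || (mmap_size & MASK_2MB) != 0) {
		return -1; /* not backed by >= 2MB hugepages */
	}
	return 0;
}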

Have you hit any segfault there?

> Thanks,
> Nikos
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-12-03  8:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08  0:49 [SPDK] Questions about vhost memory registration Nikos Dragazis
2018-11-08  8:45 Wodkowski, PawelX
2018-11-10  2:36 Nikos Dragazis
2018-11-12 11:48 Wodkowski, PawelX
2018-11-22 18:52 Nikos Dragazis
2018-11-23  8:27 Wodkowski, PawelX
2018-11-28 23:24 Nikos Dragazis
2018-11-29  9:22 Wodkowski, PawelX
2018-11-30 18:00 Nikos Dragazis
2018-12-03  8:19 Wodkowski, PawelX
