All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wodkowski, PawelX <pawelx.wodkowski@intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] Questions about vhost memory registration
Date: Mon, 12 Nov 2018 11:48:39 +0000
Message-ID: <F6F2A6264E145F47A18AB6DF8E87425D7037A431@IRSMSX102.ger.corp.intel.com>
In-Reply-To: AM6PR07MB386465E4DFBD489DD353049E98C70@AM6PR07MB3864.eurprd07.prod.outlook.com

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Saturday, November 10, 2018 3:37 AM
> To: spdk@lists.01.org
> Subject: Re: [SPDK] Questions about vhost memory registration
> 
> On 8/11/18 10:45 a.m., Wodkowski, PawelX wrote:
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Nikos Dragazis
> >> Sent: Thursday, November 8, 2018 1:49 AM
> >> To: spdk@lists.01.org
> >> Subject: [SPDK] Questions about vhost memory registration
> >>
> >> Hi all,
> >>
> >> I would like to raise a couple of questions about vhost target.
> >>
> >> My first question is:
> >>
> >> During vhost-user negotiation, the master sends its memory regions to
> >> the slave. The slave maps each region into its own address space. The mmap
> >> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
> >> aligned. When vhost registers the memory regions in
> >> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
> >> here:
> > Yes, page aligned, but the page size here is not PAGE_SIZE (4k). SPDK vhost
> > requires that the initiator pass memory backed by huge pages >= 2MB in size.
> > On the x86 MMU this implies that the mapping alignment is the same as the
> > page size, which is >= 2MB (99% sure - can someone confirm this to get the
> > remaining 1%? ;) ).
> Yes, you are probably right. I didn’t know how the kernel achieves
> having a single page table entry for a contiguous 2MB virtual address
> range. If I get this right, in the case of x86_64, the answer is a
> page middle directory (PMD) entry pointing directly to a 2MB physical
> page rather than to a lower-level page table. And since the range covered
> by a PMD entry is 2MB aligned by definition, the resulting virtual address
> will be 2MB aligned.
> >> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
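
For reference, the rounding done there is roughly the following (a sketch,
not the exact SPDK code; I am assuming the rte_vhost region layout with
mmap_addr/mmap_size fields):

    #define VALUE_2MB    (2ULL * 1024 * 1024)
    #define FLOOR_2MB(x) ((uintptr_t)(x) & ~(VALUE_2MB - 1))
    #define CEIL_2MB(x)  (((uintptr_t)(x) + VALUE_2MB - 1) & ~(VALUE_2MB - 1))

    /* Round the mmap'd region outward to 2MB boundaries before registering
     * it in the vtophys map. */
    uintptr_t start = FLOOR_2MB(region->mmap_addr);
    uintptr_t end   = CEIL_2MB((uintptr_t)region->mmap_addr + region->mmap_size);

    spdk_mem_register((void *)start, end - start);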
> >>
> >> The aligned addresses may not have a valid page table entry. So, in the
> >> case of uio, it is possible that during vtophys translation the aligned
> >> addresses are touched here:
> >>
> >> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> >>
> >> and this could lead to a segfault. Is this a possible scenario?
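
To make that scenario concrete, a hypothetical stand-alone program (not SPDK
code) that reproduces the failure mode might look like this:

    #include <stdint.h>
    #include <sys/mman.h>

    #define FLOOR_2MB(x) ((uintptr_t)(x) & ~((uintptr_t)(2 * 1024 * 1024) - 1))

    int main(void)
    {
        /* 4KB anonymous mapping: page aligned, but usually not 2MB aligned. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        volatile uint64_t *aligned;

        if (p == MAP_FAILED) {
            return 1;
        }
        aligned = (volatile uint64_t *)FLOOR_2MB(p);
        if ((uintptr_t)aligned != (uintptr_t)p) {
            /* The same kind of read vtophys does when it touches a page;
             * here the 2MB-floored address has no valid mapping, so this
             * is expected to raise SIGSEGV. */
            return (int)*aligned;
        }
        return 0;
    }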
> >>
> >> My second question is:
> >>
> >> The commit message here:
> >>
> >> https://review.gerrithub.io/c/spdk/spdk/+/410071
> >>
> >> says:
> >>
> >> “We've had cases (especially with vhost) in the past where we have
> >> a valid vaddr but the backing page was not assigned yet.”.
> >>
> >> This refers to the vhost target, where shared memory is allocated by the
> >> QEMU process and the SPDK process maps this memory.
> >>
> >> Let’s consider this case. After mapping the vhost-user memory regions,
> >> they are registered in the vtophys map. In case vfio is disabled,
> >> vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
> >> These addresses must refer to pinned memory because vfio is not there
> >> to do the pinning. Therefore, the VM’s memory has to be backed by
> >> hugepages. Hugepages are allocated by the QEMU process, way before vhost
> >> memory registration. After their allocation, hugepages will always have a
> >> backing page because they never get swapped out. So, I do not see any
> >> case where the backing page is not assigned yet, and thus I do not see
> >> any need to touch the mapped page.
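
For reference, the pagemap lookup being discussed boils down to roughly this
(a sketch in the spirit of vtophys_get_paddr_pagemap(), not the actual code;
reading real PFNs from /proc/self/pagemap requires root/CAP_SYS_ADMIN on
recent kernels):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */
    #define PAGEMAP_PRESENT  (1ULL << 63)        /* bit 63: page present in RAM */

    static uint64_t vaddr_to_paddr(uint64_t vaddr)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0) {
            return UINT64_MAX;
        }
        /* One 8-byte entry per 4KB page, indexed by virtual page number. */
        if (pread(fd, &entry, sizeof(entry),
                  (off_t)(vaddr / page_size * sizeof(entry))) != sizeof(entry)) {
            close(fd);
            return UINT64_MAX;
        }
        close(fd);
        if (!(entry & PAGEMAP_PRESENT)) {
            /* No backing page assigned yet - exactly the case the page touch
             * in vtophys is meant to rule out. */
            return UINT64_MAX;
        }
        return (entry & PAGEMAP_PFN_MASK) * page_size + (vaddr & (page_size - 1));
    }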
> >>
> >> This is my current understanding in brief and I'd welcome any feedback
> >> you may have:
> >>
> >> 1. address alignment in spdk_vhost_dev_mem_register() is buggy because
> >>    the aligned address may not have a valid page table entry, thus
> >>    triggering a segfault when it is touched in
> >>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
> >> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
> >>    because the VM’s memory has to be backed by hugepages, and hugepages
> >>    are not subject to demand paging and are never swapped out.
> >>
> >> I am looking forward to your feedback.
> >>
> > The current start/end calculation in spdk_vhost_dev_mem_register() might
> > actually be a NOP for memory backed by hugepages.
> It seems so. However, there are other platforms that support hugepage
> sizes less than 2MB. I do not know if SPDK supports such platforms.

I think that currently only >= 2MB hugepages are supported.

> > I think that we can try to validate the alignment of the memory in
> > spdk_vhost_dev_mem_register() and fail if it is not 2MB aligned.
> This sounds reasonable to me. However, I believe it would be better if
> we could support registering non-2MB aligned virtual addresses. Is this
> a WIP? I have found this commit:
> 
> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
> 
> It is not clear to me why the community has chosen 2MB granularity for
> the SPDK map tables.

SPDK vhost was created some time after the iSCSI and NVMf targets and it
needs to obey their existing limitations. To be honest, vhost doesn't really
need huge pages itself; the requirement comes from:

1. DMA - memory passed to DMA needs to be:
   - pinned - it can't be swapped out and its physical address can't change
   - physically contiguous (VFIO complicates this case)
   - backed by an assigned huge page at its virtual address so SPDK can
     discover its physical address

2. env_dpdk/memory - this was implemented for the NVMe driver, which has the
limitation that a single transaction can't span a 2MB address boundary
(PRP lists have this limitation; I don't know if SGLs overcome it). This also
required us to handle it in vhost (see the sketch below):
https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
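
For illustration only (this is not the SPDK vhost code, and handle_chunk() is
a hypothetical placeholder), handling that limitation means never letting a
single element cross a 2MB boundary:

    #include <stdint.h>

    #define VALUE_2MB (2ULL * 1024 * 1024)

    /* Hypothetical per-chunk handler (translate gpa and submit the element). */
    static void handle_chunk(uint64_t gpa, uint64_t len);

    /* Process [gpa, gpa + len) in pieces so that no piece crosses a 2MB
     * boundary; each piece can then be translated independently. */
    static void submit_in_2mb_chunks(uint64_t gpa, uint64_t len)
    {
        while (len > 0) {
            uint64_t to_boundary = VALUE_2MB - (gpa & (VALUE_2MB - 1));
            uint64_t chunk = len < to_boundary ? len : to_boundary;

            handle_chunk(gpa, chunk);

            gpa += chunk;
            len -= chunk;
        }
    }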

This is why 2MB granularity was chosen.
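
The check I suggested above could be as simple as this (a sketch only, reusing
VALUE_2MB from the previous snippet and again assuming the rte_vhost region
layout; it is not in SPDK today):

    /* Reject regions whose mmap address or size is not 2MB aligned. */
    if (((uintptr_t)region->mmap_addr & (VALUE_2MB - 1)) != 0 ||
        (region->mmap_size & (VALUE_2MB - 1)) != 0) {
        SPDK_ERRLOG("vhost memory region is not 2MB aligned - "
                    "is it backed by >= 2MB hugepages?\n");
        return -1;
    }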

> > Have you hit any segfault there?
> Yes. I will give you a brief description.
> 
> As I have already announced here:
> 
> https://lists.01.org/pipermail/spdk/2018-October/002528.html
> 
> I am currently working on an alternative vhost-user transport. I am
> shipping the SPDK vhost target into a dedicated storage appliance VM.
> Inspired by this post:
> 
> https://wiki.qemu.org/Features/VirtioVhostUser
> 
> I am using a dedicated virtio device called “virtio-vhost-user” to
> extend the vhost-user control plane. This device intercepts the
> vhost-user protocol messages from the unix domain socket on the host and
> inserts them into a virtqueue. When a SET_MEM_TABLE message arrives
> from the unix socket, it maps the memory regions set by the master and
> exposes them to the slave guest as an MMIO PCI memory region.
> 
> So, instead of mapping hugepage backed memory regions, the vhost target,
> running in slave guest user space, maps segments of an MMIO BAR of the
> virtio-vhost-user device.
> 
> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
> The segfault is happening in a specific test case. That is when I do
> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
> “construct_vhost_scsi_controller”.
> In my code, this implies calling “spdk_pci_device_attach” ->
> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
> “rte_pci_map_device”.
> 
> During the first map, the MMIO BAR is always mapped to a 2MB aligned
> address (btw I can’t explain this, it can’t be a coincidence).
> However, this is not the case after the second map. The result is that I
> get a segfault when I register this non-2MB aligned address.
> 
> So, I am seeking a solution. I think the best option would be to support
> registering non-2MB aligned addresses. This would be useful in general,
> when you want to register an MMIO BAR, which is necessary in cases of
> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
> between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
> restriction in that case.

Anything that you don't pass to DMA doesn't need to be 2MB aligned. If you
only read/write it using the CPU, it doesn't need to be hugepage backed either.

For DMA I think you will have to obey the memory limitations I wrote about above.

Adding Darek, he may have some more (up-to-date) knowledge.

> 
> Last but not least, if you happen to know, I would appreciate it if you
> could give me a situation where the page touching in
> vtophys_get_paddr_pagemap() here:
> 
> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
> 
> is necessary. Is this related to vhost exclusively? In case of vhost,
> the memory regions are backed by hugepages and these are not allocated
> on demand by the kernel. What am I missing?

When you mmap() a huge page you get a virtual address, but the actual
physical hugepage might not be assigned yet. We touch each page to force
the kernel to assign the huge page to the virtual address so we can
discover the vtophys mapping.
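
In code, the idea boils down to one dummy read per 2MB page, roughly like this
(a sketch, not the exact vtophys code; it assumes vaddr is already 2MB aligned,
as it is after the rounding discussed earlier):

    #include <stdint.h>

    #define VALUE_2MB (2ULL * 1024 * 1024)

    /* Force the kernel to back every 2MB page of [vaddr, vaddr + len) with a
     * real hugepage; the value read is discarded, only the page fault matters. */
    static void touch_2mb_pages(void *vaddr, uint64_t len)
    {
        uint8_t *p = vaddr;
        uint8_t *end = p + len;

        for (; p < end; p += VALUE_2MB) {
            (void)*(volatile uint8_t *)p;
        }
    }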

> >> Thanks,
> >> Nikos
