From: Steven Sistare <steven.sistare@oracle.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Alex Bennée" <alex.bennee@linaro.org>,
"Juan Quintela" <quintela@redhat.com>,
qemu-devel@nongnu.org,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Anthony Yznaga" <anthony.yznaga@oracle.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
Date: Mon, 19 Oct 2020 12:33:59 -0400
Message-ID: <095c1adf-d027-6e2e-4afc-2f7a69960db3@oracle.com>
In-Reply-To: <20201015143654.166e0639@w520.home>

On 10/15/2020 4:36 PM, Alex Williamson wrote:
> On Thu, 8 Oct 2020 12:32:35 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>> On 8/24/2020 6:30 PM, Alex Williamson wrote:
>>> On Wed, 19 Aug 2020 17:52:26 -0400
>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>> On 8/17/2020 10:42 PM, Alex Williamson wrote:
>>>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:
>>>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>
>>>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>> include/qemu/osdep.h | 7 +++++++
>>>>>>>>>> 1 file changed, 7 insertions(+)
>>>>>>>>>
>>>>>>>>> Hi Alex,
>>>>>>>>> The MADV_DOEXEC functionality, which is a prerequisite for the entire qemu
>>>>>>>>> live update series, is getting a chilly reception on lkml. We could instead
>>>>>>>>> create guest memory using memfd_create and preserve the fd across exec. However,
>>>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously,
>>>>>>>>> which is a problem for memory that was registered with vfio, as the original VA
>>>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions,
>>>>>>>>> such as vfio_iommu_replay.
>>>>>>>>>
>>>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>>>> new_vaddr. The implementation finds an exact match for (iova, size) and replaces
>>>>>>>>> vaddr with new_vaddr. Flags cannot be changed.
>>>>>>>>>
>>>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>>>> existing MAP_DMA ioctl for this variant.
>>>>>>>
>>>>>>> Sounds good.
>>>>>>>
>>>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>>>> MAP_DMA though. Originally this was used to test for virtually
>>>>>>>> contiguous mappings for merging and splitting purposes. This is
>>>>>>>> defunct in the v2 interface; however, the vaddr is now used largely for
>>>>>>>> mdev devices. If an mdev device is not backed by an IOMMU device and
>>>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>>>> container. The mdev vendor driver can then later either request pages
>>>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>>>> simulate DMA via the CPU.
>>>>>>>>
>>>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>>>> IOMMU backend that could drop vaddr use.
>>>>>>>
>>>>>>> Yes. I did not explain on lkml as you do here (thanks), but I reached the
>>>>>>> same conclusion.
>>>>>>>
>>>>>>>> I'm a bit concerned this new
>>>>>>>> remap proposal also raises the question of how we prevent userspace
>>>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>>>> vaddrs.
>>>>>>>
>>>>>>> Agreed. After a quick glance at the code, holding iommu->lock during
>>>>>>> remap might be sufficient, but it needs more study.
>>>>>>
>>>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>>>> becoming invalid and the remap is unprotected.
>>>>
>>>> OK. What if we exclude mediated devices? It appears they are the only
>>>> ones where the kernel may asynchronously use the vaddr, via call chains ending in
>>>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>>>
>>>> The other functions that use dma->vaddr are
>>>> vfio_dma_do_map
>>>> vfio_pin_map_dma
>>>> vfio_iommu_replay
>>>> vfio_pin_pages_remote
>>>> and they are all initiated via userland ioctl (if I have traced all the code
>>>> paths correctly). Thus iommu->lock would protect them.
>>>>
>>>> We would block live update in qemu if the config includes a mediated device.
>>>>
>>>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.
>>>
>>> That's not a solution I'd really be in favor of. We're eliminating an
>>> entire class of devices because they _might_ make use of these
>>> interfaces, but anyone can add a vfio bus driver, even exposing the
>>> same device API, and maybe make use of some of these interfaces in that
>>> driver. Maybe we'd even have reason to do it in vfio-pci if we had
>>> reason to virtualize some aspect of a device. I think we're setting
>>> ourselves up for a very complicated support scenario if we just
>>> arbitrarily decide to deny drivers using certain interfaces.
>>>
>>>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
>>>>>>>
>>>>>>> No. We want to support any guest, and the guest is not aware that qemu
>>>>>>> live update is occurring.
>>>>>>>
>>>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>>>> stop a device, would we need to use this here? If we do have a
>>>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>>>> everything and remapping it into the new address space the latency in
>>>>>>>> performing that operation? Thanks,
>>>>>>>
>>>>>>> Same answer - we don't require that the guest has vfio migration support.
>>>>>>
>>>>>> QEMU toggling the runstate of the device via the vfio migration
>>>>>> interface could be done transparently to the guest, but if your
>>>>>> intention is to support any device (where none currently support the
>>>>>> migration interface) perhaps it's a moot point.
>>>>
>>>> That sounds useful when devices support it. Can you give me some function names
>>>> or references so I can study this qemu-based "vfio migration interface"?
>>>
>>> The uAPI is documented in commit a8a24f3f6e38. We're still waiting on
>>> the QEMU support or implementation in an mdev vendor driver.
>>> Essentially migration exposes a new region of the device which would be
>>> implemented by the vendor driver. A register within that region
>>> manipulates the device state, so a device could be stopped by clearing
>>> the 'run' bit in that register.
>>>
>>>>>> It seems like this
>>>>>> scheme only works with IOMMU backed devices where the device can
>>>>>> continue to operate against pinned pages, anything that might need to
>>>>>> dynamically pin pages against the process vaddr as it's running async
>>>>>> to the QEMU re-exec needs to be blocked or stopped. Thanks,
>>>>
>>>> Yes, true of this remap proposal.
>>>>
>>>> I wanted to unconditionally support all devices, which is why I think that
>>>> MADV_DOEXEC is a nifty solution. If you agree, please add your voice to the
>>>> lkml discussion.
>>
>> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
>> from mediated device drivers.
>>
>> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
>> VFIO_IOMMU_MAP_DMA.
>>
>> VFIO_IOMMU_UNMAP_DMA flags=VFIO_DMA_MAP_FLAG_REMAP
>> Discard vaddr on the existing dma region defined by (iova, size), but keep the
>> struct vfio_dma. Subsequent translation requests are blocked.
>> The implementation sets a flag in struct vfio_dma. vfio_pin_pages() and
>> vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>> Called before exec.
>>
>> VFIO_IOMMU_MAP_DMA flags=VFIO_DMA_MAP_FLAG_REMAP
>> Remap the region (iova, size) to vaddr, and resume translation requests.
>> Called after exec.
>>
>> Unfortunately, remap as defined above has an undesirable side effect. The mdev
>> driver may use kernel worker threads which serve requests from multiple clients
>> (eg i915/gvt workload_thread). A process that fails to call MAP_DMA with REMAP,
>> or is tardy doing so, will delay other processes who are stuck waiting in
>> vfio_pin_pages or vfio_dma_rw. This is unacceptable, and I mention this scheme in
>> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
>> instance?), or in case you see a way to salvage it.
>
> Right, that's my first thought when I hear that the pin and dma_rw paths
> are blocked as well, we cannot rely on userspace to unblock anything.
> A malicious user may hold out just to see how long until the host
> becomes unusable. Userspace determines how many groups share a
> vfio_iommu.
>
>> Here is a more robust implementation. It only works for dma regions backed by
>> a file, such as memfd or shm.
>>
>> VFIO_IOMMU_UNMAP_DMA flags=VFIO_DMA_MAP_FLAG_REMAP
>> Find the file and offset for iova, and save the struct file pointer in
>> struct vfio_dma. In vfio_pin_pages and vfio_dma_rw and their descendants,
>> if file* is set, then call pagecache_get_page() to get the pfn, instead of
>> get_user_pages.
>>
>> VFIO_IOMMU_MAP_DMA flags=VFIO_DMA_MAP_FLAG_REMAP
>> Remap the region (iova, size) to vaddr and drop the file reference.
>>
>> This raises the question of whether we can always use pagecache_get_page, and
>> eliminate the dependency on vaddr. The translation performance could be
>> different, though.
>>
>> I have not implemented this yet. Any thoughts before I do?
>
> That's a pretty hefty usage restriction, but what I take from it is
> that these are mechanisms which provide a fallback lookup path that can
> service callers in the interim during the gap of the range being
> remapped. The callers are always providing an IOVA and wishing to do
> something to the memory referenced by that IOVA, we just need a
> translation mechanism. The IOMMU itself is also such an alternative
> lookup, via iommu_iova_to_phys(), but of course requiring an
> IOMMU-backed device is just another usage restriction, potentially one
> that's not even apparent to the user.
>
> Is a more general solution to make sure there's always an IOVA-to-phys
> lookup mechanism available, implementing one if not provided by the
> IOMMU or memory backing interface? We'd need to adapt the dma_rw
> interface to work on either a VA or PA, and pinning pages on
> UNMAP+REMAP, plus stashing them in a translation structure, plus
> dynamically adapting to changes (ex. the IOMMU backed device being
> removed, leaving a non-IOMMU backed device in the vfio_iommu) all
> sounds pretty complicated, especially as the vfio-iommu-type1 backend
> becomes stretched to be more and more fragile. Possibly it's still
> feasible though. Thanks,

Requiring file-backed memory is not a restriction in practice, because there is
no way to preserve MAP_ANON memory across exec and map it into the new qemu process.
That is what MADV_DOEXEC would have provided. Without it, one cannot do live
update with MAP_ANON memory.

For qemu, when allocating anonymous memory for guest memory regions, we would modify
the allocation functions to call memfd_create + mmap(fd) instead of mmap(MAP_ANON).
memfd_create creates an anonymous tmpfs file with no name in the filesystem, as if a
/dev/shm file had been created and unlinked. Thus the memory is backed by a file,
and the VFIO UNMAP/REMAP proposal works.
- Steve