qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Juan Quintela" <quintela@redhat.com>,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Anthony Yznaga" <anthony.yznaga@oracle.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
Date: Wed, 19 Aug 2020 17:52:26 -0400	[thread overview]
Message-ID: <64d3c11a-dda4-a9ed-0bdd-621a0b4f6f75@oracle.com> (raw)
In-Reply-To: <20200817204242.6b384ef7@x1.home>

On 8/17/2020 10:42 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 15:44:03 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Mon, 17 Aug 2020 17:20:57 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:  
>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>     
>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>  1 file changed, 7 insertions(+)      
>>>>>
>>>>> Hi Alex,
>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>> such as vfio_iommu_replay.
>>>>>
>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>
>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>
>>>>> What do you think    
>>>>
>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>> ioctl, so I think we should just use one of the flag bits on the
>>>> existing MAP_DMA ioctl for this variant.    
>>>
>>> Sounds good.
>>>   
>>>> Reading through the discussion on the kernel side there seems to be
>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>> contiguous mappings for merging and splitting purposes.  This is
>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>> ioctl essentially just registers the translation within the vfio
>>>> container.  The mdev vendor driver can then later either request pages
>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>> simulate DMA via the CPU.
>>>>
>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>> IOMMU backend that could drop vaddr use.      
>>>
>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>> same conclusion.
>>>   
>>>> I'm a bit concerned this new
>>>> remap proposal also raises the question of how do we prevent userspace
>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>> vaddrs.      
>>>
>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>> remap might be sufficient, but it needs more study.  
>>
>> Unless you're suggesting an extended hold of the lock across the entire
>> re-exec of QEMU, that's only going to prevent a race between a remap
>> and a vendor driver pin or access, the time between the previous vaddr
>> becoming invalid and the remap is unprotected.

OK.  What if we exclude mediated devices?  Its appears they are the only
ones where the kernel may async'ly use the vaddr, via call chains ending in 
vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.

The other functions that use dma->vaddr are
    vfio_dma_do_map 
    vfio_pin_map_dma 
    vfio_iommu_replay 
    vfio_pin_pages_remote
and they are all initiated via userland ioctl (if I have traced all the code 
paths correctly).  Thus iommu->lock would protect them.

We would block live update in qemu if the config includes a mediated device.

VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.

>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
>>>
>>> No.  We want to support any guest, and the guest is not aware that qemu
>>> live update is occurring.
>>>   
>>>> The vfio migration interface we've developed does have a mechanism to
>>>> stop a device, would we need to use this here?  If we do have a
>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>> everything and remapping it into the new address space the latency in
>>>> performing that operation?  Thanks,    
>>>
>>> Same answer - we don't require that the guest has vfio migration support.  
>>
>> QEMU toggling the runstate of the device via the vfio migration
>> interface could be done transparently to the guest, but if your
>> intention is to support any device (where none currently support the
>> migration interface) perhaps it's a moot point.  

That sounds useful when devices support.  Can you give me some function names
or references so I can study this qemu-based "vfio migration interface".

>> It seems like this
>> scheme only works with IOMMU backed devices where the device can
>> continue to operate against pinned pages, anything that might need to
>> dynamically pin pages against the process vaddr as it's running async
>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Yes, true of this remap proposal.

I wanted to unconditionally support all devices, which is why I think that

MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the

lkml discussion.

> Hmm, even if a device is running against pinned memory, how do we
> handle device interrupts that occur during QEMU's downtime?  I see that
> we reconfigure interrupts, but does QEMU need to drain the eventfd and
> manually inject those missed interrupts or will setting up the irqfds
> trigger a poll?  Thanks,

My existing code is apparently deficient in this area; I close the pre-exec eventfd,
and post exec create a new eventfd and attach it to the vfio device.  Jason and I
are discussing alternatives in this thread of the series:
  https://lore.kernel.org/qemu-devel/0da862c8-74bc-bf06-a436-4ebfcb9dd8d4@oracle.com/

I am hoping that unconditionally injecting a (potentially spurious) interrupt on a 
new eventfd after exec will solve the problem.

BTW, thanks for discussing these issues.  I appreciate it.

- Steve


  reply	other threads:[~2020-08-19 21:53 UTC|newest]

Thread overview: 118+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
2020-09-11 16:24   ` Dr. David Alan Gilbert
2020-09-24 21:43     ` Steven Sistare
2020-09-25  9:07       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 02/32] savevm: VM handlers mode mask Steve Sistare
2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
2020-07-30 16:12   ` Eric Blake
2020-07-30 17:52     ` Steven Sistare
2020-09-11 16:43   ` Dr. David Alan Gilbert
2020-09-25 18:43     ` Steven Sistare
2020-09-25 22:22       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
2020-09-11 16:57   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-09-25  9:26       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
2020-07-30 16:14   ` Eric Blake
2020-07-30 18:00     ` Steven Sistare
2020-09-11 17:18       ` Dr. David Alan Gilbert
2020-09-24 21:49         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 06/32] savevm: HMP Command " Steve Sistare
2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
2020-07-30 16:17   ` Eric Blake
2020-07-30 18:02     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
2020-09-11 17:27   ` Dr. David Alan Gilbert
2020-09-24 21:50     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
2020-09-11 17:35   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
2020-09-11 17:50   ` Dr. David Alan Gilbert
2020-09-25 18:07     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
2020-09-11 17:53   ` Dr. David Alan Gilbert
2020-09-24 20:42     ` Steven Sistare
2020-09-25  9:03       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
2020-07-30 16:20   ` Eric Blake
2020-07-30 18:11     ` Steven Sistare
2020-07-31 10:07       ` Daniel P. Berrangé
2020-07-31 15:18         ` Steven Sistare
2020-07-30 17:03   ` Alex Bennée
2020-07-30 18:14     ` Steven Sistare
2020-07-31  9:44       ` Alex Bennée
2020-09-11 17:59       ` Dr. David Alan Gilbert
2020-09-24 21:51         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
2020-09-11 18:41   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
2020-07-30 16:22   ` Eric Blake
2020-07-30 18:14     ` Steven Sistare
2020-09-11 18:44   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
2020-09-11 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:47     ` Steven Sistare
2020-09-25 15:52       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
2020-09-11 18:52   ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
2020-09-11 19:00   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
2020-08-17 18:30   ` Steven Sistare
2020-08-17 20:48     ` Alex Williamson
2020-08-17 21:20       ` Steven Sistare
2020-08-17 21:44         ` Alex Williamson
2020-08-18  2:42           ` Alex Williamson
2020-08-19 21:52             ` Steven Sistare [this message]
2020-08-24 22:30               ` Alex Williamson
2020-10-08 16:32                 ` Steven Sistare
2020-10-15 20:36                   ` Alex Williamson
2020-10-19 16:33                     ` Steven Sistare
2020-10-26 18:28                       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 19/32] memory: ram_block_add cosmetic changes Steve Sistare
2020-07-30 15:14 ` [PATCH V1 20/32] vl: add helper to request re-exec Steve Sistare
2020-07-30 15:14 ` [PATCH V1 21/32] exec, memory: exec(3) to restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
2020-09-15 17:33   ` Dr. David Alan Gilbert
2020-09-15 17:53     ` Daniel P. Berrangé
2020-09-24 21:54     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 23/32] char: save/restore chardev socket fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
2020-07-31  9:06   ` Daniel P. Berrangé
2020-07-31 16:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 25/32] char: save/restore chardev pty fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 26/32] monitor: save/restore QMP negotiation status Steve Sistare
2020-07-30 15:14 ` [PATCH V1 27/32] vhost: reset vhost devices upon cprsave Steve Sistare
2020-07-30 15:14 ` [PATCH V1 28/32] char: restore terminal on restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 29/32] pci: export pci_update_mappings Steve Sistare
2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
2020-08-06 10:22   ` Jason Zeng
2020-08-07 20:38     ` Steven Sistare
2020-08-10  3:50       ` Jason Zeng
2020-08-19 21:15         ` Steven Sistare
2020-08-20 10:33           ` Jason Zeng
2020-10-07 21:25             ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 31/32] vfio-pci: trace pci config Steve Sistare
2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
2020-09-15 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
2020-07-30 18:48   ` Steven Sistare
2020-07-31  8:53     ` Daniel P. Berrangé
2020-07-31 15:27       ` Steven Sistare
2020-07-31 15:52         ` Daniel P. Berrangé
2020-07-31 17:20           ` Steven Sistare
2020-08-11 19:08           ` Dr. David Alan Gilbert
2020-07-30 17:15 ` Paolo Bonzini
2020-07-30 19:09   ` Steven Sistare
2020-07-30 21:39     ` Paolo Bonzini
2020-07-31 19:22       ` Steven Sistare
2020-07-30 17:49 ` Dr. David Alan Gilbert
2020-07-30 19:31   ` Steven Sistare
2020-08-04 18:18 ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=64d3c11a-dda4-a9ed-0bdd-621a0b4f6f75@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=alex.bennee@linaro.org \
    --cc=alex.williamson@redhat.com \
    --cc=anthony.yznaga@oracle.com \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).