qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Steven Sistare <steven.sistare@oracle.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Juan Quintela" <quintela@redhat.com>,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
Date: Mon, 17 Aug 2020 20:42:42 -0600	[thread overview]
Message-ID: <20200817204242.6b384ef7@x1.home> (raw)
In-Reply-To: <20200817154403.39ee42c6@x1.home>

On Mon, 17 Aug 2020 15:44:03 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 17 Aug 2020 17:20:57 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 8/17/2020 4:48 PM, Alex Williamson wrote:  
> > > On Mon, 17 Aug 2020 14:30:51 -0400
> > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
> > >>> Anonymous memory segments used by the guest are preserved across a re-exec
> > >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > >>> in the Linux kernel. For the madvise patches, see:
> > >>>
> > >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>
> > >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >>> ---
> > >>>  include/qemu/osdep.h | 7 +++++++
> > >>>  1 file changed, 7 insertions(+)      
> > >>
> > >> Hi Alex,
> > >>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> > >> live update series, is getting a chilly reception on lkml.  We could instead 
> > >> create guest memory using memfd_create and preserve the fd across exec.  However, 
> > >> the subsequent mmap(fd) will return a different VA than was used previously, 
> > >> which  is a problem for memory that was registered with vfio, as the original VA 
> > >> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> > >> such as vfio_iommu_replay.
> > >>
> > >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> > >> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> > >> vaddr with new_vaddr.  Flags cannot be changed.
> > >>
> > >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> > >> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> > >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> > >>
> > >> What do you think    
> > > 
> > > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > > ioctl, so I think we should just use one of the flag bits on the
> > > existing MAP_DMA ioctl for this variant.    
> > 
> > Sounds good.
> >   
> > > Reading through the discussion on the kernel side there seems to be
> > > some confusion around why vfio needs the vaddr beyond the user call to
> > > MAP_DMA though.  Originally this was used to test for virtually
> > > contiguous mappings for merging and splitting purposes.  This is
> > > defunct in the v2 interface, however the vaddr is now used largely for
> > > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > > does not share a container with an IOMMU device, then a user MAP_DMA
> > > ioctl essentially just registers the translation within the vfio
> > > container.  The mdev vendor driver can then later either request pages
> > > to be pinned for device DMA or can perform copy_to/from_user() to
> > > simulate DMA via the CPU.
> > > 
> > > Therefore I don't see that there's a simple re-architecture of the vfio
> > > IOMMU backend that could drop vaddr use.      
> > 
> > Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> > same conclusion.
> >   
> > > I'm a bit concerned this new
> > > remap proposal also raises the question of how do we prevent userspace
> > > remapping vaddrs racing with asynchronous kernel use of the previous
> > > vaddrs.      
> > 
> > Agreed.  After a quick glance at the code, holding iommu->lock during 
> > remap might be sufficient, but it needs more study.  
> 
> Unless you're suggesting an extended hold of the lock across the entire
> re-exec of QEMU, that's only going to prevent a race between a remap
> and a vendor driver pin or access, the time between the previous vaddr
> becoming invalid and the remap is unprotected.
> 
> > > Are we expecting guest drivers/agents to quiesce the device,
> > > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
> > 
> > No.  We want to support any guest, and the guest is not aware that qemu
> > live update is occurring.
> >   
> > > The vfio migration interface we've developed does have a mechanism to
> > > stop a device, would we need to use this here?  If we do have a
> > > mechanism to quiesce the device, is the only reason we're not unmapping
> > > everything and remapping it into the new address space the latency in
> > > performing that operation?  Thanks,    
> > 
> > Same answer - we don't require that the guest has vfio migration support.  
> 
> QEMU toggling the runstate of the device via the vfio migration
> interface could be done transparently to the guest, but if your
> intention is to support any device (where none currently support the
> migration interface) perhaps it's a moot point.  It seems like this
> scheme only works with IOMMU backed devices where the device can
> continue to operate against pinned pages, anything that might need to
> dynamically pin pages against the process vaddr as it's running async
> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Hmm, even if a device is running against pinned memory, how do we
handle device interrupts that occur during QEMU's downtime?  I see that
we reconfigure interrupts, but does QEMU need to drain the eventfd and
manually inject those missed interrupts or will setting up the irqfds
trigger a poll?  Thanks,

Alex



  reply	other threads:[~2020-08-18  2:43 UTC|newest]

Thread overview: 118+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
2020-09-11 16:24   ` Dr. David Alan Gilbert
2020-09-24 21:43     ` Steven Sistare
2020-09-25  9:07       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 02/32] savevm: VM handlers mode mask Steve Sistare
2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
2020-07-30 16:12   ` Eric Blake
2020-07-30 17:52     ` Steven Sistare
2020-09-11 16:43   ` Dr. David Alan Gilbert
2020-09-25 18:43     ` Steven Sistare
2020-09-25 22:22       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
2020-09-11 16:57   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-09-25  9:26       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
2020-07-30 16:14   ` Eric Blake
2020-07-30 18:00     ` Steven Sistare
2020-09-11 17:18       ` Dr. David Alan Gilbert
2020-09-24 21:49         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 06/32] savevm: HMP Command " Steve Sistare
2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
2020-07-30 16:17   ` Eric Blake
2020-07-30 18:02     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
2020-09-11 17:27   ` Dr. David Alan Gilbert
2020-09-24 21:50     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
2020-09-11 17:35   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
2020-09-11 17:50   ` Dr. David Alan Gilbert
2020-09-25 18:07     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
2020-09-11 17:53   ` Dr. David Alan Gilbert
2020-09-24 20:42     ` Steven Sistare
2020-09-25  9:03       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
2020-07-30 16:20   ` Eric Blake
2020-07-30 18:11     ` Steven Sistare
2020-07-31 10:07       ` Daniel P. Berrangé
2020-07-31 15:18         ` Steven Sistare
2020-07-30 17:03   ` Alex Bennée
2020-07-30 18:14     ` Steven Sistare
2020-07-31  9:44       ` Alex Bennée
2020-09-11 17:59       ` Dr. David Alan Gilbert
2020-09-24 21:51         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
2020-09-11 18:41   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
2020-07-30 16:22   ` Eric Blake
2020-07-30 18:14     ` Steven Sistare
2020-09-11 18:44   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
2020-09-11 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:47     ` Steven Sistare
2020-09-25 15:52       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
2020-09-11 18:52   ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
2020-09-11 19:00   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
2020-08-17 18:30   ` Steven Sistare
2020-08-17 20:48     ` Alex Williamson
2020-08-17 21:20       ` Steven Sistare
2020-08-17 21:44         ` Alex Williamson
2020-08-18  2:42           ` Alex Williamson [this message]
2020-08-19 21:52             ` Steven Sistare
2020-08-24 22:30               ` Alex Williamson
2020-10-08 16:32                 ` Steven Sistare
2020-10-15 20:36                   ` Alex Williamson
2020-10-19 16:33                     ` Steven Sistare
2020-10-26 18:28                       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 19/32] memory: ram_block_add cosmetic changes Steve Sistare
2020-07-30 15:14 ` [PATCH V1 20/32] vl: add helper to request re-exec Steve Sistare
2020-07-30 15:14 ` [PATCH V1 21/32] exec, memory: exec(3) to restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
2020-09-15 17:33   ` Dr. David Alan Gilbert
2020-09-15 17:53     ` Daniel P. Berrangé
2020-09-24 21:54     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 23/32] char: save/restore chardev socket fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
2020-07-31  9:06   ` Daniel P. Berrangé
2020-07-31 16:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 25/32] char: save/restore chardev pty fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 26/32] monitor: save/restore QMP negotiation status Steve Sistare
2020-07-30 15:14 ` [PATCH V1 27/32] vhost: reset vhost devices upon cprsave Steve Sistare
2020-07-30 15:14 ` [PATCH V1 28/32] char: restore terminal on restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 29/32] pci: export pci_update_mappings Steve Sistare
2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
2020-08-06 10:22   ` Jason Zeng
2020-08-07 20:38     ` Steven Sistare
2020-08-10  3:50       ` Jason Zeng
2020-08-19 21:15         ` Steven Sistare
2020-08-20 10:33           ` Jason Zeng
2020-10-07 21:25             ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 31/32] vfio-pci: trace pci config Steve Sistare
2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
2020-09-15 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
2020-07-30 18:48   ` Steven Sistare
2020-07-31  8:53     ` Daniel P. Berrangé
2020-07-31 15:27       ` Steven Sistare
2020-07-31 15:52         ` Daniel P. Berrangé
2020-07-31 17:20           ` Steven Sistare
2020-08-11 19:08           ` Dr. David Alan Gilbert
2020-07-30 17:15 ` Paolo Bonzini
2020-07-30 19:09   ` Steven Sistare
2020-07-30 21:39     ` Paolo Bonzini
2020-07-31 19:22       ` Steven Sistare
2020-07-30 17:49 ` Dr. David Alan Gilbert
2020-07-30 19:31   ` Steven Sistare
2020-08-04 18:18 ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200817204242.6b384ef7@x1.home \
    --to=alex.williamson@redhat.com \
    --cc=alex.bennee@linaro.org \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=steven.sistare@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).