All of lore.kernel.org
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Zeng" <jason.zeng@linux.intel.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Juan Quintela" <quintela@redhat.com>,
	qemu-devel@nongnu.org,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [PATCH V3 00/22] Live Update
Date: Fri, 14 May 2021 11:15:18 -0400	[thread overview]
Message-ID: <29356c3a-0b3f-2773-4f9f-18ff4ed4d5bb@oracle.com> (raw)
In-Reply-To: <YJ5kokhzyA5tCom3@stefanha-x1.localdomain>

On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>> to a new version in between.
>>>>
>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>> paused state and waits for the cprload command.
>>>
>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>> already opens fds passed using this mechanism.
>>>
>>> I haven't checked but it may be possible to drop some patches by reusing
>>> QEMU's monitor file descriptor passing since the code already knows how
>>> to open from 'getfd' fds.
>>>
>>> The reason why using QMP is interesting is because it eliminates the
>>> need for execve(2). QEMU may be unable to execute a program due to
>>> chroot, seccomp, etc.
>>>
>>> QMP would enable cprsave/cprload to work both with and without
>>> execve(2).
>>>
>>> One tricky thing with this approach might be startup ordering: how to
>>> get fds via the QMP monitor in the new process before processing the
>>> entire command-line.
>>
>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>> I suspect my recent vfio extensions would smooth the rough edges.
> 
> I wonder about the reason for VFIO's pid limitation, maybe because it
> pins pages from the original process?

The dma unmap code verifies that the requesting task is the same as the task that mapped
the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
to fix locked memory accounting, which is associated with the mm of the original task.

> Is this VFIO pid limitation the main reason why you chose to make QEMU
> execve(2) the new binary?

That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
I was working against vfio rather than with it.

Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
without exec, I still need the exec option.

>> However, the main issue is that guest ram must be backed by named shared memory, and
>> we would need to add code to support shared memory for all the secondary memory objects.
>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>> instances with anonymous guest memory.
> 
> Thanks for explaining this more in the other sub-thread. The secondary
> memory objects you mentioned are relatively small so I don't think
> saving them in the traditional way is a problem.
> 
> Two approaches for zero-copy memory migration fit into QEMU's existing
> migration infrastructure:
> 
> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>   etc) so they are not saved into the savevm file. The existing --object
>   memory-backend-file syntax can be used.
> 
> - Extending the live migration protocol to detect when file descriptor
>   passing is available (i.e. UNIX domain socket migration) and using
>   that for memory-backend-* objects that have fds.
> 
> Either of these approaches would handle RAM with existing savevm/migrate
> commands.

Yes, but the vfio issues would still need to be solved, and we would need new
command line options to back existing and future secondary memory objects with 
named shared memory.

> The remaining issue is how to migrate VFIO and other file descriptors
> that cannot be reopened by the new process. As mentioned, QEMU already
> has file descriptor passing support in the QMP monitor and support for
> opening passed file descriptors (see qemu_open_internal(),
> monitor_fd_param(), and socket_get_fd()).
> 
> The advantage of integrating live update functionality into the existing
> savevm/migrate commands is that it will work in more use cases with
> less code duplication/maintenance/bitrot prevention than the
> special-case cprsave command in this patch series.
> 
> Maybe there is a fundamental technical reason why live update needs to
> be different from QEMU's existing migration commands but I haven't
> figured it out yet.

vfio and anonymous memory.

Regarding code duplication, I did consider whether to extend the migration
syntax and implementation versus creating something new.  Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that handling both in the same functions would be less readable and
maintainable.  After feedback during the V1 review, I simplified the cprsave
code by by calling qemu_save_device_state, as Xen does, thus eliminating any
interaction with the migration code.

Regarding bit rot, I still need to add a cpr test to the test suite, when the 
review is more complete and folks agree on the final form of the functionality.

I do like the idea of supporting update without exec, but as a future project, 
and not at the expense of dropping update with exec.

- Steve


  reply	other threads:[~2021-05-14 15:50 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
2021-05-07 12:25 ` [PATCH V3 02/22] qemu_ram_volatile Steve Sistare
2021-05-07 12:25 ` [PATCH V3 03/22] oslib: qemu_clr_cloexec Steve Sistare
2021-05-07 12:25 ` [PATCH V3 04/22] util: env var helpers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 05/22] machine: memfd-alloc option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
2021-05-07 14:31   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-14  8:18       ` Daniel P. Berrangé
2021-05-12 16:27   ` Stefan Hajnoczi
2021-05-13 20:20     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
2021-05-12 16:19   ` Stefan Hajnoczi
2021-05-13 20:21     ` Steven Sistare
2021-05-14 11:28       ` Stefan Hajnoczi
2021-05-14 15:14         ` Steven Sistare
2021-05-18 13:42           ` Stefan Hajnoczi
2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
2021-06-04 13:59   ` Eric Blake
2021-06-07 17:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 09/22] cpr: HMP interfaces Steve Sistare
2021-05-07 12:25 ` [PATCH V3 10/22] pci: export functions for cpr Steve Sistare
2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
2021-05-19 22:38   ` Alex Williamson
2021-05-21 13:33     ` Steven Sistare
2021-05-21 21:07       ` Alex Williamson
2021-05-21 21:18         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:29     ` Steven Sistare
2021-06-11 18:15       ` Steven Sistare
2021-06-11 19:43         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:31     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 14/22] vhost: reset vhost devices upon cprsave Steve Sistare
2021-05-07 12:25 ` [PATCH V3 15/22] hostmem-memfd: cpr support Steve Sistare
2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
2021-05-07 14:33   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 17/22] chardev: cpr for simple devices Steve Sistare
2021-05-07 12:25 ` [PATCH V3 18/22] chardev: cpr for pty Steve Sistare
2021-05-07 12:25 ` [PATCH V3 19/22] chardev: cpr for sockets Steve Sistare
2021-05-07 12:25 ` [PATCH V3 20/22] cpr: only-cpr-capable option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 21/22] cpr: maintainers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 22/22] simplify savevm Steve Sistare
2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
2021-05-13 20:42   ` Steven Sistare
2021-05-12 16:42 ` Stefan Hajnoczi
2021-05-13 20:21   ` Steven Sistare
2021-05-14 11:53     ` Stefan Hajnoczi
2021-05-14 15:15       ` Steven Sistare [this message]
2021-05-17 11:40         ` Stefan Hajnoczi
2021-05-17 19:10           ` Alex Williamson
2021-05-18 13:39             ` Stefan Hajnoczi
2021-05-18 15:48               ` Steven Sistare
2021-05-18  9:57         ` Dr. David Alan Gilbert
2021-05-18 16:00           ` Steven Sistare
2021-05-18 19:23             ` Dr. David Alan Gilbert
2021-05-18 20:01               ` Alex Williamson
2021-05-18 20:14               ` Steven Sistare
2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
2021-05-21 14:55                   ` Steven Sistare
2021-06-15 19:14                     ` Dr. David Alan Gilbert
2021-06-24 15:05                       ` Steven Sistare
2021-07-06 17:31                         ` Steven Sistare
2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
2021-05-21 14:56                   ` Steven Sistare
2021-05-24 10:39                     ` Dr. David Alan Gilbert
2021-06-02 13:51                       ` Steven Sistare
2021-06-03 19:36                         ` Dr. David Alan Gilbert
2021-06-03 20:44                           ` Daniel P. Berrangé
2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
2021-06-14 14:31                               ` Steven Sistare
2021-06-14 14:36                                 ` Daniel P. Berrangé
2021-06-15 19:05                               ` Dr. David Alan Gilbert
2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
2021-06-14 14:33                             ` Steven Sistare
2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
2021-06-02 15:19   ` Steven Sistare

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=29356c3a-0b3f-2773-4f9f-18ff4ed4d5bb@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=alex.bennee@linaro.org \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=jason.zeng@linux.intel.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.