From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Fabiano Rosas" <farosas@suse.de>,
	qemu-devel@nongnu.org, "Claudio Fontana" <cfontana@suse.de>,
	jfehlig@suse.com, dfaggioli@suse.com, dgilbert@redhat.com,
	"Juan Quintela" <quintela@redhat.com>,
	"Nikolay Borisov" <nborisov@suse.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>
Subject: Re: [RFC PATCH v1 10/26] migration/ram: Introduce 'fixed-ram' migration stream capability
Date: Fri, 31 Mar 2023 10:39:23 -0400
Message-ID: <ZCbwm8qLMOyK93T/@x1n>
In-Reply-To: <ZCaSEfMphjQ1ic2j@redhat.com>

On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > From: Nikolay Borisov <nborisov@suse.com>
> > > 
> > > Implement 'fixed-ram' feature. The core of the feature is to ensure that
> > > each ram page of the migration stream has a specific offset in the
> > > resulting migration stream. The reasons why we'd want such behavior are
> > > twofold:
> > > 
> > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > >    bounded size, since pages which are dirtied multiple times will
> > >    always go to a fixed location in the file, rather than constantly
> > >    being added to a sequential stream. This eliminates cases where a vm
> > >    with, say, 1G of ram can result in a migration file that's 10s of
> > >    GBs, provided that the workload constantly redirties memory.
> > > 
> > >  - It paves the way to implement DIO-enabled save/restore of the
> > >    migration stream as the pages are ensured to be written at aligned
> > >    offsets.
> > > 
> > > The feature requires changing the stream format. First, a bitmap is
> > > introduced which tracks which pages have been written (i.e. are
> > > dirtied) during migration and subsequently it's being written in the
> > > resulting file, again at a fixed location for every ramblock. Zero
> > > pages are ignored as they'd be zero in the destination migration as
> > > well. With the changed format data would look like the following:
> > > 
> > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > 
> > What happens with huge pages?  Would page size matter here?
> > 
> > I would assume it's fine if it uses a constant (small) page size, assuming
> > that matches the granule at which qemu tracks dirty pages (which IIUC is
> > the host page size, not the guest's).
> > 
> > But I haven't given that any further thought yet. Maybe it would be
> > worthwhile in all cases to record page sizes here to be explicit, or the
> > meaning of the bitmap may not be clear (and then bitmap_size would be a
> > field just for sanity checks too).
> 
> I think recording the page sizes is an anti-feature in this case.
> 
> The migration format / state needs to reflect the guest ABI, but
> we need to be free to have a different backend config behind that
> on either side of the save/restore.
> 
> IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> small pages initially and after restore use 2 x 1 GB hugepages,
> or vice versa.
> 
> The important thing with the pages that are saved into the file
> is that they are a 1:1 mapping of guest RAM regions to file offsets.
> IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> in the file.
> 
> If the src VM used 1 GB pages, we would be writing a full 2 GB
> of data assuming both pages were dirty.
> 
> If the src VM used 4k pages, we would be writing some subset of
> the 2 GB of data, and the rest would be unwritten.
> 
> Either way, when reading back the data we restore it into either
> 1 GB pages or 4k pages, because any places that were unwritten
> originally will read back as zeros.

I think the page size information is already there, because there's a bitmap
embedded in the format, at least in the current proposal, and the bitmap can
only be defined with a page size provided in some form.

Here I agree the backend can change before/after a migration (live or
not).  The question, though, is whether page size matters in the snapshot
layout, rather than what the loaded QEMU instance will use as its backend.
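
To illustrate why the bitmap implies a page size, here is a minimal sketch
(not the patch's code). The helper names are hypothetical, and it assumes
one bit per page-sized granule of the ramblock's used_len:

```python
GiB = 1 << 30


def bitmap_bits(used_len, page_size):
    """Bits needed to cover used_len bytes at one bit per page_size granule."""
    return (used_len + page_size - 1) // page_size


def bit_to_range(bit, page_size):
    """Byte range [start, end) of guest RAM that a single bitmap bit covers."""
    start = bit * page_size
    return start, start + page_size


if __name__ == "__main__":
    used_len = 2 * GiB
    # The same 2 GiB ramblock needs very different bitmaps depending on
    # the granule, so a reader cannot interpret the bits without knowing it.
    print(bitmap_bits(used_len, 4096))
    print(bitmap_bits(used_len, GiB))
    print(bit_to_range(1, 4096))
```

With a 4k granule, bit 1 covers bytes 4096..8191; with a 1 GiB granule the
same bit covers the whole second gigabyte.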

> 
> > If postcopy might be an option, we'd want the page size to be the host page
> > size because then looking up the bitmap will be straightforward when deciding
> > whether we should copy over the page (UFFDIO_COPY) or fill it in with zeros
> > (UFFDIO_ZEROPAGE).
> 
> This format is only intended for the case where we are migrating to
> a random-access medium, aka a file, because the fixed RAM mappings
> to disk mean that we need to seek back to the original location to
> re-write pages that get dirtied. It isn't suitable for a live
> migration stream, and thus postcopy is inherently out of scope.

Yes, I've also commented on this in the cover letter, but I can expand a bit.

I mean supporting postcopy only when loading, not when saving.

Saving to file definitely cannot work with postcopy because there's no dest
qemu running.

Loading from file, OTOH, can work together with postcopy.

Right now, AFAICT, the current approach is to precopy-load the whole guest
image with the supported snapshot format (if I can call it just a snapshot).

What I want to say is that we can consider supporting postcopy on loading:
we start an "empty" QEMU dest node, and when any page fault triggers we
resolve it using userfault and look up the snapshot file, rather than
sending a request back to the source.  I mention this because there would be
two major benefits, which I noted briefly in my reply to the cover letter
but can expand on here:

  - Firstly, the snapshot format ideally stores pages at linear offsets,
    which means that when we know some page is missing we can look it up
    in the snapshot image in O(1) time.

  - Secondly, we don't need to send the page over the wire, nor do we
    need to send a request to the src qemu or anyone else.  All we need to
    do is test the bit in the snapshot bitmap, then:

    - If it is copied, do UFFDIO_COPY to resolve the page fault,
    - If it is not copied, do UFFDIO_ZEROPAGE (e.g., if not hugetlb;
      hugetlb can use a fake UFFDIO_COPY)
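
A minimal sketch of that lookup, just to make the idea concrete. It is not
code from the series: the function names are hypothetical, the bitmap is
assumed to be a byte string with LSB-first bit order, and pages are assumed
to sit at pages_offset + page_index * page_size in the file:

```python
import enum


class Action(enum.Enum):
    COPY = "UFFDIO_COPY"          # page was saved: read it from the file
    ZEROPAGE = "UFFDIO_ZEROPAGE"  # page was never written: map zeros


def bit_is_set(bitmap, bit):
    """Test one bit, assuming LSB-first ordering inside each byte."""
    return bool(bitmap[bit // 8] & (1 << (bit % 8)))


def resolve_fault(fault_offset, bitmap, pages_offset, page_size):
    """Decide how to resolve a fault at fault_offset within a ramblock.

    Returns (action, file_offset); file_offset is None when no read from
    the snapshot is needed.  Both steps are O(1).
    """
    page = fault_offset // page_size
    if bit_is_set(bitmap, page):
        return Action.COPY, pages_offset + page * page_size
    # For hugetlb this would have to be a UFFDIO_COPY of a zeroed buffer
    # instead; that case is not modelled in this sketch.
    return Action.ZEROPAGE, None


if __name__ == "__main__":
    bitmap = bytes([0b00000101])  # pages 0 and 2 were saved
    print(resolve_fault(0x2ABC, bitmap, 1 << 20, 4096))
    print(resolve_fault(0x1000, bitmap, 1 << 20, 4096))
```

No request ever leaves the destination: the only I/O is a single read at a
computed offset, and only for pages whose bit is set.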

So this is a perfect testing ground for using postcopy in a very efficient
way against a file snapshot.

Thanks,

-- 
Peter Xu


