* [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration
@ 2018-06-17 10:18 Joshua Otto
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Hi,

A little over a year ago, I posted a patch series implementing support for
post-copy live migration via xenpaging [1].  Following Andrew and Wei's review
of the initial refactoring patches, I promised to follow up with revised
patches, a design document and an experimental performance evaluation.  It took
a lot longer than I thought, but I've finally prepared all of those things now -
hopefully better late than never :)

The patches are the v2 of the series from [1], rebased against "3fafdc2 xen/arm:
p2m: Fix incorrect mapping of superpages", the tip of master when I performed
the rebase and experiments: late May 2017.  They're accessible on GitHub at [2].

Changes from v1:
 - addressed the feedback received from the first round
 - fixed bugs discovered during performance experiments
 - based on results from the performance experiments, added a paging op to
   populate pages directly into the evicted state

Though I haven't actually tried to do so myself, a quick look at the relevant
code indicates that a relatively painless rebase should still be possible.

The body of this mail is the report.  It is intended to describe the purpose,
design and behaviour of live migration both before and after the patches in
sufficient detail to enable a future contributor or academic researcher with
only general Xen familiarity to pick them up if they turn out to be useful in
the future.  I prepared it in plain text for the mailing list, and based its
format on Haozhong Zhang's vNVDIMM design document [3].

TL;DR: These (now slightly stale) patches implement post-copy live migration
       using xenpaging.  They provide a modest downtime reduction when used in
       hybrid mode with pre-copy, likely because they permit the memory
       migration to proceed in parallel with guest device model set-up.  This
       benefit probably doesn't outweigh the cost in terms of increased
       implementation complexity.

Thanks for reading!

- Joshua Otto

[1] https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg03491.html
[2] https://github.com/jtotto/xen/commits/postcopy-v2
[3] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html

Note: I've sent this from my personal e-mail account because I'm no longer able
to send mail from my old school address, though I'm still able to receive mail
sent to it.

Post-Copy Live Migration for Xen - Design and Performance Evaluation
====================================================================

Xen supports live migration of guests between physical hosts.  Documentation of
this feature can be found at [a] - summarized briefly, it enables system
administrators to 'move' a running guest from one physical host running Xen to
another.

One of the most difficult sub-problems of live migration is the memory
migration.  Today, Xen's live memory migration employs an iterative pre-copy
algorithm, in which all guest memory is transmitted from the migration sender to
receiver _before_ execution is stopped at the sender and resumed at the
receiver.  This document describes the design, implementation and performance
evaluation of an alternative live memory migration algorithm, _post-copy_ live
migration, that attempts to address some of the shortcomings of pre-copy
migration by deferring the transmission of part or all of the guest's memory
until after it is resumed at its destination.  The described design adds support
for post-copy without altering the existing architecture of the migration
feature, taking advantage of the xenpaging mechanism to implement post-copy
paging purely in the toolstack.  The experimental performance evaluation of the
new feature indicates that, for the SQL database workload evaluated, post-copy
in combination with some number of pre-copy iterations yields modest
downtime-reduction benefits, but that pure post-copy results in unacceptable
application-level guest downtime.

Content
=======
1. Background
 1.1 Implemented in Xen today: pre-copy memory migration
 1.2 Proposed enhancement: post-copy memory migration
2. Design
 2.1 Current design
  2.1.1 `xl migrate` <-> `xl migrate-receive`, Part One
  2.1.2 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part One
  2.1.3 libxl__stream_write <-> libxl__stream_read, Part One
  2.1.4 xc_domain_save() <-> xc_domain_restore()
  2.1.5 libxl__stream_write <-> libxl__stream_read, Part Two
  2.1.6 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part Two
  2.1.7 `xl migrate` <-> `xl migrate-receive`, Part Two
 2.2 Proposed design changes
  2.2.1 `xl migrate` <-> `xl migrate-receive`, Part One
  2.2.2 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part One
  2.2.3 libxl__stream_write <-> libxl__stream_read, Part One
  2.2.4 xc_domain_save() <-> xc_domain_restore(), Part One
   2.2.4.1 Pre-copy policy
   2.2.4.2 Post-copy transition
  2.2.5 libxl__stream_write <-> libxl__stream_read, Part Two
  2.2.6 xc_domain_save() <-> xc_domain_restore(), Part Two: memory post-copy
   2.2.6.1 Background: xenpaging
   2.2.6.2 Post-copy paging
   2.2.6.3 Batch page-out operations
  2.2.7 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part Two
  2.2.8 `xl migrate` <-> `xl migrate-receive`, Part Two
3. Performance evaluation
 3.1 Prior work and metrics
 3.2 Experiment: pgbench
  3.2.1 Experiment design
  3.2.2 Results
   3.2.2.1 Algorithms A vs. E: stop-and-copy vs. post-copy after iterative pre-copy
   3.2.2.2 Algorithm C: pure post-copy
   3.2.2.3 Algorithms B vs. D: post-copy after a single pre-copy iteration
 3.3 Further Experiments
4. Conclusion
5. References

1. Background

1.1 Implemented in Xen today: pre-copy memory migration

 Live migration of guest memory in Xen is currently implemented using an
 iterative pre-copy algorithm with fixed iteration-count and remaining-page
 thresholds.  It can be described at a high level by the following sequence of
 steps:
  1) transmit all of the guest's pages to the migration receiver
  2) while more than DIRTY_THRESHOLD pages have been dirtied since they were
     last transmitted and fewer than ITERATION_THRESHOLD transmission iterations
     have been performed...
        3) transmit all guest pages modified since last transmission, goto 2)
  4) pause the guest
  5) transmit any remaining dirty pages, along with the guest's virtual hardware
     configuration and the state of its virtual devices

 If the migration process can transmit pages faster than they are dirtied by the
 guest, the migration loop converges - each successive iteration begins with
 fewer dirty pages than the last.  If it converges sufficiently quickly, the
 number of dirty pages drops below DIRTY_THRESHOLD pages in fewer than
 ITERATION_THRESHOLD iterations and the guest experiences minimal downtime.
 (The current values of DIRTY_THRESHOLD and ITERATION_THRESHOLD are 50 and 5,
 respectively.)
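
 The threshold logic above can be modelled in a few lines (a simplified
 sketch with hypothetical names, not the actual libxc code):

```c
#include <assert.h>

#define DIRTY_THRESHOLD     50
#define ITERATION_THRESHOLD  5

/*
 * Illustrative model of the fixed-threshold pre-copy loop: given the number
 * of pages observed dirty after each transmission round, return the number
 * of pages left over for the final paused stop-and-copy.
 * dirtied_per_iter[i] is the dirty count observed after iteration i, with
 * iteration 0 being the initial transmit-everything pass.
 */
unsigned int precopy_remaining(const unsigned int *dirtied_per_iter,
                               unsigned int niters)
{
    unsigned int iter = 0, dirty;

    dirty = dirtied_per_iter[iter++];   /* after the initial full pass */

    while (dirty > DIRTY_THRESHOLD && iter < ITERATION_THRESHOLD &&
           iter < niters)
        dirty = dirtied_per_iter[iter++];

    return dirty; /* transmitted with the guest paused */
}
```

 With a converging guest the return value falls under DIRTY_THRESHOLD; with a
 WWS dirtied faster than the link can drain it, the loop exits on the
 iteration cap with a large remainder, which is exactly the 'stuck' case
 described below.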

 This approach has worked extremely well for the last >10 years, but has some
 drawbacks:
  - The guest's page-dirtying rate is likely non-uniform across its pages.
    Instead, most guests will dirty a subset of their pages much more frequently
    than the rest (this subset is often referred to as the Writable Working Set,
    or WWS).  If the WWS is larger than DIRTY_THRESHOLD and its pages are
    dirtied at a higher rate than the migration transmission rate, the migration
    will 'get stuck' trying to migrate these pages.  In this situation:
     - All the time and bandwidth spent attempting pre-copy of these pages is
       wasted (no further reduction in stop-and-copy downtime can be gained)
     - The guest suffers downtime for the full duration required to transmit the
       WWS at the end of the migration anyway.
  - Migrating guests continue to consume CPU and I/O resources at the sending
    host for the entire duration of the memory migration, which limits the
    effectiveness of migration for the purpose of load-balancing these
    resources.

1.2 Proposed enhancement: post-copy memory migration

 Post-copy live migration is an alternative memory migration technique that (at
 least theoretically) addresses these problems.  Under post-copy migration,
 execution of the guest is moved from the sending to receiving host _before_ the
 memory migration is complete.  As the guest executes at the receiver, any
 attempts to access unmigrated pages are intercepted as page-faults and the
 guest is paused while the accessed pages are synchronously fetched from the
 sender.  When not servicing faults for specific unmigrated pages, the sender
 can push the remaining unmigrated pages in the background.  The technique can
 be employed immediately at the start of a migration, or after any amount of
 pre-copying (including in the middle of a pre-copy iteration).

 The post-copy technique exploits the fact that the guest can make some progress
 at the receiver without access to all of its memory, permitting execution to
 proceed in parallel with the continuing memory migration and breaking up the
 single long stop-and-copy downtime into smaller intervals interspersed with
 periods of execution.  Depending on the nature of the application running in
 the guest, this can be the difference between only degraded performance and
 observable downtime.

 Compared to the existing pre-copy technique, post-copy also has the same
 total-migration-time and bandwidth consumption advantages as outright
 stop-and-copy: each page is migrated exactly once, rather than arbitrarily many
 times according to the dirtying behaviour of the guest.
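
 This bandwidth argument can be made concrete with a toy page-transmission
 count (illustrative accounting only, not Xen code):

```c
#include <assert.h>

/*
 * Toy accounting: total page transmissions under iterative pre-copy
 * (initial full pass plus one retransmission of each round's dirty set)
 * versus post-copy, where every page crosses the wire exactly once.
 */
unsigned int precopy_pages_sent(unsigned int total_pages,
                                const unsigned int *dirtied_per_iter,
                                unsigned int niters)
{
    unsigned int sent = total_pages, i;

    for (i = 0; i < niters; i++)
        sent += dirtied_per_iter[i]; /* each round resends its dirty set */

    return sent;
}

unsigned int postcopy_pages_sent(unsigned int total_pages)
{
    return total_pages;
}
```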

2. Design

 The live migration feature is implemented almost entirely in the toolstack, by
 a set of cooperating dom0 processes distributed between both peers whose
 functionality is split across four layers:

 Layer         |     Sender Process     |     Receiver Process
 --------------+------------------------+------------------------------
  xl           | `xl migrate`           | `xl migrate-receive`
  libxl        | libxl_domain_suspend() | libxl_domain_create_restore()
  libxl stream | libxl__stream_write    | libxl__stream_read
  -------- (libxl-save-helper process boundary) -----------------------
  libxc        | xc_domain_save()       | xc_domain_restore()

 Section 2.1 describes the flow of control through each of these layers in the
 existing design for the case of a live migration of an HVM domain.  Section 2.2
 describes the changes to the existing design required to accommodate the
 introduction of post-copy memory migration.

2.1 Current design

2.1.1 `xl migrate` <-> `xl migrate-receive`, Part One

 An administrator (or automation) initiates a live migration with the `xl
 migrate` command at the sending host, specifying the domain to be migrated and
 the receiving host.

 `xl migrate`:
  - Gathers the domain's xl.cfg(5)-level configuration.
  - Spawns an SSH child process that launches `xl migrate-receive` at the
    destination host, thereby establishing a secure, bidirectional
    stream-oriented communication channel between the remotely cooperating
    migration processes.
  - Waits to receive the `migrate_receiver_banner` message transmitted by the
    `xl migrate-receive` peer, confirming the viability of the link and the
    readiness of the peer.
  - Transmits the domain configuration gathered previously.
  - Calls libxl_domain_suspend().  This is the handoff to the libxl API, which
    handles the migration of the domain's _state_ now that its configuration is
    taken care of.

 Meanwhile, the setup path in the `xl migrate-receive` peer at the destination:
  - Immediately transmits the `migrate_receiver_banner`.
  - Receives the xl.cfg(5) domain configuration in binary format and computes
    from it a libxl_domain_config structure.
  - Calls libxl_domain_create_restore() with the computed configuration and
    communication streams.

2.1.2 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part One

 libxl_domain_suspend():
  - Initializes various mechanisms required to support live migration (e.g.
    guest suspension and QEMU logdirty support)
  - Calls libxl__stream_write_start() to kick off the async stream-writing path.
  - Drops into the AO_INPROGRESS async event loop, which drives control flow for
    the remainder of the migration.

    Note: as a first-time reader of libxl when exploring this code path, I found
    the invocation of the AO_INPROGRESS event loop at [s] to be _extremely_
    non-obvious, because the macro isn't function-like - at first I assumed
    something like '#define AO_INPROGRESS EINPROGRESS', while reality is closer
    to '#define AO_INPROGRESS do { poll(); dispatch_events(); } while (!done)'

 libxl_domain_create_restore(), meanwhile:
  - Validates the configuration of the domain for compatibility with the
    receiving host and the live migration process.
  - Creates an 'empty' domain via xc_domain_create(), to be subsequently filled
    in with the state of the migrating domain.
  - Prepares the new domain's XenStore hierarchy.
  - Calls libxl__stream_read_start() to kick off the async stream-reading path.
  - Drops into the AO_INPROGRESS async event loop.

2.1.3 libxl__stream_write <-> libxl__stream_read, Part One

 The stream writer:
  - Writes the stream header.
  - Writes the LIBXC_CONTEXT record, indicating that control of the stream is to
    be transferred to libxc for the migration of the guest's virtual
    architectural state.
  - Launches the libxl-save-helper, which exists to permit the synchronous
    execution of xc_domain_save() while keeping the libxl API asynchronous from
    the perspective of the library client.

 The stream reader:
  - Reads the stream header.
  - Reads the LIBXC_CONTEXT record, and launches its own libxl-save-helper to
    run xc_domain_restore().

2.1.4 xc_domain_save() <-> xc_domain_restore()

 xc_domain_save():
  - Writes the Image and Domain headers.
  - Allocates the dirty_bitmap, a bitmap tracking the set of guest pages whose
    up-to-date contents aren't known at the receiver.
  - Enables the 'logdirty' hypervisor and emulator mechanisms.
  - Transmits all of the pages of the guest in sequence.
      - Guest pages are transmitted in batches of 1024 at a time.  Transmitting
        a batch entails mapping each page in the batch into the process via the
        xenforeignmemory_map() interface and collecting them into an iovec for
        consumption by writev(2).
  - Iteratively, until either of the conditions for termination are met:
      - Refreshes the contents of the dirty_bitmap via
        XEN_DOMCTL_SHADOW_OP_CLEAN, which atomically records the current state
        of the dirty bitmap maintained in Xen and clears it (marking all pages
        as 'clean' again for the next round).
      - Re-transmits each of the pages marked in the updated dirty_bitmap.
  - Suspends the domain (via IPC to the controlling libxl).
  - Obtains the 'final' dirty_bitmap - since the guest is now paused, it
    can no longer dirty pages, so transmitting the pages marked in this
    bitmap will ensure that the receiving peer has the up-to-date contents
    of every page.
  - Transmits these pages.
  - Collects and transmits the rest of the HVM state: at present this
    includes TSC info, the architectural state of each vCPU (encapsulated in
    a blob of 'HVM context' extracted via domctl), and the HVM_PARAMS (which
    describe, among other things, a set of 'magic' pages within the guest).
  - Transmits the END record.
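
 A minimal sketch of the batching scheme described above (illustrative only;
 the real libxc code maps guest pages via xenforeignmemory_map() and prefixes
 record headers, both omitted here):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define MAX_BATCH 1024   /* matches the 1024-page batches described above */

/*
 * Gather a batch of already-mapped page buffers into an iovec and hand
 * them to writev(2) in a single system call, as the PAGE_DATA
 * transmission path does.
 */
ssize_t send_page_batch(int fd, void **pages, size_t npages)
{
    struct iovec iov[MAX_BATCH];
    size_t i;

    if (npages == 0 || npages > MAX_BATCH)
        return -1;

    for (i = 0; i < npages; i++) {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = PAGE_SIZE;
    }

    return writev(fd, iov, (int)npages);
}
```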

 xc_domain_restore():
  - Reads and validates the Image and Domain headers.
  - Allocates the populated_pfns bitmap, a bitmap tracking the set of guest
    pages that have been 'populated' (allocated by the hypervisor for use by the
    guest).
  - Iteratively consumes the stream of PAGE_DATA records transmitted as the
    sender executes the migration loop.
      - Consuming a PAGE_DATA record entails populating each of the pages in the
        received record that hasn't previously been populated (as recorded in
        populated_pfns), then mapping all of the pages in the batch and updating
        their contents with the data in the new record.
  - After the suspension of the guest by the sender, receives the final sequence
    of PAGE_DATA records and the remainder of the state records.  This entails
    installing the received HVM context, TSC info and HVM params into the guest.
  - Consumes the END record.
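
 The populate-once bookkeeping can be sketched like this (hypothetical names;
 the real code issues a populate hypercall where this sketch merely bumps a
 counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch of the populated_pfns bookkeeping: a pfn is
 * populated at most once, the first time it appears in a PAGE_DATA record.
 */
struct restore_ctx {
    unsigned long *populated_pfns;  /* 1 bit per pfn */
    uint64_t max_pfn;
    unsigned int populate_calls;    /* stands in for the populate hypercall */
};

bool pfn_is_populated(const struct restore_ctx *ctx, uint64_t pfn)
{
    return ctx->populated_pfns[pfn / 64] & (1UL << (pfn % 64));
}

/* Populate the pfn iff this is the first record containing it. */
void ensure_populated(struct restore_ctx *ctx, uint64_t pfn)
{
    if (pfn_is_populated(ctx, pfn))
        return;
    ctx->populated_pfns[pfn / 64] |= 1UL << (pfn % 64);
    ctx->populate_calls++;  /* real code would populate the page here */
}
```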

 After the sender has transmitted the END record and the receiver has consumed
 it, control flow passes out of libxc and back into the libxl-save-helper on
 each side.  The result on each side is reported via IPC back to the main libxl
 process and both helpers terminate.  These terminations are observed as
 asynchronous events in libxl that resume control flow at that level.

2.1.5 libxl__stream_write <-> libxl__stream_read, Part Two

 The stream writer next proceeds along an asynchronous chain of record
 composition and transmission:
  - First, emulator data maintained in XenStore is collected and transmitted.
  - Next, the state of the emulator itself (the 'emulator context') is collected
    and transmitted.
  - Finally the libxl END record is transmitted.  At this point, the libxl
    stream is complete and the stream completion callback is invoked.

 The stream reader executes an asynchronous record receipt loop that consumes
 each of these records in turn.
  - The emulator XenStore data is mirrored into the receiver XenStore.
  - The emulator context blob is written to local storage for subsequent
    consumption during emulator establishment.
  - When the END record is received, the completion callback of the stream
    reader is invoked.

2.1.6 libxl_domain_suspend() <-> libxl_domain_create_restore(), Part Two

 Relatively little happens in this phase at the sender side: some teardown is
 carried out and then the libxl AO is marked as complete, terminating the
 AO_INPROGRESS event loop of libxl_domain_suspend() and returning control flow
 back to `xl migrate`.

 In libxl_domain_create_restore(), on the other hand, the work of unpacking the
 rest of the received guest state and preparing it for resumption ('building
 it', in the vocabulary of the code) remains.  To be completely honest, my
 understanding of exactly what this entails is a little shaky, but the flow of
 asynchronous control through this process following completion of the stream
 follows roughly this path, with the names of each step giving a reasonable hint
 at the work being performed:
  -> domcreate_stream_done() [b]
  -> domcreate_rebuild_done() [c]
  -> domcreate_launch_dm() [d]
  -> domcreate_devmodel_started() [e]
  -> domcreate_attach_devices() (re-entered iteratively for each device) [f]
  -> domcreate_complete() [g]

 domcreate_complete() marks the libxl AO as complete, terminating the
 AO_INPROGRESS loop of libxl_domain_create_restore() and returning control flow
 to `xl migrate-receive`.

2.1.7 `xl migrate` <-> `xl migrate-receive`, Part Two

 At this point, the guest is paused at the sending host and ready to be unpaused
 at the receiving host.  Logic in the xl tools then carries out the following
 handshake to safely destroy the guest at the sender and actually unpause it at
 the receiver:

 Sender: After the return of libxl_domain_suspend(), the sender waits
         synchronously to receive the `migrate_receiver_ready` message.

 Receiver: After libxl_domain_create_restore() returns, the receiver transmits
           the `migrate_receiver_ready` message and synchronously waits to
           receive the `migrate_permission_to_go` message.

 Sender: After `migrate_receiver_ready` is received, the sender renames the
         domain with the '--migratedaway' suffix, and transmits
         `migrate_permission_to_go`.

 Receiver: After `migrate_permission_to_go` is received, the receiver renames
           the newly-restored domain to strip its original '--incoming' suffix.
           It then attempts to unpause it, and reports the success or failure of
           this operation as the `migrate_report`.

           If all has gone well up to this point, the guest is now live and
           executing at the receiver.

 Sender: If the `migrate_report` indicates success, the '--migratedaway' domain
         is destroyed.

 If any of the steps in this sequence prior to the transmission of the
 `migrate_permission_to_go` message fails _or_ a positive report of failure
 from the receiver arrives, the receiver destroys its copy of the domain and
 the
 sender recovers by unpausing its (still-valid!) copy of the guest.  If,
 however, the sender transmits `migrate_permission_to_go` and a positive report
 of success from the receiver fails to arrive, the migration has fallen into the
 'failed badly' scenario where the sender cannot safely recover by resuming its
 local copy, because the receiver's copy _may_ be executing.  This is the most
 serious possible failure mode of the scheme described here.
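
 The recovery rule described here can be captured in a small predicate (a
 hypothetical helper, not code from xl):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The sender may only resume its local copy of the guest if it has not yet
 * granted the receiver permission to unpause, or if the receiver has
 * positively reported failure.  Once migrate_permission_to_go has been sent
 * and no report arrives, the receiver's copy may be executing, so local
 * resumption is unsafe.
 */
enum report { REPORT_NONE, REPORT_SUCCESS, REPORT_FAILURE };

bool sender_may_resume_locally(bool sent_permission_to_go, enum report r)
{
    if (!sent_permission_to_go)
        return true;                /* receiver never had a runnable copy */
    return r == REPORT_FAILURE;     /* receiver positively reported failure */
}
```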

2.2 Proposed design changes

 The proposed patch series [h] introduces support for a new post-copy phase in
 the live memory migration.  In doing so, it makes no architectural changes to
 the live migration feature: it is still implemented in the user-space
 toolstack, and the layering of components within the toolstack is preserved.
 The most substantial changes are made to the core live memory migration
 implementation in libxc, with a few supporting changes in the libxl migration
 stream and even fewer higher up in libxl/xl.

 To carry out the transition to the new post-copy phase:
  - At the end of the pre-copy phase of the memory migration, the sender now
    transmits only the _pfns_ of the final set of dirty pages where previously
    it transmitted their contents.
  - The receiving save helper process registers itself as a _pager_ for the
    domain being restored, and marks each of the pages in the set as 'paged
    out'.  This is the key mechanism by which post-copy's characteristic
    demand-faulting is implemented.
  - The sender next transmits the remaining guest execution context.  This
    includes the libxl context, requiring that control of the stream be
    _temporarily_ handed up from libxc back to libxl.  After all libxl context
    is transmitted, control of the stream is handed back to libxc.
  - The receiver installs this additional context exactly as before (requiring a
    symmetric temporary handoff to libxl on this side as well).
  - At this point, all state except the outstanding post-copy pages has been
    transmitted, and the guest is ready for resumption.  The receiving libxl
    process (the parent of the receiver migration helper) then initiates the
    resumption process described in 2.1.6.  This completes the transition to the
    post-copy phase.

 Once in the post-copy phase:
  - The sender iterates over the set of post-copy pages, transmitting them in
    batches.  Between batches, it checks if any pages have been specifically
    requested by the receiver, and prioritizes them for transmission.
  - The receiver, as the pager of a now-live guest, forwards faulting pages to
    the sender.  When batches arrive from the sender, they are installed via the
    page-in path.

 These loops terminate when all of the post-copy pages have been sent and
 received, respectively, after which all that remains is teardown (the paused
 image of the guest at the sender is destroyed, etc.).
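
 A sketch of the sender's page-selection rule in the post-copy phase
 (illustrative; the real implementation reads requests from the migration
 stream and tracks already-sent pages, deduplication omitted here for
 brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pages explicitly requested by the receiver's pager are drained first;
 * otherwise the next page in background order is pushed.  Returns the pfn
 * to send next, or UINT64_MAX when the post-copy set is exhausted.
 */
struct postcopy_sender {
    const uint64_t *background;   /* post-copy set, in background push order */
    size_t nr_background, bg_next;
    uint64_t requests[16];        /* demand-fault requests from the receiver */
    size_t nr_requests;
};

uint64_t next_pfn_to_send(struct postcopy_sender *s)
{
    if (s->nr_requests > 0)
        return s->requests[--s->nr_requests];   /* prioritise faulting pages */
    if (s->bg_next < s->nr_background)
        return s->background[s->bg_next++];     /* background push */
    return UINT64_MAX;
}
```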

 The rest of this section presents a more detailed description of the control
 flow of a live migration with a post-copy phase, focusing on the changes to
 each corresponding subsection of 2.1.

2.2.1 `xl migrate` <-> `xl migrate-receive`, Part One

 As before, live migration is initiated with the `xl migrate` command at the
 sending host.  A new '--postcopy' option is added to the command, which is used
 to compute the value of a new 'memory_strategy' parameter to
 libxl_domain_suspend() (or rather, to libxl_domain_live_migrate(), a new libxl
 API entrypoint like libxl_domain_suspend() but with additional parameters that
 are only meaningful in the context of live migration).  Two values for this
 parameter are possible:
  - STOP_AND_COPY specifies that upon termination of the pre-copy loop the
    migration should be terminated with a stop-and-copy migration of the final
    set of dirty pages
  - POSTCOPY specifies that upon termination of the pre-copy loop the migration
    should transition to the post-copy phase

 An additional boolean out-parameter, 'postcopy_transitioned', is also passed to
 libxl_domain_live_migrate().  This bit is set within libxl_domain_send() at the
 end of the post-copy transition (from the sender's point of view, this is after
 the libxl POSTCOPY_TRANSITION_END is sent), and is used by the caller to decide
 whether or not it's safe to attempt to resume the paused guest locally in the
 event of failure.

 A similar boolean out-parameter, 'postcopy_resumed', is now passed to
 libxl_domain_create_restore().  It records whether the domain was
 successfully unpaused at the end of the domain-building/resumption process
 during the post-copy phase, and is used by the caller to determine whether or
 not the unpause handshake should occur.

2.2.2 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part One

 This stage is mostly unchanged on both sides.  The new memory_strategy
 parameter of libxl_domain_live_migrate() is stashed in the libxl async request
 structure for later use by a new 'precopy policy' RPC callback for
 xc_domain_save(), described in section 2.2.4.

2.2.3 libxl__stream_write <-> libxl__stream_read, Part One

 This stage is entirely unchanged on both sides.

2.2.4 xc_domain_save() <-> xc_domain_restore(), Part One

2.2.4.1 Pre-copy policy

 The first major change to xc_domain_save() is the generalization of the
 'pre-copy policy', i.e. the algorithm used to decide how long the pre-copy
 phase of the migration should continue before transitioning forward.
 As described earlier, the historical policy has been to continue until either
 the ITERATION_THRESHOLD is exceeded or fewer than DIRTY_THRESHOLD pages remain
 at the end of a round, at which point an unconditional transition to
 stop-and-copy has occurred.

 The generalization of this policy is introduced early in the proposed patch
 series.  It factors the decision-making logic out of the mechanism of the
 migration loop and into a new save_callbacks function with the following
 prototype:

     struct precopy_stats
     {
         unsigned int iteration;
         unsigned int total_written;
         int dirty_count; /* -1 if unknown */
     };

     /* Policy decision return codes. */
     #define XGS_POLICY_ABORT          (-1)
     #define XGS_POLICY_CONTINUE_PRECOPY 0
     #define XGS_POLICY_STOP_AND_COPY    1
     #define XGS_POLICY_POSTCOPY         2

     int precopy_policy(struct precopy_stats stats, void *data);

 This new hook is invoked after each _batch_ of pre-copy pages is transmitted, a
 much finer granularity than the previous policy which was evaluated only at
 iteration boundaries.  This introduces a bit of extra complexity to the problem
 of computing the 'final' set of dirty pages: where previously it was sufficient
 to execute one final XEN_DOMCTL_SHADOW_OP_CLEAN after pausing the guest, now
 the true set of final dirty pages is the union of the results of the final
 CLEAN and the subset of the last CLEAN result set remaining in the interrupted
 final pre-copy iteration.  To solve this problem, pages are cleared from the
 dirty_bitmap as they are added to the current migration batch, meaning that the
 dirty_bitmap at the point of interruption is exactly the subset not yet
 migrated during the previous iteration.  These bits are temporarily transferred
 to the deferred_pages bitmap while the final CLEAN is executed, and then merged
 back into dirty_bitmap.
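
 The merge itself is just a bitwise union; a sketch under the assumption that
 both bitmaps share the same word layout:

```c
#include <assert.h>
#include <stddef.h>

/*
 * dirty_bitmap holds the final CLEAN result; deferred_pages holds the
 * unsent remainder of the interrupted pre-copy iteration.  Their union is
 * the true final set of dirty pages.
 */
void merge_final_dirty_bitmap(unsigned long *dirty_bitmap,
                              const unsigned long *deferred_pages,
                              size_t nr_words)
{
    size_t i;

    for (i = 0; i < nr_words; i++)
        dirty_bitmap[i] |= deferred_pages[i];
}
```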

 In making this change, my motivation was to permit two new sorts of policies:
  1) 'Continue until some budget of network bandwidth/wallclock time is
     exceeded, then transition to post-copy', which seemed like it would be
     useful to administrators wishing to take advantage of post-copy to set a
     hard bound on the amount of time or bandwidth allowed for a migration while
     still offering the best-effort liveness of post-copy.
  2) 'Continue until the live human operator decides to initiate post-copy',
     which was explicitly to match the equivalent QEMU post-copy feature [i].

 In retrospect, I should probably have focused more on making the post-copy
 mechanism work under the existing policy and left the broader issue for later
 discussion.  It's not really required, and the patches in their current state
 implement in this hook exactly the previous policy, simply returning
 POSTCOPY rather than STOP_AND_COPY at the old termination point based on the
 libxl_domain_live_migrate() memory_strategy (which itself simply reflects
 whether or not the user specified '--postcopy' to `xl migrate`).
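
 For illustration, a policy implementing exactly that behaviour against the
 quoted prototype might look like the following (the struct and return codes
 are repeated from above for self-containment; the MEMORY_STRATEGY_* names
 are placeholders, not identifiers from the patches):

```c
#include <assert.h>

struct precopy_stats
{
    unsigned int iteration;
    unsigned int total_written;
    int dirty_count; /* -1 if unknown */
};

#define XGS_POLICY_ABORT          (-1)
#define XGS_POLICY_CONTINUE_PRECOPY 0
#define XGS_POLICY_STOP_AND_COPY    1
#define XGS_POLICY_POSTCOPY         2

#define LIVE_DIRTY_THRESHOLD     50
#define LIVE_ITERATION_THRESHOLD  5

enum { MEMORY_STRATEGY_STOP_AND_COPY, MEMORY_STRATEGY_POSTCOPY };

/*
 * Reproduce the historical termination condition, but transition to
 * post-copy instead of stop-and-copy when the caller asked for it.  'data'
 * carries the memory_strategy.  A dirty_count of -1 (unknown, i.e. between
 * iteration boundaries) always continues pre-copying.
 */
int simple_precopy_policy(struct precopy_stats stats, void *data)
{
    int strategy = *(int *)data;

    if (stats.dirty_count >= 0 &&
        (stats.dirty_count < LIVE_DIRTY_THRESHOLD ||
         stats.iteration >= LIVE_ITERATION_THRESHOLD))
        return strategy == MEMORY_STRATEGY_POSTCOPY
                   ? XGS_POLICY_POSTCOPY
                   : XGS_POLICY_STOP_AND_COPY;

    return XGS_POLICY_CONTINUE_PRECOPY;
}
```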

2.2.4.2 Post-copy transition

 For the sender, the transition from the pre-copy to the post-copy phase begins
 by:
  1) Suspending the guest and collecting the final dirty bitmap, just as for
     stop-and-copy.
  2) Transmitting to the receiver a new POSTCOPY_BEGIN record, to prime them for
     subsequent records whose handling differs between stop-and-copy and
     post-copy.
  3) Transmitting the set of 'end-of-checkpoint' records (e.g. TSC_INFO,
     HVM_CONTEXT and HVM_PARAMS in the case of HVM domains)
  4) Transmitting a POSTCOPY_PFNS_BEGIN record, followed by a sequence of
     POSTCOPY_PFNS records enumerating the set of pfns to be post-copy
     migrated.  POSTCOPY_PFNS records are like PAGE_DATA records but without
     the trailing page contents; each batch can hold up to 512k 64-bit pfns
     while staying within the stream protocol's 4MB record size cap.
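
 The arithmetic behind the batch-size figure in step 4 can be checked
 directly (any fixed per-record header is ignored in this back-of-envelope
 sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 4MB record body divided into 64-bit pfns yields 512k pfns per record. */
#define REC_SIZE_CAP ((size_t)4 * 1024 * 1024)
#define PFNS_PER_REC (REC_SIZE_CAP / sizeof(uint64_t))
```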

 At this point, the only state needed to resume the guest that is not yet
 available at the receiver is the higher-level libxl context.  Control of the
 stream must therefore be handed back to the libxl stream writer.  This is done
 by:
  5) Writing a new POSTCOPY_TRANSITION record, to co-ordinate a symmetric
     hand-off at the receiver.
  6) Executing the synchronous postcopy_transition RPC, to which the libxl
     parent will reply when the libxl stream is finished.

 At the receiver (numbered to match corresponding sender steps):
  2) When the POSTCOPY_BEGIN record arrives, only the restore context 'postcopy'
     bit is set.
  3) The end-of-checkpoint records arrive next and are handled as in
     stop-and-copy, with one exception: when the HVM_PARAMS record arrives, the
     magic page parameters (HVM_PARAM_*_PFN) are explicitly populated, as they
     may not yet have been.  This is to ensure that the magic pages that must be
     cleared can be, and in the case of the PAGING_RING so that the immediately
     following pager setup succeeds.
  4) When the POSTCOPY_PFNS_BEGIN record arrives, the receiving helper enables
     paging on the migrating guest by establishing itself as its pager.  As the
     subsequent POSTCOPY_PFNS records arrive, each of the pages in the post-copy
     set is marked as 'paged out' (the paging component of the change is
     described in greater detail in section 2.2.6).
  6) When the POSTCOPY_TRANSITION record arrives, the synchronous receive-side
     postcopy_transition RPC is executed, transferring control of the stream
     back to the receiving libxl parent.

2.2.5 libxl__stream_write <-> libxl__stream_read, Part Two

 The postcopy_transition() RPC from the libxc save helper is plumbed to
 libxl__stream_write_start_postcopy_transition(), which records in the stream
 writer context that it's executing a new SWS_PHASE_POSTCOPY_TRANSITION callback
 chain and then kicks off exactly the same chain as before, starting with the
 emulator XenStore record.  At the end of the chain, a POSTCOPY_TRANSITION_END
 record is written, indicating to the receiver that control of the migration
 stream is to be transferred back to libxc in the helpers for the duration of
 the post-copy memory migration phase.  This transfer is then carried out by
 signalling the completion of the postcopy_transition() RPC to the libxc save
 helper.

 The postcopy_transition RPC from the libxc _receiver_ helper is plumbed to
 libxl__stream_read_start_postcopy_transition(), which is symmetric in spirit
 and implementation to its companion at the sender.

 At the end of the libxl post-copy transition, two concurrent stages of the
 migration begin: the libxc post-copy memory migration, and the receiver libxl
 domain resumption procedure.

2.2.6 xc_domain_save() <-> xc_domain_restore(), Part Two: memory post-copy

 This stage implements the key functionality of the post-copy memory migration:
 it permits the building, resumption and execution of the migrating guest at the
 receiver before all of its memory is migrated.  To achieve this, the receiver
 must intercept _all_ accesses to the unmigrated pages of the guest as they
 occur and fetch their contents from the sender before allowing them to proceed.
 Fortunately, the fundamental supporting mechanism - guest paging - already
 exists!  It's documented fairly lightly [k] and the description given there
 doesn't do much to inspire confidence in its stability, but it is completely
 sufficient in its current state for the purpose of intercepting accesses to
 unmigrated pages during the post-copy phase.

2.2.6.1 Background: xenpaging

 Paging for a given guest is managed by a 'pager', a process in a privileged
 foreign domain that a) identifies and evicts pages to be paged out and b)
 services faults for evicted pages.  To facilitate this, the hypervisor provides
 a) a family of paging operations under the `memory_op` hypercall and b) an
 event ring, into which it _produces_ events when paged pages are accessed
 (pausing the accessing vCPU at the same time) and from which it _consumes_
 pager responses indicating that accessed pages were loaded (and correspondingly
 unpausing the accessing vCPU).

 Evicting a page requires the pager to perform two operations:
  1) When a page is first selected for eviction by the pager's policy, it is
     marked with the `nominate` operation, which sets up the page to trap upon
     writes so that modifications during page-out can be detected.  The pager
     then maps the page and writes its contents to its backing store.
  2) After the page's contents are saved, the pager tries to complete the
     process with the `evict` operation.  If the page was not modified since
     its nomination, the eviction succeeds and its memory can be freed; if it
     was, the eviction fails.

 Re-installing a paged page is performed in a single `prep` operation, which
 atomically allocates and copies in the content of the paged page.
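
 The two-step eviction protocol above can be modelled with a small
 simulation.  The names and structure here are illustrative only, not the
 real paging hypercall interface:

```python
# Toy model of the nominate/evict protocol: 'nominate' arms write
# detection and the pager saves the contents; a guest write between
# nominate and evict invalidates the saved copy, so the evict fails.

class Page:
    def __init__(self, contents):
        self.contents = contents
        self.nominated = False
        self.dirtied = False

def nominate(page, backing_store, pfn):
    page.nominated = True
    page.dirtied = False
    backing_store[pfn] = page.contents  # pager maps and saves contents

def guest_write(page, new_contents):
    page.contents = new_contents
    if page.nominated:
        page.dirtied = True             # trapped write detected

def evict(page):
    if not page.nominated or page.dirtied:
        return False                    # eviction fails
    page.contents = None                # memory freed
    return True

store = {}
clean = Page(b'A')
nominate(clean, store, 0)
assert evict(clean)                     # unmodified page: eviction succeeds

dirty = Page(b'B')
nominate(dirty, store, 1)
guest_write(dirty, b'C')                # modified during page-out
assert not evict(dirty)                 # eviction fails
```

 In this toy model, the trapped write between nomination and eviction is
 exactly what forces the eviction to fail, mirroring the failure mode
 described above.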

 The protocol for the paging event ring consists of a request and response:
  1) The hypervisor emits a request into the event ring when it intercepts an
     access to a paged page.  There are two classes of accesses that can occur,
     though they are indistinguishable from the point of view of the ring
     protocol:
      a) accesses from within the guest, which result in the accessing vCPU
         being paused
      b) mappings of paged pages by foreign domains, which are made to fail
         (with the expectation that the mapper retry after some delay)
     In either case, the request made to the ring communicates the faulting pfn.
     In the former case, it also communicates the faulting vCPU.
  2) The pager consumes these requests, obtains the contents of the faulting
     pfns by its own unique means, and after performing the `prep` operation to
     install them, emits back into the ring a response containing exactly the
     information in the original request.

2.2.6.2 Post-copy paging

 Post-copy paging setup occurs when the POSTCOPY_PFNS_BEGIN record arrives.  The
 libxc restore helper begins by registering itself as the new guest's pager,
 enabling paging and setting up the event ring.

 As subsequent POSTCOPY_PFNS records arrive, the pfns they contain must all be
 marked as paged out at the hypervisor level.  Doing so naively can be
 prohibitively costly when the number of post-copy pages is large; this problem,
 and its solution, are described in 2.2.6.3.

 After control of the stream is returned at the end of the post-copy transition,
 the steady-state of the post-copy phase begins.  Crucially, this occurs even
 while the libxl 'building' of the guest proceeds.  This is important because
 domain building can and does access guest memory - in particular, QEMU maps
 guest memory.

 For the duration of the post-copy phase, the receiver maintains a simple
 state-machine for each post-copy pfn, described in this source-code comment at
 the declaration of its storage:

     /*
      * Prior to the receipt of the first POSTCOPY_PFNS record, all
      * pfns are 'invalid', meaning that we don't (yet) believe that
      * they need to be migrated as part of the postcopy phase.
      *
      * Pfns received in POSTCOPY_PFNS records become 'outstanding',
      * meaning that they must be migrated but haven't yet been
      * requested, received or dropped.
      *
      * A pfn transitions from outstanding to requested when we
      * receive a request for it on the paging ring and request it
      * from the sender, before having received it.  There is at
      * least one valid entry in pending_requests for each requested
      * pfn.
      *
      * A pfn transitions from either outstanding or requested to
      * ready when its contents are received.  Responses to all
      * previous pager requests for this pfn are pushed at this time,
      * and subsequent pager requests for this pfn can be responded
      * to immediately.
      *
      * A pfn transitions from outstanding to dropped if we're
      * notified on the ring of the drop.  We track this explicitly
      * so that we don't panic upon subsequently receiving the
      * contents of this page from the sender.
      *
      * In summary, the per-pfn postcopy state machine is:
      *
      * invalid -> outstanding -> requested -> ready
      *                |                        ^
      *                +------------------------+
      *                |
      *                +---------> dropped
      *
      * The state of each pfn is tracked using these four bitmaps.
      */
     unsigned long *outstanding_pfns;
     unsigned long *requested_pfns;
     unsigned long *ready_pfns;
     unsigned long *dropped_pfns;

 A given pfn's state is defined by the set it's in (set memberships are mutually
 exclusive).
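
 The same state machine can be sketched as mutually exclusive sets (a Python
 model of the transition logic; the real implementation uses the four C
 bitmaps quoted above):

```python
# Model of the per-pfn post-copy state machine.  A pfn is in at most
# one set at a time; absence from all four means 'invalid'.

class PostcopyPfns:
    def __init__(self):
        self.outstanding = set()
        self.requested = set()
        self.ready = set()
        self.dropped = set()

    def state(self, pfn):
        for name in ('outstanding', 'requested', 'ready', 'dropped'):
            if pfn in getattr(self, name):
                return name
        return 'invalid'

    def on_postcopy_pfns_record(self, pfns):
        self.outstanding |= set(pfns)        # invalid -> outstanding

    def on_pager_request(self, pfn):
        if pfn in self.outstanding:          # outstanding -> requested
            self.outstanding.discard(pfn)
            self.requested.add(pfn)
            return 'forward-to-sender'
        return 'reply-now' if pfn in self.ready else 'ignore'

    def on_page_data(self, pfn):
        must_notify = pfn in self.requested  # a paused vCPU is waiting
        self.outstanding.discard(pfn)        # outstanding/requested -> ready
        self.requested.discard(pfn)
        self.ready.add(pfn)
        return must_notify

    def on_drop(self, pfn):
        self.outstanding.discard(pfn)        # outstanding -> dropped
        self.dropped.add(pfn)

pfns = PostcopyPfns()
pfns.on_postcopy_pfns_record([1, 2, 3])
assert pfns.state(1) == 'outstanding'
assert pfns.on_pager_request(1) == 'forward-to-sender'
assert pfns.on_page_data(1)                  # requested -> ready: notify ring
assert pfns.state(1) == 'ready'
assert not pfns.on_page_data(2)              # background page: no notify
```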

 The receiver's post-copy loop can be expressed in pseudo-code as:

     outstanding_pfns = { final dirty pages }
     requested_pfns = { }
     ready_pfns = { }
     while (outstanding_pfns is not empty) {
         /*
          * Wait for a notification on the paging ring event channel, or for
          * data to arrive on the migration stream.
          */
         wait_for_events();

         /*
          * Consume any new faults generated by the guest and forward them to
          * the sender so that their transmission is prioritized.
          */
         faults = {}
         while (!empty(paging_ring)) {
             fault = take(paging_ring)
             if (fault in ready_pfns) {
                 /*
                  * It's possible that the faulting page may have arrived and
                  * been loaded after the fault occurred but before we got
                  * around to consuming the event from the ring.  In this case,
                  * reply immediately.
                  */
                 notify(paging_ring)
             } else {
                 faults += fault
             }
         }

         outstanding_pfns -= faults
         requested_pfns += faults
         send(faults)

         /*
          * Consume incoming page data records by installing their contents into
          * the guest.  If a guest vCPU is paused waiting for the arrival of a
          * given page, unpause it now that it's safe to continue.
          */
         while (record = read_record(migration_stream)) {
             paging_load(record.pfn, record.data)
             if (record.pfn in requested_pfns) {
                 notify(paging_ring)
             }

             requested_pfns -= record.pfn
             ready_pfns += record.pfn
         }
     }

 The sender's companion post-copy loop is simpler:

     remaining_pages = { final dirty pages }
     while (remaining_pages) {
         transmission_batch = {}

         /* Service new faults. */
         faults = recv_faults()
         if (faults) {
             transmission_batch += faults
         }

         /* Fill out the rest of the batch with background pages. */
         remainder = take(remaining_pages, BATCH_SIZE - count(faults))
         transmission_batch += remainder
         remaining_pages -= remainder

         send(transmission_batch)
     }

 One interesting problem is deciding which not-yet-requested pages should be
 pushed in the next background batch.  Ideally, they should be sent in the order
 they'll be accessed, to minimize the faults at the receiver.  In practice,
 the general problem of predicting the guest's page access stream is _extremely_
 difficult - this is the well-known pre-paging problem, which has been explored
 by decades of academic and industrial research.  The current version of the
 patch series exploits spatial locality in the physical page access stream by
 starting at the next unsent pfn after the last faulting pfn and proceeding
 forward.
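
 The current heuristic might be sketched as follows; the wrap-around
 behaviour and exact scan bounds here are illustrative assumptions, not a
 transcription of the patch:

```python
# Sketch of the background pre-paging heuristic: fill each batch with
# unsent pfns, scanning forward from just after the last faulting pfn
# (wrapping around the guest's pfn space in this model).

def next_background_batch(remaining, last_fault_pfn, max_pfn, batch_size):
    """remaining: set of unsent pfns; returns the next batch to push."""
    batch = []
    pfn = (last_fault_pfn + 1) % (max_pfn + 1)
    scanned = 0
    while len(batch) < batch_size and scanned <= max_pfn:
        if pfn in remaining:
            batch.append(pfn)
        pfn = (pfn + 1) % (max_pfn + 1)
        scanned += 1
    return batch

remaining = {0, 2, 5, 6, 9}
batch = next_background_batch(remaining, last_fault_pfn=4, max_pfn=9,
                              batch_size=3)
assert batch == [5, 6, 9]   # forward scan starting at pfn 5
```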

 The sender terminates the libxc stream with a POSTCOPY_COMPLETE record so that
 the receiver can flush (i.e. consume) all in-flight POSTCOPY_PAGE_DATA records
 before control of the stream is handed back to libxl on both sides.

2.2.6.3 Batch page-out operations

 When a batch of POSTCOPY_PFNS arrives during the post-copy transition, all of
 the pfns in the batch must be marked paged-out as quickly as possible to
 minimize the downtime required.  Doing so using the existing paging primitives
 requires:
  1) 'Populating' (i.e. allocating a backing physical page for) any pages in the
     batch that aren't already, because only populated pages can be 'paged out'.
     In a migration that transitions to post-copy before the end of the first
     iteration, any pages not sent during the partial first round will be
     unpopulated during the post-copy transition.  In the special case of an
     instant post-copy migration, this will be _all_ of the guest's pages.
  2) Performing the `nominate` and `evict` operations individually on each page
     in turn, because only nominated pages can be evicted.

 There are a few obvious inefficiencies here:
  - Populating the pages is unnecessary.
  - The `nominate` operation is entirely unnecessary when the page's contents
    are already available from the pager's backing store and can't be
    invalidated by modification by the guest or other foreign domains.
  - The `evict` operation acts on a single page at a time even when many pages
    are known to need eviction up front.

 Together, these inefficiencies make the combined operation of 'evicting' many
 pfns at a time during the critical post-copy transition downtime phase quite
 costly.  Quantitatively, in my experimental set-up (described in detail in
 section 3.2.1) I measured the time required to evict a batch of 512k pfns at
 8.535s, which is _enormous_ for outright downtime.

 To solve this problem, the last patches in the series introduce a new memory
 paging op designed to address specifically this situation, called
 `populate_evicted`.  This operation takes a _batch_ of pfns and, for each one:
  - de-populates it if populated
  - transitions it directly to the paged-out state, skipping the nomination step

 With a further patch rewriting the POSTCOPY_PFNS handler to use this new
 primitive, I measured a 512k-pfn eviction time of 1.590s, a 5.4x improvement.

2.2.7 libxl_domain_live_migrate() <-> libxl_domain_create_restore(), Part Two

 The sender side of this stage is unchanged: libxl_domain_live_migrate() returns
 almost immediately back up to `xl migrate`.

 The receiver side becomes more complicated, however, as it now has to
 manage the completion of two asynchronous operations:
  1) its libxc helper, which terminates when all of the outstanding post-copy
     pages have arrived and been installed
  2) the domain-building/resumption process detailed in section 2.1.6 (the
     functional sequence is unchanged - the fact that some guest memory remains
     unmigrated is made completely transparent by the libxc pager)

 These operations (referred to in code as the 'stream' and 'resume' operations,
 respectively) can complete in either order, and either can fail.  A pair of
 state machines encoding the progress of each are therefore introduced to the
 libxl domain create state context structure, with three possible states each:

     INPROGRESS --+--> FAILED (rc)
                  |
                  +--> SUCCESS (rc == 0)

 In a healthy migration, the first operation to make the INPROGRESS -> SUCCESS
 transition simply records its final state as such and waits for the completion
 of the second.  The second operation to complete then finds the other already
 complete, and calls the overall completion callback to report success.  For
 example, in the case of a long memory post-copy phase, the resume operation is
 expected to complete first.  When it does, it finds that the stream operation
 is still running, so it simply transitions to SUCCESS.  When the post-copy
 migration is finished and the libxc helper terminates, the new
 domcreate_postcopy_stream_done() callback finds the resume successfully
 completed and reports the completion of the entire operation.
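
 The completion coordination can be modelled as follows.  This is a
 simplified sketch: the active stream abort on resume failure is omitted,
 and all names are illustrative rather than taken from the patches:

```python
# Model of the concurrent 'stream' and 'resume' completion logic:
# whichever operation finishes second reports the overall result.

INPROGRESS, SUCCESS, FAILED = 'inprogress', 'success', 'failed'

class PostcopyCompletion:
    def __init__(self, report_cb):
        self.states = {'stream': INPROGRESS, 'resume': INPROGRESS}
        self.rcs = {'stream': 0, 'resume': 0}
        self.report_cb = report_cb

    def complete(self, op, rc):
        self.states[op] = SUCCESS if rc == 0 else FAILED
        self.rcs[op] = rc
        other = 'resume' if op == 'stream' else 'stream'
        if self.states[other] != INPROGRESS:
            # Second to finish: report the first error seen, or success.
            self.report_cb(self.rcs['resume'] or self.rcs['stream'])

results = []
c = PostcopyCompletion(results.append)
c.complete('resume', 0)     # resume finishes first: just records its state
assert results == []
c.complete('stream', 0)     # stream finishes second: reports overall success
assert results == [0]
```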

 The 'resume' operation is initiated from the
 domcreate_postcopy_transition_callback(), kicking off the same callback
 sequence as started by domcreate_stream_done() in the non-post-copy case.  All
 termination points along this callback sequence are hooked by the new
 domcreate_report_result(), which when given a successful result to report also
 unpauses the guest to begin true post-copy execution.  If the resume fails and
 the stream isn't yet complete, we latch the error, actively abort the stream
 and then wait for the stream's (failed) completion before completing the
 overall operation.

 The 'stream' operation's completion is signalled by
 domcreate_postcopy_stream_done(), which is wired up to the libxc helper's
 SIGCHLD in the way that domcreate_stream_done() was previously.  If the stream
 fails, its error is simply stashed (and no other action taken) on the
 assumption that the resumption will eventually complete one way or another and
 find it.

2.2.8 `xl migrate` <-> `xl migrate-receive`, Part Two

 If everything has gone according to plan up to this point, the migration is
 effectively complete - the guest is now unpaused and executing at the receiver
 with all of its memory migrated.  The cautious final unpause handshake is
 therefore no longer necessary, so the sender simply destroys its copy of the
 domain, the receiver strips the migration suffix from the name of its copy and
 the entire process is complete!  The receiver does still send a completion
 message to the sender, however, simply to signal to an interactive user at the
 sender exactly when the operation has completed.

 If, however, something has gone awry, the penalty of post-copy's extra
 vulnerability to failure is paid.  At the receiver, any failure reported by
 libxl_domain_create_restore() results in the prompt destruction of the local
 copy of the domain, even if it's already executing as part of the post-copy
 phase and some of the guest's state exists only in this copy, because no sane
 recovery mode exists with other parts of its state locally unavailable.  At the
 sender, the new postcopy_transitioned out-parameter of
 libxl_domain_live_migrate() is examined:
  - if the transition record wasn't transmitted (postcopy_transitioned ==
    false), there's no way that the guest could possibly have begun executing
    at the receiver, so it's safe to recover by unpausing the original copy of
    the domain
  - if it was, however (postcopy_transitioned == true), it's _possible_ (though
    not certain) that the guest may have executed (or may even still _be_
    executing) at the destination, so unpausing it locally isn't safe

 In the latter case, the policy is essentially the same as in the existing
 'failed_badly' scenario of normal pre-copy migration in which the sender fails
 to receive the migration success report after transmitting the
 `migrate_permission_to_go` message: the local copy of the domain is suffixed
 with --postcopy-inconsistent, and a diagnostic message is printed explaining
 the failure.

 One major possible improvement to this scheme is especially worth noting: at
 the sender, the current postcopy_transitioned bit is a very conservative
 indication of whether it's safe to attempt local recovery.  There are many ways
 in which the attempted resumption of the domain at the receiver could fail
 without rendering the communication stream between the sender and receiver
 inoperable (e.g. one of the domain's disks might be unavailable at the
 receiver), and in these scenarios the receiver could send a message explicitly
 indicating to the sender that it should attempt recovery.

3. Performance evaluation

3.1 Prior work and metrics

 "Post-copy live migration of virtual machines" [u] describes an earlier
 implementation of post-copy live migration in Xen and evaluates its
 performance.  Although the details of their implementation - written nearly a
 decade ago and requiring in-guest kernel support - are vastly different from
 what's proposed here, their metrics and approach to performance evaluation are
 still useful.  Section 3 of the paper enumerates the following performance
 metrics:

  >  1. Preparation Time: This is the time between initiating migration and
  >     transferring the VM’s processor state to the target node, during which
  >     the VM continues to execute and dirty its memory. For pre-copy, this
  >     time includes the entire iterative memory copying phase, whereas it is
  >     negligible for post-copy.
  >  2. Downtime: This is time during which the migrating VM’s execution is
  >     stopped. At the minimum this includes the transfer of processor state.
  >     For pre-copy, this transfer also includes any remaining dirty pages.
  >     For post-copy this includes other minimal execution state, if any,
  >     needed by the VM to start at the target.
  >  3. Resume Time: This is the time between resuming the VM’s execution at
  >     the target and the end of migration altogether, at which point all
  >     dependencies on the source must be eliminated. For pre-copy, one needs
  >     only to re-schedule the target VM and destroy the source copy. On the
  >     other hand, majority of our postcopy approach operates in this period.
  >  4. Pages Transferred: This is the total count of memory pages transferred,
  >     including duplicates, across all of the above time periods. Pre-copy
  >     transfers most of its pages during preparation time, whereas post-copy
  >     transfers most during resume time.
  >  5. Total Migration Time: This is the sum of all the above times from start
  >     to finish. Total time is important because it affects the release of
  >     resources on both participating nodes as well as within the VMs on both
  >     nodes. Until the completion of migration, we cannot free the source
  >     VM’s memory.
  >  6. Application Degradation: This is the extent to which migration slows
  >     down the applications running in the VM. Pre-copy must track dirtied
  >     pages by trapping write accesses to each page, which significantly
  >     slows down write-intensive workloads. Similarly, postcopy needs to
  >     service network faults generated at the target, which also slows down
  >     VM workloads.

 The performance of a memory migration algorithm with respect to these metrics
 will vary significantly with the workload running in the guest.  More
 specifically, it will vary according to the behaviour of the memory access
 stream - the pace, read/write mix, and locality of accesses (within pages and
 between pages).  See "Downtime Analysis of Virtual Machine Live Migration" [v]
 for a quantitative investigation of the effect of these workload parameters on
 pre-copy migration in Xen and other hypervisors.

 The relative importance of these metrics obviously varies with deployment
 context, but in my opinion the most common ordering is likely:
  - Downtime
  - Application Degradation
  - Preparation Time
  - Total Migration Time
  - Resume Time
  - Pages Transferred

 Stop-and-copy and pure post-copy schemes, which transmit each guest page
 exactly once, will obviously outperform pre-copy at Preparation Time, Pages
 Transferred and Total Migration Time, but because of this practical preference
 ordering it's post-copy's potential to reduce Downtime by trading it for
 Application Degradation that makes it the most interesting.  Because
 write-heavy workloads with large writable working sets experience the greatest
 downtime under pre-copy, I decided to investigate them first.

3.2 Experiment: pgbench

 When selecting a particular application workload to represent the class of
 pre-copy-resistant workloads with large writable working sets, I looked for a
 few other desirable properties:
  1) It should involve some amount of I/O, which could help the guest make
     progress even during synchronous page faults (as the I/O could proceed in
     parallel to the servicing of a subsequent fault).
  2) It should be possible to sample instantaneous application performance for
     the Application Degradation metric, and to perform such sampling reasonably
     frequently over the course of the migration.
  3) It should be reasonably representative of an interesting real-world
     application, to avoid being confounded by differences in behaviour between
     purely-synthetic workloads and the ones we're actually interested in.  For
     example, the 'dirty page generators' commonly found in the live migration
     literature aren't very useful for evaluating any mechanism that attempts
     pre-paging based on the memory access stream because their memory access
     streams are generally nothing like real applications (often being either
     perfectly sequential or perfectly random).

 With these properties and the 'large writable working set' criterion in mind, I
 eventually decided upon the pgbench [x] benchmark for PostgreSQL:
  > pgbench is a simple program for running benchmark tests on PostgreSQL. It
  > runs the same sequence of SQL commands over and over, possibly in multiple
  > concurrent database sessions, and then calculates the average transaction
  > rate (transactions per second). By default, pgbench tests a scenario that is
  > loosely based on TPC-B, involving five SELECT, UPDATE, and INSERT commands
  > per transaction.
  <snip>
  > The default built-in transaction script (also invoked with -b tpcb-like)
  > issues seven commands per transaction over randomly chosen aid, tid, bid and
  > balance.  The scenario is inspired by the TPC-B benchmark, but is not
  > actually TPC-B, hence the name.
  > 1. BEGIN;
  > 2. UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
  > 3. SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
  > 4. UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
  > 5. UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
  > 6. INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
  >        VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
  > 7. END;

 I evaluated the performance of five live migration algorithm variants:
  A) traditional five-iteration pre-copy (the status-quo today)
  B) single-iteration pre-copy followed by stop-and-copy
  C) direct post-copy
  D) single-iteration pre-copy followed by post-copy (often called 'hybrid
     migration')
  E) five-iteration pre-copy followed by post-copy

3.2.1 Experiment design

 The physical test-bed was composed of:
  - Two Intel NUC5CPYH [z] mini PCs, each with 8GB of RAM and a 120GB SSD (this
    was the cheapest Intel hardware with EPT support I could easily obtain two
    identical units of)
  - a Cisco Meraki MS220-8P [l] gigabit ethernet switch
  - my personal laptop computer

 One NUC PC was chosen to be the sender (S), and the other the receiver (R).
 The test-bed configuration was:

 S - Switch - R
       |
     Laptop

 I.e. each host had a gigabit link to all the others.  See [m] and [n] for the
 full output of `xl info` on S and R; the subsets that seem relevant to me
 are:

    S:
        release                : 3.16.0-4-amd64
        version                : #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19)
        machine                : x86_64
        nr_cpus                : 2
        max_cpu_id             : 1
        nr_nodes               : 1
        cores_per_socket       : 2
        threads_per_core       : 1
        cpu_mhz                : 1599
        virt_caps              : hvm
        total_memory           : 8112
        xen_version            : 4.9-rc
        xen_scheduler          : credit
        xen_pagesize           : 4096
        xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
        xen_commandline        : placeholder altp2m=1
        cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
        cc_compile_by          : jtotto
        cc_compile_date        : Sat May 27 18:29:17 EDT 2017
    R:
        release                : 3.16.0-4-amd64
        version                : #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07)
        machine                : x86_64
        nr_cpus                : 2
        max_cpu_id             : 1
        nr_nodes               : 1
        cores_per_socket       : 2
        threads_per_core       : 1
        cpu_mhz                : 1599
        virt_caps              : hvm
        total_memory           : 8112
        xen_version            : 4.9-rc
        xen_scheduler          : credit
        xen_pagesize           : 4096
        xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
        xen_commandline        : placeholder no-real-mode edd=off
        cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
        cc_compile_by          : jtotto
        cc_compile_date        : Sat May 27 18:29:17 EDT 2017

 Particularly noteworthy as a potential experimental confound is the relatively
 old dom0 kernel - I couldn't get anything newer to boot on the NUC hardware,
 unfortunately.  I'm aware that the privcmd driver has changed since then, and
 experimented with back-porting a more recent version of the kernel module to
 the base kernel I was experimenting with when evaluating the performance of the
 batch page-out operation, but it didn't appear to make any difference.

 To evaluate each algorithm variant, I migrated a guest running a PostgreSQL
 server from S to R while running the pgbench client against it from my
 laptop.  My laptop
 was also configured as an NFS server and hosted the guest's storage.

 The xl.cfg(5) configuration of the test domain was:
     builder='hvm'
     vcpus=4
     memory=2048
     shadow_memory=512
     name='debvm'
     disk=['file:/mnt/mig/debvm-1.img,xvda,w']
     boot="c"
     vif=['bridge=xenbr0']

     sdl=0
     stdvga=1
     serial='pty'
     usbdevice='tablet'
     on_poweroff='destroy'
     on_reboot='restart'
     on_crash='restart'
     vnc=1
     vnclisten=""
     vncpasswd=""
     vfb=['type=vnc']

     altp2m="external"

 The exact experiment shell script can be found inline at [o].  The procedure,
 executed from my laptop, was basically:

 repeat 5 times:
     for each algorithm variant:
         `xl create` the test domain at S and wait for it to boot
         re-initialize the test database
         launch the pgbench client
         wait for 20 seconds to let the benchmark warm up
         initiate migration of the test domain to R
         shut down the test domain from within

 To measure Preparation Time, Downtime, Resume Time and Total Migration Time, I
 added some simple timestamp printf()s at key points in the migration sequence:

 Preparation Time (recorded at the sender)
    Start: Upon entry to save()
    End:   In suspend_domain() immediately _before_ the libxl suspension hook

 Downtime (recorded at the receiver)
    For pre-copy migrations (variants A/B):
        Start: Since the end of the downtime period can only be recorded at the
               receiver, a way to record the beginning of the period was needed.
               A new dummy record type, PERF_STOP_AND_COPY, was added for this
               purpose, which is emitted by the sender immediately after the
               suspension of the domain.  The receiver records the time at which
               this record is received as the beginning of the period.
        End:   In `xl migrate-receive` immediately after libxl_domain_unpause()

    For post-copy and hybrid migrations (variants C/D/E):
        Start: Upon receipt of the existing POSTCOPY_BEGIN record
        End:   In domcreate_report_result() immediately after
               libxl_domain_unpause()

 Resume Time
    Start: Exactly where Downtime ends, after libxl_domain_unpause()
    End:   In postcopy_restore() after all pages have been loaded

 All timestamps were recorded via `clock_gettime(CLOCK_MONOTONIC)`.  The
 additional patches implementing this tracing can be found at [p].
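
 The measurement approach amounts to differencing monotonic timestamps taken
 at the phase boundaries listed above; a minimal sketch (the real patches do
 this in C via clock_gettime(CLOCK_MONOTONIC)):

```python
import time

# Minimal sketch of phase timing: take a monotonic timestamp at each
# boundary and subtract.  Monotonic clocks are immune to wall-clock
# adjustments, which matters for multi-second phase measurements.
phase_start = time.monotonic()   # e.g. upon entry to save()
# ... the phase under measurement would run here ...
phase_end = time.monotonic()     # e.g. immediately before suspension
preparation_time = phase_end - phase_start
assert preparation_time >= 0.0
```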

 To measure instantaneous Application Degradation, I ran pgbench in its `--log
 --aggregate-interval` mode with an interval of 1 second, thus each second
 sampling:
  - the total number of transactions committed in that second
  - the sum of the latencies of these transactions (with which the mean latency
    can be computed)
  - the minimum and maximum latencies of these transactions

 Of these, I think that 'transactions/second' is the easiest to interpret, and
 is what I'll use throughout my analysis.

3.2.2 Results

 A plot of the raw phase duration measurements for each run:

 Figure 1:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure1.pdf

 Given that the results from run to run for each algorithm variant were
 relatively stable, they can more easily be considered in aggregate via their
 arithmetic means:

 Figure 2:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure2.pdf

 For fun, I re-rendered the above plot using gnuplot's 'dumb' terminal driver:

        Migration algorithm variant vs. average phase and total durations

  90 +-----------+-----------+-----------+----------+-----------+-----------+
     |                                                     Preparing *      |
     |                                                          Down #      |
     |                                                      Resuming $      |
     |       ########                                                       |
     |       ********                                                       |
  80 +       *      *                                        ########       +
     |       *      *                                        ********       |
     |       *      *                            $$$$$$$$    *      *       |
     |       *      *    ########                $      $    *      *       |
     |       *      *    #      #                $      $    *      *       |
     |       *      *    #      #                $      $    *      *       |
  70 +       *      *    #      #    $$$$$$$$    $      $    *      *       +
     |       *      *    #      #    $      $    $      $    *      *       |
     |       *      *    #      #    $      $    ########    *      *       |
     |       *      *    #      #    $      $    #      #    *      *       |
     |       *      *    ********    $      $    ********    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  60 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  50 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  40 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  30 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  20 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
  10 +       *      *    *      *    $      $    *      *    *      *       +
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    $      $    *      *    *      *       |
     |       *      *    *      *    ########    *      *    *      *       |
     |       *      *    *      *    #      #    *      *    *      *       |
     |       *      *    *      *    #      #    *      *    *      *       |
   0 +-------********----********----********----********----********-------+
                 A           B           C          D           E
                                Algorithm variant

 --
 N.B. Before getting into comparisons between the algorithm variants, these
 results reveal an interesting property of the test-setup common to all of them:
 the network is _not_ the bottleneck in the migration page transfer.  iperf
 between S and R measured the bandwidth as roughly the gigabit limit advertised
 by the switch and NICs, but the effective 'bandwidth' actually observed during
 the pre-copy iterations can be computed as (524357 pages / 63.47s) ~= 8261
 pages/s, or ~271Mbps.  I didn't collect timings of the sender batch mapping or
 receive batch installation routines, but I suspect it must be one of these two
 that's the limiting factor in this set-up.  Interpret all of these timing
 measurements accordingly.
 --
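 The arithmetic behind that effective-bandwidth figure, assuming 4 KiB pages:

```c
/* Effective migration bandwidth implied by a page count and a transfer
 * duration, assuming 4 KiB pages, in megabits per second. */
static double effective_mbps(double pages, double seconds)
{
    return (pages / seconds)   /* pages/s            */
           * 4096.0            /* -> bytes/s         */
           * 8.0               /* -> bits/s          */
           / 1e6;              /* -> megabits/s      */
}
```

 With the measurements above, effective_mbps(524357, 63.47) comes out at
 ~271 Mbps - well under the ~940 Mbps that iperf measured on the same link.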

3.2.2.1 Algorithms A vs. E: stop-and-copy vs. post-copy after iterative pre-copy

 I'm going to focus first on the two multi-iteration pre-copy variants, A and E
 (the current five-iteration pre-copy algorithm and five-iteration pre-copy +
 post-copy, respectively).  We can see that they:
  a) Require the longest preparation time, as expected.
  b) Actually still achieve the best downtime, despite my expectation that the
     workload would be write-heavy enough to cause problems.  A's downtime is
     only slightly worse than D's, and is almost twice as good as C's!  More on
     this shortly.

 Most interestingly, we can see that E required 30% less downtime than A on
 average, and always completed its post-copy phase _before_ the guest unpaused
 (recall that the post-copy phase proceeds in parallel with libxl domain
 creation and can complete before the guest is ready to unpause).  Why?

 Focusing first on Algorithm A, we can plot the number of pages transmitted
 during each pre-copy iteration:
 
 Figure 3:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure3.pdf

 We can see that the migration made substantial progress toward convergence,
 with strong successive decreases between iterations and an average final dirty
 set of ~7.5k pages (30.1 MiB).  The downtime period during which these pages
 are transferred can be divided into two interesting sub-phases: the phase
 during which the memory migration and libxc stream are completed (ending with
 the termination of the libxc helper), and the phase during which the higher
 level libxl domain establishment occurs (ending with the unpausing of the
 domain).  Plotting the measurements of these phases:

 Figure 4:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure4.pdf

 This shows that around 2/3 of the total downtime is incurred _after_ the memory
 migration is complete.

 We can make similar plots for Algorithm E.  Here are the number of pages
 transmitted during each iteration:

 Figure 5:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure5.pdf

 We can see that the E-runs transmitted slightly fewer pages on average
 during the pre-copy iterations than the A-runs.  This is presumably
 experimental noise - there's no reason to expect them to be different.
 Significantly, however, the E-runs actually needed to post-copy slightly _more_
 pages on average than the A-runs needed to stop-and-copy.  This means the 30%
 downtime reduction isn't just experimental noise in favour of E, but evidence
 of an algorithmic advantage.

 I think the explanation for this advantage is clear: because the post-copy
 phase can proceed in parallel with the libxl domain set-up procedure, the
 downtime duration is reduced to the _minimum_ of the durations of these
 processes, rather than their sum.

 This effect can be seen clearly in the corresponding plot for the libxc/libxl
 sub-phase breakdown:

 Figure 6:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure6.pdf

 As further evidence in support of this explanation, on average only 9 faults
 were incurred during each brief post-copy phase, indicating that the pages
 required by QEMU/etc. to proceed with domain creation mostly weren't in the
 working set.

 Another important question is: what impact do the two approaches have on the
 Application Degradation metric?  This plot shows the number of benchmark
 transactions committed each second over the course of each of the Algorithm A
 migrations:

 Figure 7:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure7.pdf

 The dotted line at the 20-second mark indicates roughly where the migration was
 initiated.  We can see that:
  - Throughout each phase of the migration, there are occasional severe
    single-second degradations in tps.  I'm not sure what caused these, but they
    appear to be distributed roughly evenly and occur during every phase, so I
    think they can safely be ignored for the rest of the discussion.
  - During the 20-second warmup period at host S, the benchmark measures a
    relatively consistent ~325tps.
  - Once the migration starts, benchmark performance quickly degrades to roughly
    200tps.  This clearly indicates an interaction of some kind between the
    application and the migration process, though which resource they're
    contending for isn't clear.  CPU or network seem like the most likely
    candidates.
  - At roughly the 80-second mark, performance 'degrades' to 0tps as the domain
    is suspended for the stop-and-copy phase of the migration.  This is where
    things get interesting: although the internal measurements from the previous
    set of plots indicate that the guest is only truly paused for around 2.5s,
    from a network observer's point of view the actual application is completely
    unavailable for around 9s.  Not great.
  - When the application recovers and begins to make progress, it rebounds to
    only 275tps rather than 325, indicating some kind of asymmetry between S and
    R that I can't entirely account for.

 How do these measurements look for Algorithm E?  Here's the plot:

 Figure 8:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure8.pdf

 The behaviour appears to be much the same, with a slight improvement in average
 application-visible downtime because of particularly good results in rounds 2
 and 3 with 4s and 5s application downtime measurements, respectively.

3.2.2.2 Algorithm C: pure post-copy

 Turning our attention next to pure post-copy, we can make a few interesting
 observations directly from the Figure 2 phase-timing measurements:
  - At 4.9s, C's average Downtime is almost twice that of A!  For an algorithm
    intended to trade outright Downtime for Application Degradation, that's not
    great.
  - At 69.2s, C's Total Migration Time is the lowest of any of the algorithms
    and ~17% lower than A's.
  - C's post-copy phase takes roughly as long as a single pre-copy iteration.

 What's ballooning C's downtime?  Figure 9 breaks down the sub-phases:

 Figure 9:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure9.pdf

 There are three major contributors to the Downtime here:
  1) It takes ~0.7s for the sender to gather and transmit the set of all
     post-copy pfns.  This makes some sense, as the sender needs to check the
     _type_ of every pfn in the guest's addressable range to see which ones are
     unpopulated and which must actually be migrated.
  2) It takes ~1.7s to populate-and-evict all of these pages, even with the
     batching hypercall introduced at the end of the patch series.
  3) It takes ~2.4s to complete all of the libxl-level domain set-up after the
     libxc stream helper is ready to enter the post-copy phase.  I think this is
     really interesting - recall that this step took only ~1.8s for algorithm A.
     The explanation for this is that the device model (i.e. QEMU) needs to map
     guest pages immediately during its set-up, and these mappings immediately
     cause post-copy faults!  Algorithm E didn't encounter this because these
     pages apparently aren't in the working set and so were covered by earlier
     pre-copy iterations.

 What application degradation do we observe during the post-copy phase?

 Figure 10:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure10.pdf

 Ouch!  The actual application-observed downtime is frankly horrific.  Round 3
 has what appears to be the closest thing to degraded execution at around the 40
 second mark, but only briefly.  In general, the application suffers around _50_
 seconds of observable downtime before recovering to a state of reasonable (but
 still degraded) performance.  This recovery occurs at the ~75 second mark,
 where performance recovers to ~175tps.  The migration finishes and the
 application recovers to ~275tps at roughly 90 seconds.

 To get a clearer picture of why the migration behaves this way, we can plot the
 faulting behaviour of the resuming guest during the post-copy phase.

 Figure 11:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure11.pdf
 Figure 12:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure12.pdf
 Figure 13:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure13.pdf
 
 Looking first at the fault counts, we can see that the first ~5-10 seconds are
 relatively quiet.  Then, there's a sudden burst to ~100-200 faults/s for a few
 seconds, followed by a decline to a stable rate of ~50 faults/s until the 50
 second mark where they essentially stop.  Latencies are high during the initial
 period, averaging 50-100ms per fault, and decline to 10-20ms per fault during
 the later steady-state.

 I'm not really sure how to account for the burst around the 10 second mark, or
 the comparatively lower steady-state rate.  Because it occurs so long after the
 first observed fault I don't think it can be bulk device model mappings - the
 guest is already unpaused at this point.  But, if the vCPUs were capable of
 generating faults this rapidly (i.e. if the post-copy stack was capable of
 servicing faults quickly enough to _let_ them be generated this quickly), why
 the subsequent decline to a lower rate for the rest of the phase?

 One possible explanation is that the guest actually isn't generating faults at
 its maximum possible rate at steady-state.  Instead, it could be alternating
 between faulting and making non-trivial progress.  If the post-copy background
 push scheme consistently selected the wrong background pages to push, the time
 required to commit the first application transaction of the post-copy phase
 would then be the _sum_ of the time spent executing this non-trivial work and
 the time required to synchronously fault each of the non-predicted pages in
 sequence.
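 This hypothesis can be stated as a toy model (all parameters below are
 guesses chosen for illustration, not measurements):

```c
/* Toy model of the time to commit the first post-copy transaction
 * under the hypothesis above: the useful work it performs, plus one
 * synchronous fault round trip per mispredicted page it touches.
 * Purely illustrative. */
static double first_txn_seconds(double work_s, int mispredicted_pages,
                                double fault_latency_s)
{
    return work_s + mispredicted_pages * fault_latency_s;
}
```

 For instance, ~1s of real work plus ~2500 mispredicted pages at ~15ms per
 synchronous fault gives first_txn_seconds(1.0, 2500, 0.015) ~= 38.5s, which
 is the right order of magnitude for the ~50s of observed unavailability
 (again: the parameters here are guesses).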

 Since the guest transitions relatively suddenly from 0tps to 175tps (65% of the
 full 275tps after recovery), I infer that this set of poorly-predicted
 necessary pages is common between transactions.  As a result, once it has been
 faulted over from the first transaction or two, all subsequent transactions can
 proceed quickly.

 This discussion raises several questions:
  1) Does it make sense that the background pre-paging scheme made poor
     predictions in the context of the application's actual memory access
     stream?
  2) Would alternative pre-paging schemes have made better predictions?
  3) How much would those better predictions improve application performance?

 To answer 1), I conducted a further experiment: I prepared a more invasive
 additional tracing patch that disabled the background push logic for the first
 90 seconds of the post-copy phase and logged every individual faulting pfn to
 visualize the guest memory access stream over time.  I obtained these traces:

 Figure 14 (i-v):
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-1.pdf
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-2.pdf
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-3.pdf
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-4.pdf
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure14-5.pdf

 These traces permit a number of observations:
  a) Over the first ~5 seconds we can very clearly see a large number of
     physically-clustered high PFNs faulting in rapid succession.  I believe
     this is the device model establishment.
  b) For the next ~10 seconds, we can see lots of faults in three physical PFN
     regions: very low memory, 150000-250000, and 475000-525000, with the latter
     seeing the most faults.
  c) At the ~15 second mark, the pattern then shifts: the 475000-525000
     region continues to fill in, and a long, descending physical page scan
     begins, either starting from 500000 or from 175000 and 'wrapping around'
     at the bottom.

 This reveals that there _is_ reasonable physical locality in the access stream
 available for exploitation.  In particular, I speculate that the long
 descending scan corresponds to a database table scan.  However, the scheme
 implemented in the current version of the patch, simply scanning _forward_ from
 the last physical pfn to fault, is perhaps not clever enough to fully take
 advantage of it.
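 The scheme as implemented amounts to something like the following sketch
 (illustrative only, not the actual libxc code; the `outstanding` array is a
 hypothetical bitmap of pfns not yet migrated):

```c
/* Sketch of the forward-scan background push policy described above:
 * starting from the most recently faulted pfn, scan upward for the
 * next not-yet-migrated page, wrapping around at the end of the pfn
 * space.  A descending access stream defeats this policy, because the
 * scan moves away from the pages the guest will want next. */
static long next_background_pfn(const unsigned char *outstanding,
                                long nr_pfns, long last_fault_pfn)
{
    for (long i = 1; i <= nr_pfns; i++) {
        long pfn = (last_fault_pfn + i) % nr_pfns;
        if (outstanding[pfn])
            return pfn;
    }
    return -1; /* nothing left to push */
}
```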

 In "Post-Copy Live Migration of Virtual Machines", a number of more clever
 'bubbling' pre-paging strategies are discussed that I imagine would have done
 better.  The approach described in "A Novel Hybrid-Copy Algorithm for Live
 Migration of Virtual Machines" [q] also seems like it might have worked well
 (though it's not as easy to eyeball).  In principle, I think this answers
 question 2) in the affirmative.
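 By contrast, the 'bubbling' strategies expand outward from the last-faulted
 'pivot' pfn in both directions, which would track the descending scan
 observed above.  A rough sketch (my own reading of the idea, not code from
 the paper; `outstanding` is again a hypothetical bitmap of not-yet-migrated
 pfns):

```c
/* Rough sketch of bidirectional 'bubbling' pre-paging: try pages at
 * increasing distances below and above the pivot pfn, so that locality
 * in either direction is exploited. */
static long next_bubble_pfn(const unsigned char *outstanding,
                            long nr_pfns, long pivot)
{
    for (long d = 1; d < nr_pfns; d++) {
        if (pivot - d >= 0 && outstanding[pivot - d])
            return pivot - d;
        if (pivot + d < nr_pfns && outstanding[pivot + d])
            return pivot + d;
    }
    return -1; /* nothing left to push */
}
```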

 However, although I didn't have time to implement and experimentally evaluate
 these alternatives, I think it's fairly safe to say that even with perfect
 prediction the application-level downtime entailed by this approach would still
 be worse than for Algorithm A, since such a large set of pages appears to be
 necessary to permit the application to make even an increment of
 externally-visible progress.

 -- Aside:

 Having collected all of the same timing data in this further experiment as I
 did in the first set, I decided to take a look at the fault stats and
 application performance plots and was able to make some interesting
 observations:

 Figure 15:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure15.pdf
 Figure 16:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure16.pdf
 Figure 17:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure17.pdf

 It may not be obvious unless you line it up next to the corresponding plots
 from the first experiment, but:
  - all of the per-fault latencies are reduced by an order of magnitude, with
    the mean falling from ~10ms to ~3ms
  - the fault service rate is increased massively, by a factor of between 4
    and 24

 This makes some sense: in disabling all background pushing, I also disabled the
 logic that 'filled out' the remainder of a batch servicing a fault with other
 pages in its locality, so I'd expect each individual fault request to be
 serviced more quickly and consequently that the guest would be able to generate
 subsequent ones faster.  However, neither of these translates into better
 application performance:

 Figure 18:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure18.pdf

 So, the batching and weak prediction logic implemented in the patch in its
 current state are clearly worth _something_.

 --

3.2.2.3 Algorithm B vs. D: post-copy after a single pre-copy iteration

 The final post-copy variant, Algorithm D, appears to have performed reasonably
 according to its phase timings in Figure 2:
  - At 2.45s, its raw Downtime is slightly less than that of A
  - At 63.47s, its Preparation Time is ~20% less than A's
  - With 10.26s of Resume Time, its Total Migration Time is a modest ~6% less
    than A's

 Algorithm B, its pre-copy-only counterpart, fared less well, with the same
 Preparation Time as D but with the worst outright Downtime of any variant at
 11s.

 Judging by Figures 3 and 5, they both ended their single pre-copy iteration
 with ~90k pages (~351 MiB) dirty.

 The real question, of course, is how their Application Degradation results
 compare to those of A and E:

 Figure 19:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure19.pdf
 Figure 20:
    https://github.com/jtotto/xen-postcopy-figures/blob/master/figure20.pdf

 These results show that Algorithm D's Application Degradation is only
 moderately worse than A's, while B's is almost twice as bad.  Algorithm D
 therefore seems like a potentially useful alternative to A in situations where
 it would be useful to trade a moderate increase in Application Degradation for
 a more significant decrease in Preparation Time.

3.3 Further Experiments

 I only had time to conduct the experiment described in the previous section,
 but there are a number of further experiments that I think would be worth
 conducting if I had time.

 Collecting the same data as in the previous experiment against workloads other
 than pgbench would be my first priority.  Although I tried to choose the
 workload to be as write-heavy as possible while still being realistic, the
 results clearly demonstrated it was fairly amenable to pre-copy anyway.  If a
 non-synthetic workload with an even heavier write-mix were evaluated, post-copy
 might enjoy a more clear advantage.

 Identifying a workload with a more granular increment of progress might also
 demonstrate the ability of post-copy to trade outright Downtime for Application
 Degradation.  As the experiment showed, in the case of pgbench even a single
 transaction required a large subset of all guest pages to complete.

 Moving beyond evaluating the patch in its current state, there are many
 possible post-copy pre-paging schemes in the literature that it could be
 augmented to implement, and it would be interesting to evaluate each of them in
 the same way.

 For all of the above experiments, as well as the pgbench one, it would also be
 _very_ interesting to conduct them on more production-realistic hardware, to
 see how shifting the bottleneck from the CPU of the migrating machines to the
 network would affect the results.

4. Conclusion

 In this document, I've described the design and implementation of a proposed
 change to introduce post-copy memory migration support for Xen.  I then
 presented and interpreted the results of performance measurements I collected
 during experiments testing the change.

 In my opinion, the data so far suggest that:
  1) Pure post-copy is probably only useful in situations where you would be
     okay with performing an outright stop-and-copy, but would like to try to do
     a little better if possible.
  2) Hybrid post-copy does seem to perform marginally better than pre-copy
     alone, but this comes at the cost of both additional code complexity and
     worse reliability characteristics.  The costs of the latter seem to me like
     they probably outweigh the former benefit...

     I think it would be very interesting to further investigate how much
     downtime is spent on device model establishment in production set-ups.  If
     it turns out to be as significant as it was in my experiments, a very
     limited form of post-copy that only permits the memory migration to proceed
     in parallel with the device model/etc. setup without actually unpausing the
     guest until it completes could be worth investigating, as it could reduce
     downtime without adversely affecting reliability.

5. References
 [a] migration.pandoc
     https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/features/migration.pandoc

 [b] domcreate_stream_done()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1127-L1204

 [c] domcreate_rebuild_done()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1206-L1234

 [d] domcreate_launch_dm()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1236-L1405

 [e] domcreate_devmodel_started()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1489-L1519

 [f] domcreate_attach_devices()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1446-L1487

 [g] domcreate_complete()
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_create.c#L1521-L1569

 [h] Post-copy patches v2
     https://github.com/jtotto/xen/commits/postcopy-v2

 [i] QEMU Post-Copy Live Migration
     https://wiki.qemu.org/Features/PostCopyLiveMigration

 [k] xenpaging
     https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenpaging.txt

 [s] AO_INPROGRESS
     https://github.com/xen-project/xen/blob/0a0dcdcd20e9711cbfb08db5b21af5299ee1eb8b/tools/libxl/libxl_domain.c#L520

 [t] Live Migration of Virtual Machines
     http://www.cl.cam.ac.uk/research/srg/netos/papers/2005-migration-nsdi-pre.pdf

 [u] Post-Copy Live Migration of Virtual Machines
     https://kartikgopalan.github.io/publications/hines09postcopy_osr.pdf

 [v] Downtime Analysis of Virtual Machine Live Migration
     https://citemaster.net/get/e61b2d78-b400-11e3-91be-00163e009cc7/salfner11downtime.pdf

 [x] pgbench 
     https://www.postgresql.org/docs/10/static/pgbench.html

 [z] Intel NUC Kit NUC5CPYH
     https://ark.intel.com/products/85254/Intel-NUC-Kit-NUC5CPYH

 [l] Cisco Meraki MS220-8P
     https://meraki.cisco.com/products/switches/ms220-8

 [q] A Novel Hybrid-Copy Algorithm for Live Migration of Virtual Machine
     http://www.mdpi.com/1999-5903/9/3/37

 [m] 
    $ sudo xl info
    [sudo] password for fydp:
    host                   : fydp
    release                : 3.16.0-4-amd64
    version                : #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19)
    machine                : x86_64
    nr_cpus                : 2
    max_cpu_id             : 1
    nr_nodes               : 1
    cores_per_socket       : 2
    threads_per_core       : 1
    cpu_mhz                : 1599
    hw_caps                : bfebfbff:43d8e3bf:28100800:00000101:00000000:00002282:00000000:00000100
    virt_caps              : hvm
    total_memory           : 8112
    free_memory            : 2066
    sharing_freed_memory   : 0
    sharing_used_memory    : 0
    outstanding_claims     : 0
    free_cpus              : 0
    xen_major              : 4
    xen_minor              : 9
    xen_extra              : -rc
    xen_version            : 4.9-rc
    xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    xen_scheduler          : credit
    xen_pagesize           : 4096
    platform_params        : virt_start=0xffff800000000000
    xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline        : placeholder altp2m=1
    cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by          : jtotto
    cc_compile_domain      :
    cc_compile_date        : Sat May 27 18:29:17 EDT 2017
    build_id               : fc017c8cf375bbe7464c5be8fff2d3fd2e08cbaa
    xend_config_format     : 4

 [n] 
    $ sudo xl info
    [sudo] password for fydp:
    host                   : fydp
    release                : 3.16.0-4-amd64
    version                : #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07)
    machine                : x86_64
    nr_cpus                : 2
    max_cpu_id             : 1
    nr_nodes               : 1
    cores_per_socket       : 2
    threads_per_core       : 1
    cpu_mhz                : 1599
    hw_caps                : bfebfbff:43d8e3bf:28100800:00000101:00000000:00002282:00000000:00000100
    virt_caps              : hvm
    total_memory           : 8112
    free_memory            : 128
    sharing_freed_memory   : 0
    sharing_used_memory    : 0
    outstanding_claims     : 0
    free_cpus              : 0
    xen_major              : 4
    xen_minor              : 9
    xen_extra              : -rc
    xen_version            : 4.9-rc
    xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    xen_scheduler          : credit
    xen_pagesize           : 4096
    platform_params        : virt_start=0xffff800000000000
    xen_changeset          : Fri May 12 23:17:29 2017 -0400 git:c6ed26e
    xen_commandline        : placeholder no-real-mode edd=off
    cc_compiler            : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    cc_compile_by          : jtotto
    cc_compile_domain      :
    cc_compile_date        : Sat May 27 18:29:17 EDT 2017
    build_id               : fc017c8cf375bbe7464c5be8fff2d3fd2e08cbaa
    xend_config_format     : 4

 [o] experiment.sh

# Repeat each experiment 5 times.
for i in {1..5};
do
    echo "Experiment iteration $i"

    for experiment in a b c d e
    do
        echo "Conducting experiment $experiment"

        # First, spin up the test VM to be migrated.
        while true
        do
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl create /home/fydp/vms/multideb.cfg' && break
            sleep 5
        done

        # Wait for the test VM to become accessible.
        echo 'Booting test VM...'
        while true
        do
            ssh -i ~/.ssh/waterloo root@192.168.2.67 echo && break
            sleep 1
        done

        # Initialize the test database.
        pgbench -h 192.168.2.67 -U postgres -i bench -s 70

        # Begin running the test in the background.
        pgbench -h 192.168.2.67 -U postgres -c 4 -j 1 -T 180 -l \
            --aggregate-interval 1 bench &

        # After 20 seconds...
        sleep 20

        # Initiate the migration.
        echo "Starting the migration..."
        case $experiment in
        a)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        b)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 1 debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        c)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 0 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        d)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 1 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        e)
            ssh -p 1337 -i ~/.ssh/waterloo root@192.168.2.64 \
                'xl migrate --precopy-iterations 5 --postcopy debvm 192.168.2.63' \
                > pgbench-$experiment-$i.log 2>&1
            ;;
        esac

        # Wait for the benchmark to complete.
        echo "Migration complete."
        wait

        # Rename the benchmark log to something more useful.
        mv pgbench_log.* pgbench-perf-$experiment-$i.log

        # Shut down the test VM.
        echo "Cleaning up..."
        ssh -i ~/.ssh/waterloo root@192.168.2.67 \
            '(sleep 10 && shutdown -h now) < /dev/null > /dev/null 2>&1 &'

        # Wait for it to really be down.
        sleep 20
    done
done

echo 'All done'

 [p] Post-copy tracing patches
     https://github.com/jtotto/xen/commits/postcopy-tracing

Joshua Otto (23):
  tools: rename COLO 'postcopy' to 'aftercopy'
  libxc/xc_sr: parameterise write_record() on fd
  libxc/xc_sr_restore.c: use write_record() in
    send_checkpoint_dirty_pfn_list()
  libxc/xc_sr: naming correction: mfns -> gfns
  libxc/xc_sr_restore: introduce generic 'pages' records
  libxc/xc_sr_restore: factor helpers out of handle_page_data()
  libxc/migration: tidy the xc_domain_save()/restore() interface
  libxc/migration: defer precopy policy to a callback
  libxl/migration: wire up the precopy policy RPC callback
  libxc/xc_sr_save: introduce save batch types
  libxc/migration: correct hvm record ordering specification
  libxc/migration: specify postcopy live migration
  libxc/migration: add try_read_record()
  libxc/migration: implement the sender side of postcopy live migration
  libxc/migration: implement the receiver side of postcopy live
    migration
  libxl/libxl_stream_write.c: track callback chains with an explicit
    phase
  libxl/libxl_stream_read.c: track callback chains with an explicit
    phase
  libxl/migration: implement the sender side of postcopy live migration
  libxl/migration: implement the receiver side of postcopy live
    migration
  tools: expose postcopy live migration support in libxl and xl
  xen/mem_paging: move paging op arguments into a union
  xen/mem_paging: add a populate_evicted paging op
  libxc/xc_sr_restore.c: use populate_evicted()

 docs/specs/libxc-migration-stream.pandoc |  182 +++-
 docs/specs/libxl-migration-stream.pandoc |   19 +-
 tools/libxc/include/xenctrl.h            |    2 +
 tools/libxc/include/xenguest.h           |  237 +++---
 tools/libxc/xc_mem_paging.c              |   39 +-
 tools/libxc/xc_nomigrate.c               |   16 +-
 tools/libxc/xc_private.c                 |   21 +-
 tools/libxc/xc_private.h                 |    2 +
 tools/libxc/xc_sr_common.c               |  116 ++-
 tools/libxc/xc_sr_common.h               |  170 +++-
 tools/libxc/xc_sr_common_x86.c           |    2 +-
 tools/libxc/xc_sr_restore.c              | 1321 +++++++++++++++++++++++++-----
 tools/libxc/xc_sr_restore_x86_hvm.c      |   41 +-
 tools/libxc/xc_sr_save.c                 |  903 ++++++++++++++++----
 tools/libxc/xc_sr_save_x86_hvm.c         |   18 +-
 tools/libxc/xc_sr_save_x86_pv.c          |   17 +-
 tools/libxc/xc_sr_stream_format.h        |   15 +-
 tools/libxc/xg_save_restore.h            |   16 +-
 tools/libxl/libxl.h                      |   40 +-
 tools/libxl/libxl_colo_restore.c         |    2 +-
 tools/libxl/libxl_colo_save.c            |    2 +-
 tools/libxl/libxl_create.c               |  191 ++++-
 tools/libxl/libxl_dom_save.c             |   71 +-
 tools/libxl/libxl_domain.c               |   33 +-
 tools/libxl/libxl_internal.h             |   80 +-
 tools/libxl/libxl_remus.c                |    2 +-
 tools/libxl/libxl_save_callout.c         |   12 +-
 tools/libxl/libxl_save_helper.c          |   60 +-
 tools/libxl/libxl_save_msgs_gen.pl       |   10 +-
 tools/libxl/libxl_sr_stream_format.h     |   13 +-
 tools/libxl/libxl_stream_read.c          |  136 ++-
 tools/libxl/libxl_stream_write.c         |  161 ++--
 tools/ocaml/libs/xl/xenlight_stubs.c     |    2 +-
 tools/xl/xl.h                            |    7 +-
 tools/xl/xl_cmdtable.c                   |    3 +
 tools/xl/xl_migrate.c                    |   65 +-
 tools/xl/xl_vmcontrol.c                  |    8 +-
 xen/arch/x86/mm.c                        |    5 +-
 xen/arch/x86/mm/mem_paging.c             |   40 +-
 xen/arch/x86/mm/p2m.c                    |  101 +++
 xen/arch/x86/x86_64/compat/mm.c          |    6 +-
 xen/arch/x86/x86_64/mm.c                 |    6 +-
 xen/include/asm-x86/mem_paging.h         |    3 +-
 xen/include/asm-x86/p2m.h                |    2 +
 xen/include/public/memory.h              |   25 +-
 45 files changed, 3489 insertions(+), 734 deletions(-)

-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH RFC v2 01/23] tools: rename COLO 'postcopy' to 'aftercopy'
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

The COLO xc domain save and restore procedures both make use of a 'postcopy'
callback to defer part of each checkpoint operation to xl.  In this context, the
name 'postcopy' is meant as "the callback invoked immediately after this
checkpoint's memory callback."  This is an unfortunate name collision with the
other common use of 'postcopy' in the context of live migration, where it
means "a memory migration that permits the guest to execute at the
destination before all of its memory has been transferred, by servicing
accesses to unmigrated memory via network page-faults."

Mechanically rename 'postcopy' -> 'aftercopy' to free up the postcopy namespace
while preserving the original intent of the name in the COLO context.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
Acked-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
---
 tools/libxc/include/xenguest.h     | 4 ++--
 tools/libxc/xc_sr_restore.c        | 4 ++--
 tools/libxc/xc_sr_save.c           | 4 ++--
 tools/libxl/libxl_colo_restore.c   | 2 +-
 tools/libxl/libxl_colo_save.c      | 2 +-
 tools/libxl/libxl_remus.c          | 2 +-
 tools/libxl/libxl_save_msgs_gen.pl | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40902ee..aa8cc8b 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -53,7 +53,7 @@ struct save_callbacks {
      * xc_domain_save then flushes the output buffer, while the
      *  guest continues to run.
      */
-    int (*postcopy)(void* data);
+    int (*aftercopy)(void* data);
 
     /* Called after the memory checkpoint has been flushed
      * out into the network. Typical actions performed in this
@@ -115,7 +115,7 @@ struct restore_callbacks {
      * Callback function resumes the guest & the device model,
      * returns to xc_domain_restore.
      */
-    int (*postcopy)(void* data);
+    int (*aftercopy)(void* data);
 
     /* A checkpoint record has been found in the stream.
      * returns: */
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 3549f0a..ee06b3d 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -576,7 +576,7 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
                                                 ctx->restore.callbacks->data);
 
         /* Resume secondary vm */
-        ret = ctx->restore.callbacks->postcopy(ctx->restore.callbacks->data);
+        ret = ctx->restore.callbacks->aftercopy(ctx->restore.callbacks->data);
         HANDLE_CALLBACK_RETURN_VALUE(ret);
 
         /* Wait for a new checkpoint */
@@ -855,7 +855,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     {
         /* this is COLO restore */
         assert(callbacks->suspend &&
-               callbacks->postcopy &&
+               callbacks->aftercopy &&
                callbacks->wait_checkpoint &&
                callbacks->restore_results);
     }
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index ca6913b..3837bc1 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -863,7 +863,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
                 }
             }
 
-            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            rc = ctx->save.callbacks->aftercopy(ctx->save.callbacks->data);
             if ( rc <= 0 )
                 goto err;
 
@@ -951,7 +951,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
     if ( hvm )
         assert(callbacks->switch_qemu_logdirty);
     if ( ctx.save.checkpointed )
-        assert(callbacks->checkpoint && callbacks->postcopy);
+        assert(callbacks->checkpoint && callbacks->aftercopy);
     if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
         assert(callbacks->wait_checkpoint);
 
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index 0c535bd..7d8f9ff 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -246,7 +246,7 @@ void libxl__colo_restore_setup(libxl__egc *egc,
     if (init_dsps(&crcs->dsps))
         goto out;
 
-    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
+    callbacks->aftercopy = libxl__colo_restore_domain_resume_callback;
     callbacks->wait_checkpoint = libxl__colo_restore_domain_wait_checkpoint_callback;
     callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
     callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index f687d5a..5921196 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -145,7 +145,7 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
 
     callbacks->suspend = libxl__colo_save_domain_suspend_callback;
     callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
-    callbacks->postcopy = libxl__colo_save_domain_resume_callback;
+    callbacks->aftercopy = libxl__colo_save_domain_resume_callback;
     callbacks->wait_checkpoint = libxl__colo_save_domain_wait_checkpoint_callback;
 
     libxl__checkpoint_devices_setup(egc, &dss->cds);
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index 29a4783..1453365 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -110,7 +110,7 @@ void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs)
     dss->sws.checkpoint_callback = remus_checkpoint_stream_written;
 
     callbacks->suspend = libxl__remus_domain_suspend_callback;
-    callbacks->postcopy = libxl__remus_domain_resume_callback;
+    callbacks->aftercopy = libxl__remus_domain_resume_callback;
     callbacks->checkpoint = libxl__remus_domain_save_checkpoint_callback;
 
     libxl__checkpoint_devices_setup(egc, cds);
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 3ae7373..27845bb 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -24,7 +24,7 @@ our @msgs = (
                                                 'unsigned long', 'done',
                                                 'unsigned long', 'total'] ],
     [  3, 'srcxA',  "suspend", [] ],
-    [  4, 'srcxA',  "postcopy", [] ],
+    [  4, 'srcxA',  "aftercopy", [] ],
     [  5, 'srcxA',  "checkpoint", [] ],
     [  6, 'srcxA',  "wait_checkpoint", [] ],
     [  7, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
-- 
2.7.4




* [PATCH RFC v2 02/23] libxc/xc_sr: parameterise write_record() on fd
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Right now, write_split_record() - which write_record() delegates to -
implicitly writes to ctx->fd.  This means it can't be used with the
restore context's send_back_fd, which is inconvenient.

Add an 'fd' parameter to both write_record() and write_split_record(),
and mechanically update all existing callsites to pass ctx->fd for it.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_common.c       |  6 +++---
 tools/libxc/xc_sr_common.h       |  8 ++++----
 tools/libxc/xc_sr_common_x86.c   |  2 +-
 tools/libxc/xc_sr_save.c         |  6 +++---
 tools/libxc/xc_sr_save_x86_hvm.c |  5 +++--
 tools/libxc/xc_sr_save_x86_pv.c  | 17 +++++++++--------
 6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 48fa676..c1babf6 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -52,8 +52,8 @@ const char *rec_type_to_str(uint32_t type)
     return "Reserved";
 }
 
-int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
-                       void *buf, size_t sz)
+int write_split_record(struct xc_sr_context *ctx, int fd,
+                       struct xc_sr_record *rec, void *buf, size_t sz)
 {
     static const char zeroes[(1u << REC_ALIGN_ORDER) - 1] = { 0 };
 
@@ -81,7 +81,7 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
     if ( sz )
         assert(buf);
 
-    if ( writev_exact(ctx->fd, parts, ARRAY_SIZE(parts)) )
+    if ( writev_exact(fd, parts, ARRAY_SIZE(parts)) )
         goto err;
 
     return 0;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index a83f22a..2f33ccc 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -361,8 +361,8 @@ struct xc_sr_record
  *
  * Returns 0 on success and non0 on failure.
  */
-int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
-                       void *buf, size_t sz);
+int write_split_record(struct xc_sr_context *ctx, int fd,
+                       struct xc_sr_record *rec, void *buf, size_t sz);
 
 /*
  * Writes a record to the stream, applying correct padding where appropriate.
@@ -371,10 +371,10 @@ int write_split_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
  *
  * Returns 0 on success and non0 on failure.
  */
-static inline int write_record(struct xc_sr_context *ctx,
+static inline int write_record(struct xc_sr_context *ctx, int fd,
                                struct xc_sr_record *rec)
 {
-    return write_split_record(ctx, rec, NULL, 0);
+    return write_split_record(ctx, fd, rec, NULL, 0);
 }
 
 /*
diff --git a/tools/libxc/xc_sr_common_x86.c b/tools/libxc/xc_sr_common_x86.c
index 98f1cef..7b3dd50 100644
--- a/tools/libxc/xc_sr_common_x86.c
+++ b/tools/libxc/xc_sr_common_x86.c
@@ -18,7 +18,7 @@ int write_tsc_info(struct xc_sr_context *ctx)
         return -1;
     }
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 int handle_tsc_info(struct xc_sr_context *ctx, struct xc_sr_record *rec)
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 3837bc1..8aba0d8 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -53,7 +53,7 @@ static int write_end_record(struct xc_sr_context *ctx)
 {
     struct xc_sr_record end = { REC_TYPE_END, 0, NULL };
 
-    return write_record(ctx, &end);
+    return write_record(ctx, ctx->fd, &end);
 }
 
 /*
@@ -63,7 +63,7 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 {
     struct xc_sr_record checkpoint = { REC_TYPE_CHECKPOINT, 0, NULL };
 
-    return write_record(ctx, &checkpoint);
+    return write_record(ctx, ctx->fd, &checkpoint);
 }
 
 /*
@@ -646,7 +646,7 @@ static int verify_frames(struct xc_sr_context *ctx)
 
     DPRINTF("Enabling verify mode");
 
-    rc = write_record(ctx, &rec);
+    rc = write_record(ctx, ctx->fd, &rec);
     if ( rc )
         goto out;
 
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index fc5c6ea..54ddbfe 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -42,7 +42,7 @@ static int write_hvm_context(struct xc_sr_context *ctx)
     }
 
     hvm_rec.length = hvm_buf_size;
-    rc = write_record(ctx, &hvm_rec);
+    rc = write_record(ctx, ctx->fd, &hvm_rec);
     if ( rc < 0 )
     {
         PERROR("error write HVM_CONTEXT record");
@@ -116,7 +116,8 @@ static int write_hvm_params(struct xc_sr_context *ctx)
     if ( hdr.count == 0 )
         return 0;
 
-    rc = write_split_record(ctx, &rec, entries, hdr.count * sizeof(*entries));
+    rc = write_split_record(ctx, ctx->fd, &rec, entries,
+                            hdr.count * sizeof(*entries));
     if ( rc )
         PERROR("Failed to write HVM_PARAMS record");
 
diff --git a/tools/libxc/xc_sr_save_x86_pv.c b/tools/libxc/xc_sr_save_x86_pv.c
index 36b1058..5f9b97d 100644
--- a/tools/libxc/xc_sr_save_x86_pv.c
+++ b/tools/libxc/xc_sr_save_x86_pv.c
@@ -571,9 +571,9 @@ static int write_one_vcpu_basic(struct xc_sr_context *ctx, uint32_t id)
     }
 
     if ( ctx->x86_pv.width == 8 )
-        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x64));
+        rc = write_split_record(ctx, ctx->fd, &rec, &vcpu, sizeof(vcpu.x64));
     else
-        rc = write_split_record(ctx, &rec, &vcpu, sizeof(vcpu.x32));
+        rc = write_split_record(ctx, ctx->fd, &rec, &vcpu, sizeof(vcpu.x32));
 
  err:
     return rc;
@@ -613,7 +613,7 @@ static int write_one_vcpu_extended(struct xc_sr_context *ctx, uint32_t id)
     if ( domctl.u.ext_vcpucontext.size == 0 )
         return 0;
 
-    return write_split_record(ctx, &rec, &domctl.u.ext_vcpucontext,
+    return write_split_record(ctx, ctx->fd, &rec, &domctl.u.ext_vcpucontext,
                               domctl.u.ext_vcpucontext.size);
 }
 
@@ -672,7 +672,8 @@ static int write_one_vcpu_xsave(struct xc_sr_context *ctx, uint32_t id)
     if ( domctl.u.vcpuextstate.size == 0 )
         goto out;
 
-    rc = write_split_record(ctx, &rec, buffer, domctl.u.vcpuextstate.size);
+    rc = write_split_record(ctx, ctx->fd, &rec, buffer,
+                            domctl.u.vcpuextstate.size);
     if ( rc )
         goto err;
 
@@ -742,7 +743,7 @@ static int write_one_vcpu_msrs(struct xc_sr_context *ctx, uint32_t id)
     if ( domctl.u.vcpu_msrs.msr_count == 0 )
         goto out;
 
-    rc = write_split_record(ctx, &rec, buffer,
+    rc = write_split_record(ctx, ctx->fd, &rec, buffer,
                             domctl.u.vcpu_msrs.msr_count *
                             sizeof(xen_domctl_vcpu_msr_t));
     if ( rc )
@@ -817,7 +818,7 @@ static int write_x86_pv_info(struct xc_sr_context *ctx)
             .data = &info
         };
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 /*
@@ -858,7 +859,7 @@ static int write_x86_pv_p2m_frames(struct xc_sr_context *ctx)
     else
         data = (uint64_t *)ctx->x86_pv.p2m_pfns;
 
-    rc = write_split_record(ctx, &rec, data, datasz);
+    rc = write_split_record(ctx, ctx->fd, &rec, data, datasz);
 
     if ( data != (uint64_t *)ctx->x86_pv.p2m_pfns )
         free(data);
@@ -878,7 +879,7 @@ static int write_shared_info(struct xc_sr_context *ctx)
         .data = ctx->x86_pv.shinfo,
     };
 
-    return write_record(ctx, &rec);
+    return write_record(ctx, ctx->fd, &rec);
 }
 
 /*
-- 
2.7.4




* [PATCH RFC v2 03/23] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list()
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Teach send_checkpoint_dirty_pfn_list() to use write_record()'s new fd
parameter, avoiding the need for a manual writev().

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_restore.c | 27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index ee06b3d..481a904 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -420,7 +420,6 @@ static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
     int rc = -1;
     unsigned count, written;
     uint64_t i, *pfns = NULL;
-    struct iovec *iov = NULL;
     xc_shadow_op_stats_t stats = { 0, ctx->restore.p2m_size };
     struct xc_sr_record rec =
     {
@@ -467,35 +466,17 @@ static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
         pfns[written++] = i;
     }
 
-    /* iovec[] for writev(). */
-    iov = malloc(3 * sizeof(*iov));
-    if ( !iov )
-    {
-        ERROR("Unable to allocate memory for sending dirty bitmap");
-        goto err;
-    }
-
+    rec.data = pfns;
     rec.length = count * sizeof(*pfns);
 
-    iov[0].iov_base = &rec.type;
-    iov[0].iov_len = sizeof(rec.type);
-
-    iov[1].iov_base = &rec.length;
-    iov[1].iov_len = sizeof(rec.length);
-
-    iov[2].iov_base = pfns;
-    iov[2].iov_len = count * sizeof(*pfns);
-
-    if ( writev_exact(ctx->restore.send_back_fd, iov, 3) )
-    {
-        PERROR("Failed to write dirty bitmap to stream");
+    rc = write_record(ctx, ctx->restore.send_back_fd, &rec);
+    if ( rc )
         goto err;
-    }
 
     rc = 0;
+
  err:
     free(pfns);
-    free(iov);
     return rc;
 }
 
-- 
2.7.4




* [PATCH RFC v2 04/23] libxc/xc_sr: naming correction: mfns -> gfns
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

In write_batch() on the migration save side and in process_page_data()
on the corresponding path on the restore side, a local variable named
'mfns' is used to refer to an array of what are actually gfns.  Rename
both to 'gfns' to address this.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxc/xc_sr_restore.c | 16 ++++++++--------
 tools/libxc/xc_sr_save.c    | 20 ++++++++++----------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 481a904..2f35f4d 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -203,7 +203,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
                              xen_pfn_t *pfns, uint32_t *types, void *page_data)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t *mfns = malloc(count * sizeof(*mfns));
+    xen_pfn_t *gfns = malloc(count * sizeof(*gfns));
     int *map_errs = malloc(count * sizeof(*map_errs));
     int rc;
     void *mapping = NULL, *guest_page = NULL;
@@ -211,11 +211,11 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         j,         /* j indexes the subset of pfns we decide to map. */
         nr_pages = 0;
 
-    if ( !mfns || !map_errs )
+    if ( !gfns || !map_errs )
     {
         rc = -1;
         ERROR("Failed to allocate %zu bytes to process page data",
-              count * (sizeof(*mfns) + sizeof(*map_errs)));
+              count * (sizeof(*gfns) + sizeof(*map_errs)));
         goto err;
     }
 
@@ -246,7 +246,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         case XEN_DOMCTL_PFINFO_L4TAB:
         case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
 
-            mfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
+            gfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
             break;
         }
     }
@@ -257,11 +257,11 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
 
     mapping = guest_page = xenforeignmemory_map(xch->fmem,
         ctx->domid, PROT_READ | PROT_WRITE,
-        nr_pages, mfns, map_errs);
+        nr_pages, gfns, map_errs);
     if ( !mapping )
     {
         rc = -1;
-        PERROR("Unable to map %u mfns for %u pages of data",
+        PERROR("Unable to map %u gfns for %u pages of data",
                nr_pages, count);
         goto err;
     }
@@ -281,7 +281,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         {
             rc = -1;
             ERROR("Mapping pfn %#"PRIpfn" (mfn %#"PRIpfn", type %#"PRIx32") failed with %d",
-                  pfns[i], mfns[j], types[i], map_errs[j]);
+                  pfns[i], gfns[j], types[i], map_errs[j]);
             goto err;
         }
 
@@ -320,7 +320,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
         xenforeignmemory_unmap(xch->fmem, mapping, nr_pages);
 
     free(map_errs);
-    free(mfns);
+    free(gfns);
 
     return rc;
 }
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 8aba0d8..e93d8fd 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -79,7 +79,7 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 static int write_batch(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t *mfns = NULL, *types = NULL;
+    xen_pfn_t *gfns = NULL, *types = NULL;
     void *guest_mapping = NULL;
     void **guest_data = NULL;
     void **local_pages = NULL;
@@ -98,7 +98,7 @@ static int write_batch(struct xc_sr_context *ctx)
     assert(nr_pfns != 0);
 
     /* Mfns of the batch pfns. */
-    mfns = malloc(nr_pfns * sizeof(*mfns));
+    gfns = malloc(nr_pfns * sizeof(*gfns));
     /* Types of the batch pfns. */
     types = malloc(nr_pfns * sizeof(*types));
     /* Errors from attempting to map the gfns. */
@@ -110,7 +110,7 @@ static int write_batch(struct xc_sr_context *ctx)
     /* iovec[] for writev(). */
     iov = malloc((nr_pfns + 4) * sizeof(*iov));
 
-    if ( !mfns || !types || !errors || !guest_data || !local_pages || !iov )
+    if ( !gfns || !types || !errors || !guest_data || !local_pages || !iov )
     {
         ERROR("Unable to allocate arrays for a batch of %u pages",
               nr_pfns);
@@ -119,11 +119,11 @@ static int write_batch(struct xc_sr_context *ctx)
 
     for ( i = 0; i < nr_pfns; ++i )
     {
-        types[i] = mfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
+        types[i] = gfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
                                                       ctx->save.batch_pfns[i]);
 
         /* Likely a ballooned page. */
-        if ( mfns[i] == INVALID_MFN )
+        if ( gfns[i] == INVALID_MFN )
         {
             set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
             ++ctx->save.nr_deferred_pages;
@@ -148,13 +148,13 @@ static int write_batch(struct xc_sr_context *ctx)
             continue;
         }
 
-        mfns[nr_pages++] = mfns[i];
+        gfns[nr_pages++] = gfns[i];
     }
 
     if ( nr_pages > 0 )
     {
         guest_mapping = xenforeignmemory_map(xch->fmem,
-            ctx->domid, PROT_READ, nr_pages, mfns, errors);
+            ctx->domid, PROT_READ, nr_pages, gfns, errors);
         if ( !guest_mapping )
         {
             PERROR("Failed to map guest pages");
@@ -174,8 +174,8 @@ static int write_batch(struct xc_sr_context *ctx)
 
             if ( errors[p] )
             {
-                ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
-                      ctx->save.batch_pfns[i], mfns[p], errors[p]);
+                ERROR("Mapping of pfn %#"PRIpfn" (gfn %#"PRIpfn") failed %d",
+                      ctx->save.batch_pfns[i], gfns[p], errors[p]);
                 goto err;
             }
 
@@ -271,7 +271,7 @@ static int write_batch(struct xc_sr_context *ctx)
     free(guest_data);
     free(errors);
     free(types);
-    free(mfns);
+    free(gfns);
 
     return rc;
 }
-- 
2.7.4




* [PATCH RFC v2 05/23] libxc/xc_sr_restore: introduce generic 'pages' records
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

The PAGE_DATA migration record type is specified as an array of
uint64_ts encoding pfns and their types, followed by an array of page
contents.  Postcopy live migration introduces a number of record types
sharing the same or a similar format, and it would be convenient to
re-use the code that validates and unpacks such records for each type.
To facilitate this, introduce the generic 'pages' name for this family
of records and rename the PAGE_DATA stream format struct and pfn
encoding masks accordingly.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.c        | 2 +-
 tools/libxc/xc_sr_restore.c       | 6 +++---
 tools/libxc/xc_sr_save.c          | 2 +-
 tools/libxc/xc_sr_stream_format.h | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index c1babf6..08abe9a 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -146,7 +146,7 @@ static void __attribute__((unused)) build_assertions(void)
     BUILD_BUG_ON(sizeof(struct xc_sr_dhdr) != 16);
     BUILD_BUG_ON(sizeof(struct xc_sr_rhdr) != 8);
 
-    BUILD_BUG_ON(sizeof(struct xc_sr_rec_page_data_header)  != 8);
+    BUILD_BUG_ON(sizeof(struct xc_sr_rec_pages_header)      != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_info)       != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_p2m_frames) != 8);
     BUILD_BUG_ON(sizeof(struct xc_sr_rec_x86_pv_vcpu_hdr)   != 8);
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 2f35f4d..fc47a25 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -332,7 +332,7 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
 static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
 {
     xc_interface *xch = ctx->xch;
-    struct xc_sr_rec_page_data_header *pages = rec->data;
+    struct xc_sr_rec_pages_header *pages = rec->data;
     unsigned i, pages_of_data = 0;
     int rc = -1;
 
@@ -368,14 +368,14 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
 
     for ( i = 0; i < pages->count; ++i )
     {
-        pfn = pages->pfn[i] & PAGE_DATA_PFN_MASK;
+        pfn = pages->pfn[i] & REC_PFINFO_PFN_MASK;
         if ( !ctx->restore.ops.pfn_is_valid(ctx, pfn) )
         {
             ERROR("pfn %#"PRIpfn" (index %u) outside domain maximum", pfn, i);
             goto err;
         }
 
-        type = (pages->pfn[i] & PAGE_DATA_TYPE_MASK) >> 32;
+        type = (pages->pfn[i] & REC_PFINFO_TYPE_MASK) >> 32;
         if ( ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) >= 5) &&
              ((type >> XEN_DOMCTL_PFINFO_LTAB_SHIFT) <= 8) )
         {
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index e93d8fd..b1a24b7 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -89,7 +89,7 @@ static int write_batch(struct xc_sr_context *ctx)
     void *page, *orig_page;
     uint64_t *rec_pfns = NULL;
     struct iovec *iov = NULL; int iovcnt = 0;
-    struct xc_sr_rec_page_data_header hdr = { 0 };
+    struct xc_sr_rec_pages_header hdr = { 0 };
     struct xc_sr_record rec =
     {
         .type = REC_TYPE_PAGE_DATA,
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
index 3291b25..32400b2 100644
--- a/tools/libxc/xc_sr_stream_format.h
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -80,15 +80,15 @@ struct xc_sr_rhdr
 #define REC_TYPE_OPTIONAL             0x80000000U
 
 /* PAGE_DATA */
-struct xc_sr_rec_page_data_header
+struct xc_sr_rec_pages_header
 {
     uint32_t count;
     uint32_t _res1;
     uint64_t pfn[0];
 };
 
-#define PAGE_DATA_PFN_MASK  0x000fffffffffffffULL
-#define PAGE_DATA_TYPE_MASK 0xf000000000000000ULL
+#define REC_PFINFO_PFN_MASK  0x000fffffffffffffULL
+#define REC_PFINFO_TYPE_MASK 0xf000000000000000ULL
 
 /* X86_PV_INFO */
 struct xc_sr_rec_x86_pv_info
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC v2 06/23] libxc/xc_sr_restore: factor helpers out of handle_page_data()
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (4 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 05/23] libxc/xc_sr_restore: introduce generic 'pages' records Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 07/23] libxc/migration: tidy the xc_domain_save()/restore() interface Joshua Otto
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

When processing a PAGE_DATA record, the restore code:
1) applies a number of sanity checks to the record's header and size
2) decodes the list of packed page info into pfns and their types
3) populates the pages into the guest and fills in their contents using
   the decoded pfn and type info, via process_page_data()

Steps 1) and 2) are also useful for the other types of pages records
introduced by postcopy live migration, so factor them out into reusable
helper routines.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.c  | 36 ++++++++++++++++++
 tools/libxc/xc_sr_common.h  | 10 +++++
 tools/libxc/xc_sr_restore.c | 89 ++++++++++++++++++++++++++-------------------
 3 files changed, 97 insertions(+), 38 deletions(-)

diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 08abe9a..f443974 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -140,6 +140,42 @@ int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
     return 0;
 };
 
+int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                          uint32_t expected_type)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+
+    if ( rec->type != expected_type )
+    {
+        ERROR("%s record type expected, instead received record of type "
+              "%08x (%s)", rec_type_to_str(expected_type), rec->type,
+              rec_type_to_str(rec->type));
+        return -1;
+    }
+    else if ( rec->length < sizeof(*pages) )
+    {
+        ERROR("%s record truncated: length %u, min %zu",
+              rec_type_to_str(rec->type), rec->length, sizeof(*pages));
+        return -1;
+    }
+    else if ( pages->count < 1 )
+    {
+        ERROR("Expected at least 1 pfn in %s record",
+              rec_type_to_str(rec->type));
+        return -1;
+    }
+    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
+    {
+        ERROR("%s record (length %u) too short to contain %u"
+              " pfns worth of information", rec_type_to_str(rec->type),
+              rec->length, pages->count);
+        return -1;
+    }
+
+    return 0;
+}
+
 static void __attribute__((unused)) build_assertions(void)
 {
     BUILD_BUG_ON(sizeof(struct xc_sr_ihdr) != 24);
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 2f33ccc..b1aa88e 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -392,6 +392,16 @@ static inline int write_record(struct xc_sr_context *ctx, int fd,
 int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
 
 /*
+ * Given a record of one of the page data types, validate it by:
+ * - checking its actual type against its specific expected type
+ * - sanity checking its actual length against its claimed length
+ *
+ * Returns 0 on success and non-0 on failure.
+ */
+int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
+                          uint32_t expected_type);
+
+/*
  * This would ideally be private in restore.c, but is needed by
  * x86_pv_localise_page() if we receive pagetables frames ahead of the
  * contents of the frames they point at.
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index fc47a25..00fad7d 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -326,45 +326,21 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
 }
 
 /*
- * Validate a PAGE_DATA record from the stream, and pass the results to
- * process_page_data() to actually perform the legwork.
+ * Given a PAGE_DATA record, decode each packed entry into its encoded pfn and
+ * type, storing the results in the pfns and types buffers.
+ *
+ * Returns the number of pages of real data, or < 0 on error.
  */
-static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+static int decode_pages_record(struct xc_sr_context *ctx,
+                               struct xc_sr_rec_pages_header *pages,
+                               /* OUT */ xen_pfn_t *pfns,
+                               /* OUT */ uint32_t *types)
 {
     xc_interface *xch = ctx->xch;
-    struct xc_sr_rec_pages_header *pages = rec->data;
-    unsigned i, pages_of_data = 0;
-    int rc = -1;
-
-    xen_pfn_t *pfns = NULL, pfn;
-    uint32_t *types = NULL, type;
-
-    if ( rec->length < sizeof(*pages) )
-    {
-        ERROR("PAGE_DATA record truncated: length %u, min %zu",
-              rec->length, sizeof(*pages));
-        goto err;
-    }
-    else if ( pages->count < 1 )
-    {
-        ERROR("Expected at least 1 pfn in PAGE_DATA record");
-        goto err;
-    }
-    else if ( rec->length < sizeof(*pages) + (pages->count * sizeof(uint64_t)) )
-    {
-        ERROR("PAGE_DATA record (length %u) too short to contain %u"
-              " pfns worth of information", rec->length, pages->count);
-        goto err;
-    }
-
-    pfns = malloc(pages->count * sizeof(*pfns));
-    types = malloc(pages->count * sizeof(*types));
-    if ( !pfns || !types )
-    {
-        ERROR("Unable to allocate enough memory for %u pfns",
-              pages->count);
-        goto err;
-    }
+    unsigned int i;
+    int pages_of_data = 0;
+    xen_pfn_t pfn;
+    uint32_t type;
 
     for ( i = 0; i < pages->count; ++i )
     {
@@ -384,14 +360,51 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
             goto err;
         }
         else if ( type < XEN_DOMCTL_PFINFO_BROKEN )
-            /* NOTAB and all L1 through L4 tables (including pinned) should
-             * have a page worth of data in the record. */
+            /* NOTAB and all L1 through L4 tables (including pinned) require the
+             * migration of a page of real data. */
             pages_of_data++;
 
         pfns[i] = pfn;
         types[i] = type;
     }
 
+    return pages_of_data;
+
+ err:
+    return -1;
+}
+
+/*
+ * Validate a PAGE_DATA record from the stream, and pass the results to
+ * process_page_data() to actually perform the legwork.
+ */
+static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    int pages_of_data;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_PAGE_DATA);
+    if ( rc )
+        goto err;
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        ERROR("Unable to allocate enough memory for %u pfns",
+              pages->count);
+        goto err;
+    }
+
+    pages_of_data = decode_pages_record(ctx, pages, pfns, types);
+    if ( pages_of_data < 0 )
+        goto err;
+
     if ( rec->length != (sizeof(*pages) +
                          (sizeof(uint64_t) * pages->count) +
                          (PAGE_SIZE * pages_of_data)) )
-- 
2.7.4



* [PATCH RFC v2 07/23] libxc/migration: tidy the xc_domain_save()/restore() interface
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (5 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 06/23] libxc/xc_sr_restore: factor helpers out of handle_page_data() Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 08/23] libxc/migration: defer precopy policy to a callback Joshua Otto
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Both xc_domain_save() and xc_domain_restore() take a large number of
parameters, including several boolean parameters split between a
bitfield 'flags' argument and separate individual boolean arguments.
Further, many of these arguments are dead/ignored.

Tidy the interface to these functions by collecting the parameters into
a structure assembled by the caller and passed by pointer, and by
dropping the dead parameters.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h   | 68 ++++++++++++++++++++--------------------
 tools/libxc/xc_nomigrate.c       | 16 +++-------
 tools/libxc/xc_sr_common.h       |  4 +--
 tools/libxc/xc_sr_restore.c      | 47 +++++++++++++--------------
 tools/libxc/xc_sr_save.c         | 54 +++++++++++++++----------------
 tools/libxl/libxl_dom_save.c     | 11 -------
 tools/libxl/libxl_internal.h     |  1 -
 tools/libxl/libxl_save_callout.c | 12 +++----
 tools/libxl/libxl_save_helper.c  | 61 ++++++++++++++++-------------------
 9 files changed, 122 insertions(+), 152 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index aa8cc8b..d1f97b9 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -22,16 +22,9 @@
 #ifndef XENGUEST_H
 #define XENGUEST_H
 
-#define XC_NUMA_NO_NODE   (~0U)
-
-#define XCFLAGS_LIVE      (1 << 0)
-#define XCFLAGS_DEBUG     (1 << 1)
-#define XCFLAGS_HVM       (1 << 2)
-#define XCFLAGS_STDVGA    (1 << 3)
-#define XCFLAGS_CHECKPOINT_COMPRESS    (1 << 4)
+#include <stdbool.h>
 
-#define X86_64_B_SIZE   64 
-#define X86_32_B_SIZE   32
+#define XC_NUMA_NO_NODE   (~0U)
 
 /*
  * User not using xc_suspend_* / xc_await_suspent may not want to
@@ -90,20 +83,26 @@ typedef enum {
     XC_MIG_STREAM_COLO,
 } xc_migration_stream_t;
 
+struct domain_save_params {
+    uint32_t dom;       /* the id of the domain */
+    int save_fd;        /* the fd to save the domain to */
+    int recv_fd;        /* the fd to receive live protocol responses */
+    uint32_t max_iters; /* how many precopy iterations before we give up? */
+    bool live;          /* is this a live migration? */
+    bool debug;         /* are we in debug mode? */
+    xc_migration_stream_t stream_type; /* is there checkpointing involved? */
+};
+
 /**
  * This function will save a running domain.
  *
  * @parm xch a handle to an open hypervisor interface
- * @parm fd the file descriptor to save a domain to
- * @parm dom the id of the domain
- * @param stream_type XC_MIG_STREAM_NONE if the far end of the stream
- *        doesn't use checkpointing
+ * @parm params a description of the requested save/migration
+ * @parm callbacks hooks for delegated steps of the save procedure
  * @return 0 on success, -1 on failure
  */
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
-                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
-                   struct save_callbacks* callbacks, int hvm,
-                   xc_migration_stream_t stream_type, int recv_fd);
+int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
+                   const struct save_callbacks *callbacks);
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
@@ -145,31 +144,32 @@ struct restore_callbacks {
     void* data;
 };
 
+struct domain_restore_params {
+    uint32_t dom;                 /* the id of the domain */
+    int recv_fd;                  /* the fd to restore the domain from */
+    int send_back_fd;             /* the fd to send live protocol responses */
+    unsigned int store_evtchn;    /* the store event channel */
+    xen_pfn_t *store_gfn;         /* OUT - the gfn of the store page */
+    domid_t store_domid;          /* the store domain id */
+    unsigned int console_evtchn;  /* the console event channel */
+    xen_pfn_t *console_gfn;       /* OUT - the gfn of the console page */
+    domid_t console_domid;        /* the console domain id */
+    xc_migration_stream_t stream_type; /* is there checkpointing involved? */
+};
+
 /**
  * This function will restore a saved domain.
  *
  * Domain is restored in a suspended state ready to be unpaused.
  *
  * @parm xch a handle to an open hypervisor interface
- * @parm fd the file descriptor to restore a domain from
- * @parm dom the id of the domain
- * @parm store_evtchn the store event channel for this domain to use
- * @parm store_mfn returned with the mfn of the store page
- * @parm hvm non-zero if this is a HVM restore
- * @parm pae non-zero if this HVM domain has PAE support enabled
- * @parm superpages non-zero to allocate guest memory with superpages
- * @parm stream_type non-zero if the far end of the stream is using checkpointing
- * @parm callbacks non-NULL to receive a callback to restore toolstack
- *       specific data
+ * @parm params a description of the requested restore operation
+ * @parm callbacks hooks for delegated steps of the restore procedure
  * @return 0 on success, -1 on failure
  */
-int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
-                      unsigned int store_evtchn, unsigned long *store_mfn,
-                      domid_t store_domid, unsigned int console_evtchn,
-                      unsigned long *console_mfn, domid_t console_domid,
-                      unsigned int hvm, unsigned int pae, int superpages,
-                      xc_migration_stream_t stream_type,
-                      struct restore_callbacks *callbacks, int send_back_fd);
+int xc_domain_restore(xc_interface *xch,
+                      const struct domain_restore_params *params,
+                      const struct restore_callbacks *callbacks);
 
 /**
  * This function will create a domain for a paravirtualized Linux
diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
index 15c838f..50cd318 100644
--- a/tools/libxc/xc_nomigrate.c
+++ b/tools/libxc/xc_nomigrate.c
@@ -20,22 +20,16 @@
 #include <xenctrl.h>
 #include <xenguest.h>
 
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
-                   uint32_t max_factor, uint32_t flags,
-                   struct save_callbacks* callbacks, int hvm,
-                   xc_migration_stream_t stream_type, int recv_fd)
+int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
+                   const struct save_callbacks *callbacks)
 {
     errno = ENOSYS;
     return -1;
 }
 
-int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
-                      unsigned int store_evtchn, unsigned long *store_mfn,
-                      domid_t store_domid, unsigned int console_evtchn,
-                      unsigned long *console_mfn, domid_t console_domid,
-                      unsigned int hvm, unsigned int pae, int superpages,
-                      xc_migration_stream_t stream_type,
-                      struct restore_callbacks *callbacks, int send_back_fd)
+int xc_domain_restore(xc_interface *xch,
+                      const struct domain_restore_params *params,
+                      const struct restore_callbacks *callbacks)
 {
     errno = ENOSYS;
     return -1;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index b1aa88e..f192654 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -187,7 +187,7 @@ struct xc_sr_context
             int recv_fd;
 
             struct xc_sr_save_ops ops;
-            struct save_callbacks *callbacks;
+            const struct save_callbacks *callbacks;
 
             /* Live migrate vs non live suspend. */
             bool live;
@@ -214,7 +214,7 @@ struct xc_sr_context
         struct /* Restore data. */
         {
             struct xc_sr_restore_ops ops;
-            struct restore_callbacks *callbacks;
+            const struct restore_callbacks *callbacks;
 
             int send_back_fd;
             unsigned long p2m_size;
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 00fad7d..51532aa 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -817,32 +817,28 @@ static int restore(struct xc_sr_context *ctx)
     return rc;
 }
 
-int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
-                      unsigned int store_evtchn, unsigned long *store_mfn,
-                      domid_t store_domid, unsigned int console_evtchn,
-                      unsigned long *console_gfn, domid_t console_domid,
-                      unsigned int hvm, unsigned int pae, int superpages,
-                      xc_migration_stream_t stream_type,
-                      struct restore_callbacks *callbacks, int send_back_fd)
+int xc_domain_restore(xc_interface *xch,
+                      const struct domain_restore_params *params,
+                      const struct restore_callbacks *callbacks)
 {
     xen_pfn_t nr_pfns;
     struct xc_sr_context ctx =
         {
             .xch = xch,
-            .fd = io_fd,
+            .fd = params->recv_fd,
         };
 
     /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions. */
-    ctx.restore.console_evtchn = console_evtchn;
-    ctx.restore.console_domid = console_domid;
-    ctx.restore.xenstore_evtchn = store_evtchn;
-    ctx.restore.xenstore_domid = store_domid;
-    ctx.restore.checkpointed = stream_type;
+    ctx.restore.console_evtchn = params->console_evtchn;
+    ctx.restore.console_domid = params->console_domid;
+    ctx.restore.xenstore_evtchn = params->store_evtchn;
+    ctx.restore.xenstore_domid = params->store_domid;
+    ctx.restore.checkpointed = params->stream_type;
     ctx.restore.callbacks = callbacks;
-    ctx.restore.send_back_fd = send_back_fd;
+    ctx.restore.send_back_fd = params->send_back_fd;
 
     /* Sanity checks for callbacks. */
-    if ( stream_type )
+    if ( params->stream_type )
         assert(callbacks->checkpoint);
 
     if ( ctx.restore.checkpointed == XC_MIG_STREAM_COLO )
@@ -854,28 +850,27 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                callbacks->restore_results);
     }
 
-    DPRINTF("fd %d, dom %u, hvm %u, pae %u, superpages %d"
-            ", stream_type %d", io_fd, dom, hvm, pae,
-            superpages, stream_type);
-
-    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    if ( xc_domain_getinfo(xch, params->dom, 1, &ctx.dominfo) != 1 )
     {
         PERROR("Failed to get domain info");
         return -1;
     }
 
-    if ( ctx.dominfo.domid != dom )
+    if ( ctx.dominfo.domid != params->dom )
     {
-        ERROR("Domain %u does not exist", dom);
+        ERROR("Domain %u does not exist", params->dom);
         return -1;
     }
 
-    ctx.domid = dom;
+    DPRINTF("fd %d, dom %u, hvm %d, stream_type %d", params->recv_fd,
+            params->dom, ctx.dominfo.hvm, params->stream_type);
+
+    ctx.domid = params->dom;
 
     if ( read_headers(&ctx) )
         return -1;
 
-    if ( xc_domain_nr_gpfns(xch, dom, &nr_pfns) < 0 )
+    if ( xc_domain_nr_gpfns(xch, ctx.domid, &nr_pfns) < 0 )
     {
         PERROR("Unable to obtain the guest p2m size");
         return -1;
@@ -906,8 +901,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
             ctx.restore.console_domid,
             ctx.restore.console_evtchn);
 
-    *console_gfn = ctx.restore.console_gfn;
-    *store_mfn = ctx.restore.xenstore_gfn;
+    *params->console_gfn = ctx.restore.console_gfn;
+    *params->store_gfn = ctx.restore.xenstore_gfn;
 
     return 0;
 }
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index b1a24b7..0ab86c3 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -915,28 +915,26 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
     return rc;
 };
 
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
-                   uint32_t max_iters, uint32_t max_factor, uint32_t flags,
-                   struct save_callbacks* callbacks, int hvm,
-                   xc_migration_stream_t stream_type, int recv_fd)
+int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
+                   const struct save_callbacks* callbacks)
 {
     struct xc_sr_context ctx =
         {
             .xch = xch,
-            .fd = io_fd,
+            .fd = params->save_fd,
         };
 
     /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions. */
     ctx.save.callbacks = callbacks;
-    ctx.save.live  = !!(flags & XCFLAGS_LIVE);
-    ctx.save.debug = !!(flags & XCFLAGS_DEBUG);
-    ctx.save.checkpointed = stream_type;
-    ctx.save.recv_fd = recv_fd;
+    ctx.save.live  = params->live;
+    ctx.save.debug = params->debug;
+    ctx.save.checkpointed = params->stream_type;
+    ctx.save.recv_fd = params->recv_fd;
 
     /* If altering migration_stream update this assert too. */
-    assert(stream_type == XC_MIG_STREAM_NONE ||
-           stream_type == XC_MIG_STREAM_REMUS ||
-           stream_type == XC_MIG_STREAM_COLO);
+    assert(params->stream_type == XC_MIG_STREAM_NONE ||
+           params->stream_type == XC_MIG_STREAM_REMUS ||
+           params->stream_type == XC_MIG_STREAM_COLO);
 
     /*
      * TODO: Find some time to better tweak the live migration algorithm.
@@ -947,30 +945,32 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
     ctx.save.max_iterations = 5;
     ctx.save.dirty_threshold = 50;
 
-    /* Sanity checks for callbacks. */
-    if ( hvm )
-        assert(callbacks->switch_qemu_logdirty);
-    if ( ctx.save.checkpointed )
-        assert(callbacks->checkpoint && callbacks->aftercopy);
-    if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
-        assert(callbacks->wait_checkpoint);
-
-    DPRINTF("fd %d, dom %u, max_iters %u, max_factor %u, flags %u, hvm %d",
-            io_fd, dom, max_iters, max_factor, flags, hvm);
-
-    if ( xc_domain_getinfo(xch, dom, 1, &ctx.dominfo) != 1 )
+    if ( xc_domain_getinfo(xch, params->dom, 1, &ctx.dominfo) != 1 )
     {
         PERROR("Failed to get domain info");
         return -1;
     }
 
-    if ( ctx.dominfo.domid != dom )
+    if ( ctx.dominfo.domid != params->dom )
     {
-        ERROR("Domain %u does not exist", dom);
+        ERROR("Domain %u does not exist", params->dom);
         return -1;
     }
 
-    ctx.domid = dom;
+    /* Sanity checks for callbacks. */
+    if ( ctx.dominfo.hvm )
+        assert(callbacks->switch_qemu_logdirty);
+    if ( ctx.save.checkpointed )
+        assert(callbacks->checkpoint && callbacks->aftercopy);
+    if ( ctx.save.checkpointed == XC_MIG_STREAM_COLO )
+        assert(callbacks->wait_checkpoint);
+
+    ctx.domid = params->dom;
+
+    DPRINTF("fd %d, dom %u, max_iterations %u, dirty_threshold %u, live %d, "
+            "debug %d, type %d, hvm %d", ctx.fd, ctx.domid,
+            ctx.save.max_iterations, ctx.save.dirty_threshold, ctx.save.live,
+            ctx.save.debug, ctx.save.checkpointed, ctx.dominfo.hvm);
 
     if ( ctx.dominfo.hvm )
     {
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 77fe30e..c27813a 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -338,8 +338,6 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
     /* Convenience aliases */
     const uint32_t domid = dss->domid;
     const libxl_domain_type type = dss->type;
-    const int live = dss->live;
-    const int debug = dss->debug;
     const libxl_domain_remus_info *const r_info = dss->remus;
     libxl__srm_save_autogen_callbacks *const callbacks =
         &dss->sws.shs.callbacks.save.a;
@@ -374,10 +372,6 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
         abort();
     }
 
-    dss->xcflags = (live ? XCFLAGS_LIVE : 0)
-          | (debug ? XCFLAGS_DEBUG : 0)
-          | (dss->hvm ? XCFLAGS_HVM : 0);
-
     /* Disallow saving a guest with vNUMA configured because migration
      * stream does not preserve node information.
      *
@@ -393,11 +387,6 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
         goto out;
     }
 
-    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_REMUS) {
-        if (libxl_defbool_val(r_info->compression))
-            dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS;
-    }
-
     if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
         callbacks->suspend = libxl__domain_suspend_callback;
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index afe6652..89de86b 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3306,7 +3306,6 @@ struct libxl__domain_save_state {
     /* private */
     int rc;
     int hvm;
-    int xcflags;
     libxl__domain_suspend_state dsps;
     union {
         /* for Remus */
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 46b892c..0852bcf 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -59,10 +59,11 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs,
     const unsigned long argnums[] = {
         domid,
         state->store_port,
-        state->store_domid, state->console_port,
+        state->store_domid,
+        state->console_port,
         state->console_domid,
-        hvm, pae, superpages,
-        cbflags, dcs->restore_params.checkpointed_stream,
+        dcs->restore_params.checkpointed_stream,
+        cbflags
     };
 
     shs->ao = ao;
@@ -76,7 +77,7 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs,
     shs->caller_state = dcs;
     shs->need_results = 1;
 
-    run_helper(egc, shs, "--restore-domain", restore_fd, send_back_fd, 0, 0,
+    run_helper(egc, shs, "--restore-domain", restore_fd, send_back_fd, NULL, 0,
                argnums, ARRAY_SIZE(argnums));
 }
 
@@ -89,8 +90,7 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_save_state *dss,
         libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
 
     const unsigned long argnums[] = {
-        dss->domid, 0, 0, dss->xcflags, dss->hvm,
-        cbflags, dss->checkpointed_stream,
+        dss->domid, 0, dss->live, dss->debug, dss->checkpointed_stream, cbflags
     };
 
     shs->ao = ao;
diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
index d3def6b..887b6a2 100644
--- a/tools/libxl/libxl_save_helper.c
+++ b/tools/libxl/libxl_save_helper.c
@@ -239,7 +239,6 @@ static struct restore_callbacks helper_restore_callbacks;
 int main(int argc, char **argv)
 {
     int r;
-    int send_back_fd, recv_fd;
 
 #define NEXTARG (++argv, assert(*argv), *argv)
 
@@ -248,15 +247,15 @@ int main(int argc, char **argv)
 
     if (!strcmp(mode,"--save-domain")) {
 
-        io_fd =                             atoi(NEXTARG);
-        recv_fd =                           atoi(NEXTARG);
-        uint32_t dom =                      strtoul(NEXTARG,0,10);
-        uint32_t max_iters =                strtoul(NEXTARG,0,10);
-        uint32_t max_factor =               strtoul(NEXTARG,0,10);
-        uint32_t flags =                    strtoul(NEXTARG,0,10);
-        int hvm =                           atoi(NEXTARG);
-        unsigned cbflags =                  strtoul(NEXTARG,0,10);
-        xc_migration_stream_t stream_type = strtoul(NEXTARG,0,10);
+        struct domain_save_params params;
+        params.save_fd = io_fd = atoi(NEXTARG);
+        params.recv_fd =         atoi(NEXTARG);
+        params.dom =             strtoul(NEXTARG,0,10);
+        params.max_iters =       strtoul(NEXTARG,0,10);
+        params.live =            strtoul(NEXTARG,0,10);
+        params.debug =           strtoul(NEXTARG,0,10);
+        params.stream_type =     strtoul(NEXTARG,0,10);
+        unsigned cbflags =       strtoul(NEXTARG,0,10);
         assert(!*++argv);
 
         helper_setcallbacks_save(&helper_save_callbacks, cbflags);
@@ -264,41 +263,35 @@ int main(int argc, char **argv)
         startup("save");
         setup_signals(save_signal_handler);
 
-        r = xc_domain_save(xch, io_fd, dom, max_iters, max_factor, flags,
-                           &helper_save_callbacks, hvm, stream_type,
-                           recv_fd);
+        r = xc_domain_save(xch, &params, &helper_save_callbacks);
         complete(r);
 
     } else if (!strcmp(mode,"--restore-domain")) {
 
-        io_fd =                             atoi(NEXTARG);
-        send_back_fd =                      atoi(NEXTARG);
-        uint32_t dom =                      strtoul(NEXTARG,0,10);
-        unsigned store_evtchn =             strtoul(NEXTARG,0,10);
-        domid_t store_domid =               strtoul(NEXTARG,0,10);
-        unsigned console_evtchn =           strtoul(NEXTARG,0,10);
-        domid_t console_domid =             strtoul(NEXTARG,0,10);
-        unsigned int hvm =                  strtoul(NEXTARG,0,10);
-        unsigned int pae =                  strtoul(NEXTARG,0,10);
-        int superpages =                    strtoul(NEXTARG,0,10);
-        unsigned cbflags =                  strtoul(NEXTARG,0,10);
-        xc_migration_stream_t stream_type = strtoul(NEXTARG,0,10);
+        xen_pfn_t store_gfn = 0;
+        xen_pfn_t console_gfn = 0;
+
+        struct domain_restore_params params;
+        params.recv_fd = io_fd = atoi(NEXTARG);
+        params.send_back_fd =    atoi(NEXTARG);
+        params.dom =             strtoul(NEXTARG,0,10);
+        params.store_evtchn =    strtoul(NEXTARG,0,10);
+        params.store_gfn =       &store_gfn;
+        params.store_domid =     strtoul(NEXTARG,0,10);
+        params.console_evtchn =  strtoul(NEXTARG,0,10);
+        params.console_gfn =     &console_gfn;
+        params.console_domid =   strtoul(NEXTARG,0,10);
+        params.stream_type =     strtoul(NEXTARG,0,10);
+        unsigned cbflags =       strtoul(NEXTARG,0,10);
         assert(!*++argv);
 
         helper_setcallbacks_restore(&helper_restore_callbacks, cbflags);
 
-        unsigned long store_mfn = 0;
-        unsigned long console_mfn = 0;
-
         startup("restore");
         setup_signals(SIG_DFL);
 
-        r = xc_domain_restore(xch, io_fd, dom, store_evtchn, &store_mfn,
-                              store_domid, console_evtchn, &console_mfn,
-                              console_domid, hvm, pae, superpages,
-                              stream_type,
-                              &helper_restore_callbacks, send_back_fd);
-        helper_stub_restore_results(store_mfn,console_mfn,0);
+        r = xc_domain_restore(xch, &params, &helper_restore_callbacks);
+        helper_stub_restore_results(store_gfn, console_gfn, 0);
         complete(r);
 
     } else {
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC v2 08/23] libxc/migration: defer precopy policy to a callback
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (6 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 07/23] libxc/migration: tidy the xc_domain_save()/restore() interface Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 09/23] libxl/migration: wire up the precopy policy RPC callback Joshua Otto
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

The precopy phase of the xc_domain_save() live migration algorithm has
historically been implemented to run until either a) (almost) no pages
are dirty or b) some fixed, hard-coded maximum number of precopy
iterations has been exceeded.  This policy and its implementation are
less than ideal for a few reasons:
- the logic of the policy is intertwined with the control flow of the
  mechanism of the precopy stage
- it can't take into account facts external to the immediate
  migration context, such as interactive user input or the passage of
  wall-clock time

To permit users to implement arbitrary higher-level policies governing
when the live migration precopy phase should end, and what should be
done next:
- add a precopy_policy() callback to the xc_domain_save() user-supplied
  callbacks
- during the precopy phase of live migrations, consult this policy after
  each batch of pages transmitted and take the dictated action, which
  may be to a) abort the migration entirely, b) continue with the
  precopy, or c) proceed to the stop-and-copy phase.

For now a simple callback implementing the old policy is hard-coded in
place (to be replaced in a subsequent patch).

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h   |  20 +++-
 tools/libxc/xc_sr_common.h       |  12 ++-
 tools/libxc/xc_sr_save.c         | 193 ++++++++++++++++++++++++++++-----------
 tools/libxl/libxl_save_callout.c |   2 +-
 tools/libxl/libxl_save_helper.c  |   1 -
 5 files changed, 170 insertions(+), 58 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index d1f97b9..215abd0 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -32,6 +32,14 @@
  */
 struct xenevtchn_handle;
 
+/* For save's precopy_policy(). */
+struct precopy_stats
+{
+    unsigned int iteration;
+    unsigned int total_written;
+    int dirty_count; /* -1 if unknown */
+};
+
 /* callbacks provided by xc_domain_save */
 struct save_callbacks {
     /* Called after expiration of checkpoint interval,
@@ -39,6 +47,17 @@ struct save_callbacks {
      */
     int (*suspend)(void* data);
 
+    /* Called after every batch of page data sent during the precopy phase of a
+     * live migration to ask the caller what to do next based on the current
+     * state of the precopy migration.
+     */
+#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
+                                        * tidy up. */
+#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
+#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
+                                        * remaining dirty pages. */
+    int (*precopy_policy)(struct precopy_stats stats, void *data);
+
     /* Called after the guest's dirty pages have been
      *  copied into an output buffer.
      * Callback function resumes the guest & the device model,
@@ -87,7 +106,6 @@ struct domain_save_params {
     uint32_t dom;       /* the id of the domain */
     int save_fd;        /* the fd to save the domain to */
     int recv_fd;        /* the fd to receive live protocol responses */
-    uint32_t max_iters; /* how many precopy iterations before we give up? */
     bool live;          /* is this a live migration? */
     bool debug;         /* are we in debug mode? */
     xc_migration_stream_t stream_type; /* is there checkpointing involved? */
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index f192654..0da0ffc 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -198,12 +198,16 @@ struct xc_sr_context
             /* Further debugging information in the stream. */
             bool debug;
 
-            /* Parameters for tweaking live migration. */
-            unsigned max_iterations;
-            unsigned dirty_threshold;
-
             unsigned long p2m_size;
 
+            enum {
+                XC_SAVE_PHASE_PRECOPY,
+                XC_SAVE_PHASE_STOP_AND_COPY
+            } phase;
+
+            struct precopy_stats stats;
+            int policy_decision;
+
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 0ab86c3..55b77ff 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -277,13 +277,29 @@ static int write_batch(struct xc_sr_context *ctx)
 }
 
 /*
+ * Test if the batch is full.
+ */
+static bool batch_full(const struct xc_sr_context *ctx)
+{
+    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
+}
+
+/*
+ * Test if the batch is empty.
+ */
+static bool batch_empty(struct xc_sr_context *ctx)
+{
+    return ctx->save.nr_batch_pfns == 0;
+}
+
+/*
  * Flush a batch of pfns into the stream.
  */
 static int flush_batch(struct xc_sr_context *ctx)
 {
     int rc = 0;
 
-    if ( ctx->save.nr_batch_pfns == 0 )
+    if ( batch_empty(ctx) )
         return rc;
 
     rc = write_batch(ctx);
@@ -299,19 +315,12 @@ static int flush_batch(struct xc_sr_context *ctx)
 }
 
 /*
- * Add a single pfn to the batch, flushing the batch if full.
+ * Add a single pfn to the batch.
  */
-static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
+static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    int rc = 0;
-
-    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
-        rc = flush_batch(ctx);
-
-    if ( rc == 0 )
-        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
-
-    return rc;
+    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
+    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 }
 
 /*
@@ -358,43 +367,80 @@ static int suspend_domain(struct xc_sr_context *ctx)
  * Send a subset of pages in the guests p2m, according to the dirty bitmap.
  * Used for each subsequent iteration of the live migration loop.
  *
+ * During the precopy stage of a live migration, test the user-supplied
+ * policy function after each batch of pages and cut off the operation
+ * early if indicated (the dirty pages remaining in this round are transferred
+ * into the deferred_pages bitmap).  This function writes observed precopy
+ * policy decisions to ctx->save.policy_decision; callers must check this upon
+ * return.
+ *
  * Bitmap is bounded by p2m_size.
  */
 static int send_dirty_pages(struct xc_sr_context *ctx,
                             unsigned long entries)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t p;
-    unsigned long written;
+    xen_pfn_t p = 0;
+    unsigned long written = 0;
     int rc;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
-    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
+    int (*precopy_policy)(struct precopy_stats, void *) =
+        ctx->save.callbacks->precopy_policy;
+    void *data = ctx->save.callbacks->data;
+
+    assert(batch_empty(ctx));
+    while ( p < ctx->save.p2m_size )
     {
-        if ( !test_bit(p, dirty_bitmap) )
-            continue;
+        if ( ctx->save.phase == XC_SAVE_PHASE_PRECOPY )
+        {
+            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
+
+            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
+            {
+                IPRINTF("Precopy policy has requested we abort, cleaning up");
+                return -1;
+            }
+            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
+            {
+                /*
+                 * Any outstanding dirty pages are now deferred until the next
+                 * phase of the migration.
+                 */
+                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
+                          ctx->save.p2m_size);
+                if ( entries > written )
+                    ctx->save.nr_deferred_pages += entries - written;
+
+                goto done;
+            }
+        }
+
+        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
+        {
+            if ( test_and_clear_bit(p, dirty_bitmap) )
+            {
+                add_to_batch(ctx, p);
+                ++written;
+                ++ctx->save.stats.total_written;
+            }
+        }
 
-        rc = add_to_batch(ctx, p);
+        rc = flush_batch(ctx);
         if ( rc )
             return rc;
 
-        /* Update progress every 4MB worth of memory sent. */
-        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
-            xc_report_progress_step(xch, written, entries);
-
-        ++written;
+        /* Update progress after every batch (4MB) worth of memory sent. */
+        xc_report_progress_step(xch, written, entries);
     }
 
-    rc = flush_batch(ctx);
-    if ( rc )
-        return rc;
-
     if ( written > entries )
         DPRINTF("Bitmap contained more entries than expected...");
 
     xc_report_progress_step(xch, entries, entries);
 
+ done:
     return ctx->save.ops.check_vm_state(ctx);
 }
 
@@ -452,8 +498,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
     xc_interface *xch = ctx->xch;
     char *new_str = NULL;
 
-    if ( asprintf(&new_str, "Frames iteration %u of %u",
-                  iter, ctx->save.max_iterations) == -1 )
+    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
     {
         PERROR("Unable to allocate new progress string");
         return -1;
@@ -474,20 +519,34 @@ static int send_memory_live(struct xc_sr_context *ctx)
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
     char *progress_str = NULL;
-    unsigned x;
     int rc;
 
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    int (*precopy_policy)(struct precopy_stats, void *) =
+        ctx->save.callbacks->precopy_policy;
+    void *data = ctx->save.callbacks->data;
+
     rc = update_progress_string(ctx, &progress_str, 0);
     if ( rc )
         goto out;
 
+    ctx->save.stats = (struct precopy_stats)
+        {
+            .iteration     = 0,
+            .total_written = 0,
+            .dirty_count   = -1
+        };
+
+    /* This has the side-effect of priming ctx->save.policy_decision. */
     rc = send_all_pages(ctx);
     if ( rc )
         goto out;
 
-    for ( x = 1;
-          ((x < ctx->save.max_iterations) &&
-           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
+    for ( ctx->save.stats.iteration = 1;
+          ctx->save.policy_decision == XGS_POLICY_CONTINUE_PRECOPY;
+          ++ctx->save.stats.iteration )
     {
         if ( xc_shadow_control(
                  xch, ctx->domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
@@ -499,10 +558,32 @@ static int send_memory_live(struct xc_sr_context *ctx)
             goto out;
         }
 
-        if ( stats.dirty_count == 0 )
-            break;
+        /* Check the new dirty_count against the policy. */
+        ctx->save.stats.dirty_count = stats.dirty_count;
+        ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
+        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
+        {
+            IPRINTF("Precopy policy has requested we abort, cleaning up");
+            rc = -1;
+            goto out;
+        }
+        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
+        {
+            bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
+                      ctx->save.p2m_size);
+            ctx->save.nr_deferred_pages += stats.dirty_count;
+            rc = 0;
+            goto out;
+        }
 
-        rc = update_progress_string(ctx, &progress_str, x);
+        /*
+         * After this point we won't know how many pages are really dirty until
+         * the next iteration.
+         */
+        ctx->save.stats.dirty_count = -1;
+
+        rc = update_progress_string(ctx, &progress_str,
+                                    ctx->save.stats.iteration);
         if ( rc )
             goto out;
 
@@ -583,6 +664,8 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
+    ctx->save.phase = XC_SAVE_PHASE_STOP_AND_COPY;
+
     rc = suspend_domain(ctx);
     if ( rc )
         goto out;
@@ -601,7 +684,7 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
     if ( ctx->save.live )
     {
         rc = update_progress_string(ctx, &progress_str,
-                                    ctx->save.max_iterations);
+                                    ctx->save.stats.iteration);
         if ( rc )
             goto out;
     }
@@ -740,6 +823,9 @@ static int setup(struct xc_sr_context *ctx)
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
+    ctx->save.phase = ctx->save.live ? XC_SAVE_PHASE_PRECOPY
+                                     : XC_SAVE_PHASE_STOP_AND_COPY;
+
     rc = ctx->save.ops.setup(ctx);
     if ( rc )
         goto err;
@@ -915,6 +1001,17 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
     return rc;
 };
 
+static int simple_precopy_policy(struct precopy_stats stats, void *user)
+{
+    if (stats.dirty_count >= 0 && stats.dirty_count < 50)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    if (stats.iteration >= 5)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    return XGS_POLICY_CONTINUE_PRECOPY;
+}
+
 int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
                    const struct save_callbacks* callbacks)
 {
@@ -924,8 +1021,12 @@ int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
             .fd = params->save_fd,
         };
 
+    /* XXX use this to shim our precopy_policy in before moving it to libxl */
+    struct save_callbacks overridden_callbacks = *callbacks;
+    overridden_callbacks.precopy_policy = simple_precopy_policy;
+
     /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions. */
-    ctx.save.callbacks = callbacks;
+    ctx.save.callbacks = &overridden_callbacks;
     ctx.save.live  = params->live;
     ctx.save.debug = params->debug;
     ctx.save.checkpointed = params->stream_type;
@@ -936,15 +1037,6 @@ int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
            params->stream_type == XC_MIG_STREAM_REMUS ||
            params->stream_type == XC_MIG_STREAM_COLO);
 
-    /*
-     * TODO: Find some time to better tweak the live migration algorithm.
-     *
-     * These parameters are better than the legacy algorithm especially for
-     * busy guests.
-     */
-    ctx.save.max_iterations = 5;
-    ctx.save.dirty_threshold = 50;
-
     if ( xc_domain_getinfo(xch, params->dom, 1, &ctx.dominfo) != 1 )
     {
         PERROR("Failed to get domain info");
@@ -967,10 +1059,9 @@ int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
 
     ctx.domid = params->dom;
 
-    DPRINTF("fd %d, dom %u, max_iterations %u, dirty_threshold %u, live %d, "
-            "debug %d, type %d, hvm %d", ctx.fd, ctx.domid,
-            ctx.save.max_iterations, ctx.save.dirty_threshold, ctx.save.live,
-            ctx.save.debug, ctx.save.checkpointed, ctx.dominfo.hvm);
+    DPRINTF("fd %d, dom %u, live %d, debug %d, type %d, hvm %d", ctx.fd,
+            ctx.domid, ctx.save.live, ctx.save.debug, ctx.save.checkpointed,
+            ctx.dominfo.hvm);
 
     if ( ctx.dominfo.hvm )
     {
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 0852bcf..4094c0f 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -90,7 +90,7 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_save_state *dss,
         libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
 
     const unsigned long argnums[] = {
-        dss->domid, 0, dss->live, dss->debug, dss->checkpointed_stream, cbflags
+        dss->domid, dss->live, dss->debug, dss->checkpointed_stream, cbflags
     };
 
     shs->ao = ao;
diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
index 887b6a2..63c8e15 100644
--- a/tools/libxl/libxl_save_helper.c
+++ b/tools/libxl/libxl_save_helper.c
@@ -251,7 +251,6 @@ int main(int argc, char **argv)
         params.save_fd = io_fd = atoi(NEXTARG);
         params.recv_fd =         atoi(NEXTARG);
         params.dom =             strtoul(NEXTARG,0,10);
-        params.max_iters =       strtoul(NEXTARG,0,10);
         params.live =            strtoul(NEXTARG,0,10);
         params.debug =           strtoul(NEXTARG,0,10);
         params.stream_type =     strtoul(NEXTARG,0,10);
-- 
2.7.4



* [PATCH RFC v2 09/23] libxl/migration: wire up the precopy policy RPC callback
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (7 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 08/23] libxc/migration: defer precopy policy to a callback Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 10/23] libxc/xc_sr_save: introduce save batch types Joshua Otto
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Permit libxl to implement the xc_domain_save() precopy_policy callback
by adding it to the RPC generation machinery and implementing a policy
in libxl with the same semantics as the old one.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_save.c           | 17 +----------------
 tools/libxl/libxl_dom_save.c       | 23 +++++++++++++++++++++++
 tools/libxl/libxl_save_msgs_gen.pl |  4 +++-
 3 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 55b77ff..48d403b 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -1001,17 +1001,6 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
     return rc;
 };
 
-static int simple_precopy_policy(struct precopy_stats stats, void *user)
-{
-    if (stats.dirty_count >= 0 && stats.dirty_count < 50)
-        return XGS_POLICY_STOP_AND_COPY;
-
-    if (stats.iteration >= 5)
-        return XGS_POLICY_STOP_AND_COPY;
-
-    return XGS_POLICY_CONTINUE_PRECOPY;
-}
-
 int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
                    const struct save_callbacks* callbacks)
 {
@@ -1021,12 +1010,8 @@ int xc_domain_save(xc_interface *xch, const struct domain_save_params *params,
             .fd = params->save_fd,
         };
 
-    /* XXX use this to shim our precopy_policy in before moving it to libxl */
-    struct save_callbacks overridden_callbacks = *callbacks;
-    overridden_callbacks.precopy_policy = simple_precopy_policy;
-
     /* GCC 4.4 (of CentOS 6.x vintage) can' t initialise anonymous unions. */
-    ctx.save.callbacks = &overridden_callbacks;
+    ctx.save.callbacks = callbacks;
     ctx.save.live  = params->live;
     ctx.save.debug = params->debug;
     ctx.save.checkpointed = params->stream_type;
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index c27813a..b65135d 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -328,6 +328,28 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
     return rc;
 }
 
+/*
+ * This is the live migration precopy policy - it's called periodically during
+ * the precopy phase of live migrations, and is responsible for deciding when
+ * the precopy phase should terminate and what should be done next.
+ *
+ * The policy implemented here behaves identically to the policy previously
+ * hard-coded into xc_domain_save() - it proceeds to the stop-and-copy phase of
+ * the live migration when there are either fewer than 50 dirty pages, or more
+ * than 5 precopy rounds have completed.
+ */
+static int libxl__save_live_migration_precopy_policy(
+    struct precopy_stats stats, void *user)
+{
+    if (stats.dirty_count >= 0 && stats.dirty_count < 50)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    if (stats.iteration >= 5)
+        return XGS_POLICY_STOP_AND_COPY;
+
+    return XGS_POLICY_CONTINUE_PRECOPY;
+}
+
 /*----- main code for saving, in order of execution -----*/
 
 void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
@@ -390,6 +412,7 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
     if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
         callbacks->suspend = libxl__domain_suspend_callback;
 
+    callbacks->precopy_policy = libxl__save_live_migration_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
 
     dss->sws.ao  = dss->ao;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 27845bb..50c97b4 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -33,6 +33,7 @@ our @msgs = (
                                               'xen_pfn_t', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
 );
 
 #----------------------------------------
@@ -141,7 +142,8 @@ static void bytes_put(unsigned char *const buf, int *len,
 
 END
 
-foreach my $simpletype (qw(int uint16_t uint32_t unsigned), 'unsigned long', 'xen_pfn_t') {
+foreach my $simpletype (qw(int uint16_t uint32_t unsigned),
+                        'unsigned long', 'xen_pfn_t', 'struct precopy_stats') {
     my $typeid = typeid($simpletype);
     $out_body{'callout'} .= <<END;
 static int ${typeid}_get(const unsigned char **msg,
-- 
2.7.4




* [PATCH RFC v2 10/23] libxc/xc_sr_save: introduce save batch types
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (8 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 09/23] libxl/migration: wire up the precopy policy RPC callback Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 11/23] libxc/migration: correct hvm record ordering specification Joshua Otto
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

To write guest pages into the stream, the save logic builds up batches
of pfns to be written and performs all of the work necessary to write
them whenever a full batch has been accumulated.  Writing a PAGE_DATA
batch entails determining the types of all pfns in the batch, mapping
the subset of pfns that are backed by real memory, constructing a
PAGE_DATA record describing the batch, and writing everything into the
stream.

Postcopy live migration introduces several new types of batches.  To
enable the postcopy logic to re-use the bulk of the code used to manage
and write PAGE_DATA records, introduce a batch_type member to the save
context (which for now can take on only a single value), and refactor
write_batch() to take the batch_type into account when preparing and
writing each record.

While refactoring write_batch(), factor the operation of querying the
page types of a batch into a subroutine that is usable independently of
write_batch().

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_common.h    |   3 +
 tools/libxc/xc_sr_save.c      | 207 +++++++++++++++++++++++++++---------------
 tools/libxc/xg_save_restore.h |   2 +-
 3 files changed, 140 insertions(+), 72 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 0da0ffc..fc82e71 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -208,6 +208,9 @@ struct xc_sr_context
             struct precopy_stats stats;
             int policy_decision;
 
+            enum {
+                XC_SR_SAVE_BATCH_PRECOPY_PAGE
+            } batch_type;
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 48d403b..9f077a3 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -3,6 +3,23 @@
 
 #include "xc_sr_common.h"
 
+#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE
+
+static const unsigned int batch_sizes[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE
+};
+
+static const bool batch_includes_contents[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true
+};
+
+static const uint32_t batch_rec_types[] =
+{
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA
+};
+
 /*
  * Writes an Image header and Domain header into the stream.
  */
@@ -67,19 +84,54 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 }
 
 /*
+ * This function:
+ * - maps each pfn in the current batch to its gfn
+ * - gets the type of each pfn in the batch.
+ */
+static int get_batch_info(struct xc_sr_context *ctx, xen_pfn_t *gfns,
+                          xen_pfn_t *types)
+{
+    int rc;
+    unsigned int nr_pfns = ctx->save.nr_batch_pfns;
+    xc_interface *xch = ctx->xch;
+    unsigned int i;
+
+    for ( i = 0; i < nr_pfns; ++i )
+        types[i] = gfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
+                                                      ctx->save.batch_pfns[i]);
+
+    /*
+     * The type query domctl accepts batches of at most 1024 pfns, so we need to
+     * break our batch here into appropriately-sized sub-batches.
+     */
+    for ( i = 0; i < nr_pfns; i += 1024 )
+    {
+        rc = xc_get_pfn_type_batch(xch, ctx->domid, min(1024U, nr_pfns - i),
+                                   &types[i]);
+        if ( rc )
+        {
+            PERROR("Failed to get types for pfn batch");
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+/*
  * Writes a batch of memory as a PAGE_DATA record into the stream.  The batch
  * is constructed in ctx->save.batch_pfns.
  *
  * This function:
- * - gets the types for each pfn in the batch.
  * - for each pfn with real data:
  *   - maps and attempts to localise the pages.
  * - construct and writes a PAGE_DATA record into the stream.
  */
-static int write_batch(struct xc_sr_context *ctx)
+static int write_batch(struct xc_sr_context *ctx, xen_pfn_t *gfns,
+                       xen_pfn_t *types)
 {
     xc_interface *xch = ctx->xch;
-    xen_pfn_t *gfns = NULL, *types = NULL;
+    xen_pfn_t *bgfns = NULL;
     void *guest_mapping = NULL;
     void **guest_data = NULL;
     void **local_pages = NULL;
@@ -90,17 +142,16 @@ static int write_batch(struct xc_sr_context *ctx)
     uint64_t *rec_pfns = NULL;
     struct iovec *iov = NULL; int iovcnt = 0;
     struct xc_sr_rec_pages_header hdr = { 0 };
+    bool send_page_contents = batch_includes_contents[ctx->save.batch_type];
     struct xc_sr_record rec =
     {
-        .type = REC_TYPE_PAGE_DATA,
+        .type = batch_rec_types[ctx->save.batch_type],
     };
 
     assert(nr_pfns != 0);
 
-    /* Mfns of the batch pfns. */
-    gfns = malloc(nr_pfns * sizeof(*gfns));
-    /* Types of the batch pfns. */
-    types = malloc(nr_pfns * sizeof(*types));
+    /* The subset of gfns that are physically-backed. */
+    bgfns = malloc(nr_pfns * sizeof(*bgfns));
     /* Errors from attempting to map the gfns. */
     errors = malloc(nr_pfns * sizeof(*errors));
     /* Pointers to page data to send.  Mapped gfns or local allocations. */
@@ -110,19 +161,16 @@ static int write_batch(struct xc_sr_context *ctx)
     /* iovec[] for writev(). */
     iov = malloc((nr_pfns + 4) * sizeof(*iov));
 
-    if ( !gfns || !types || !errors || !guest_data || !local_pages || !iov )
+    if ( !bgfns || !errors || !guest_data || !local_pages || !iov )
     {
         ERROR("Unable to allocate arrays for a batch of %u pages",
               nr_pfns);
         goto err;
     }
 
+    /* Mark likely-ballooned pages as deferred. */
     for ( i = 0; i < nr_pfns; ++i )
     {
-        types[i] = gfns[i] = ctx->save.ops.pfn_to_gfn(ctx,
-                                                      ctx->save.batch_pfns[i]);
-
-        /* Likely a ballooned page. */
         if ( gfns[i] == INVALID_MFN )
         {
             set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
@@ -130,39 +178,9 @@ static int write_batch(struct xc_sr_context *ctx)
         }
     }
 
-    rc = xc_get_pfn_type_batch(xch, ctx->domid, nr_pfns, types);
-    if ( rc )
-    {
-        PERROR("Failed to get types for pfn batch");
-        goto err;
-    }
-    rc = -1;
-
-    for ( i = 0; i < nr_pfns; ++i )
-    {
-        switch ( types[i] )
-        {
-        case XEN_DOMCTL_PFINFO_BROKEN:
-        case XEN_DOMCTL_PFINFO_XALLOC:
-        case XEN_DOMCTL_PFINFO_XTAB:
-            continue;
-        }
-
-        gfns[nr_pages++] = gfns[i];
-    }
-
-    if ( nr_pages > 0 )
+    if ( send_page_contents )
     {
-        guest_mapping = xenforeignmemory_map(xch->fmem,
-            ctx->domid, PROT_READ, nr_pages, gfns, errors);
-        if ( !guest_mapping )
-        {
-            PERROR("Failed to map guest pages");
-            goto err;
-        }
-        nr_pages_mapped = nr_pages;
-
-        for ( i = 0, p = 0; i < nr_pfns; ++i )
+        for ( i = 0; i < nr_pfns; ++i )
         {
             switch ( types[i] )
             {
@@ -172,36 +190,62 @@ static int write_batch(struct xc_sr_context *ctx)
                 continue;
             }
 
-            if ( errors[p] )
+            bgfns[nr_pages++] = gfns[i];
+        }
+
+        if ( nr_pages > 0 )
+        {
+            guest_mapping = xenforeignmemory_map(xch->fmem,
+                ctx->domid, PROT_READ, nr_pages, bgfns, errors);
+            if ( !guest_mapping )
             {
-                ERROR("Mapping of pfn %#"PRIpfn" (gfn %#"PRIpfn") failed %d",
-                      ctx->save.batch_pfns[i], gfns[p], errors[p]);
+                PERROR("Failed to map guest pages");
                 goto err;
             }
+            nr_pages_mapped = nr_pages;
+
+            for ( i = 0, p = 0; i < nr_pfns; ++i )
+            {
+                switch ( types[i] )
+                {
+                case XEN_DOMCTL_PFINFO_BROKEN:
+                case XEN_DOMCTL_PFINFO_XALLOC:
+                case XEN_DOMCTL_PFINFO_XTAB:
+                    continue;
+                }
 
-            orig_page = page = guest_mapping + (p * PAGE_SIZE);
-            rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
+                if ( errors[p] )
+                {
+                    ERROR("Mapping of pfn %#"PRIpfn" (mfn %#"PRIpfn") failed %d",
+                          ctx->save.batch_pfns[i], bgfns[p], errors[p]);
+                    goto err;
+                }
 
-            if ( orig_page != page )
-                local_pages[i] = page;
+                orig_page = page = guest_mapping + (p * PAGE_SIZE);
+                rc = ctx->save.ops.normalise_page(ctx, types[i], &page);
 
-            if ( rc )
-            {
-                if ( rc == -1 && errno == EAGAIN )
+                if ( orig_page != page )
+                    local_pages[i] = page;
+
+                if ( rc )
                 {
-                    set_bit(ctx->save.batch_pfns[i], ctx->save.deferred_pages);
-                    ++ctx->save.nr_deferred_pages;
-                    types[i] = XEN_DOMCTL_PFINFO_XTAB;
-                    --nr_pages;
+                    if ( rc == -1 && errno == EAGAIN )
+                    {
+                        set_bit(ctx->save.batch_pfns[i],
+                                ctx->save.deferred_pages);
+                        ++ctx->save.nr_deferred_pages;
+                        types[i] = XEN_DOMCTL_PFINFO_XTAB;
+                        --nr_pages;
+                    }
+                    else
+                        goto err;
                 }
                 else
-                    goto err;
-            }
-            else
-                guest_data[i] = page;
+                    guest_data[i] = page;
 
-            rc = -1;
-            ++p;
+                rc = -1;
+                ++p;
+            }
         }
     }
 
@@ -270,8 +314,7 @@ static int write_batch(struct xc_sr_context *ctx)
     free(local_pages);
     free(guest_data);
     free(errors);
-    free(types);
-    free(gfns);
+    free(bgfns);
 
     return rc;
 }
@@ -281,7 +324,7 @@ static int write_batch(struct xc_sr_context *ctx)
  */
 static bool batch_full(const struct xc_sr_context *ctx)
 {
-    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
+    return ctx->save.nr_batch_pfns == batch_sizes[ctx->save.batch_type];
 }
 
 /*
@@ -298,12 +341,29 @@ static bool batch_empty(struct xc_sr_context *ctx)
 static int flush_batch(struct xc_sr_context *ctx)
 {
     int rc = 0;
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *gfns = NULL, *types = NULL;
+    unsigned int nr_pfns = ctx->save.nr_batch_pfns;
 
     if ( batch_empty(ctx) )
-        return rc;
+        goto out;
 
-    rc = write_batch(ctx);
+    gfns = malloc(nr_pfns * sizeof(*gfns));
+    types = malloc(nr_pfns * sizeof(*types));
 
+    if ( !gfns || !types )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        rc = -1;
+        goto out;
+    }
+
+    rc = get_batch_info(ctx, gfns, types);
+    if ( rc )
+        goto out;
+
+    rc = write_batch(ctx, gfns, types);
     if ( !rc )
     {
         VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
@@ -311,6 +371,10 @@ static int flush_batch(struct xc_sr_context *ctx)
                                     sizeof(*ctx->save.batch_pfns));
     }
 
+ out:
+    free(gfns);
+    free(types);
+
     return rc;
 }
 
@@ -319,7 +383,7 @@ static int flush_batch(struct xc_sr_context *ctx)
  */
 static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
+    assert(ctx->save.nr_batch_pfns < batch_sizes[ctx->save.batch_type]);
     ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
 }
 
@@ -391,6 +455,7 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
     void *data = ctx->save.callbacks->data;
 
     assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_PRECOPY_PAGE;
     while ( p < ctx->save.p2m_size )
     {
         if ( ctx->save.phase == XC_SAVE_PHASE_PRECOPY )
diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h
index 303081d..40debf6 100644
--- a/tools/libxc/xg_save_restore.h
+++ b/tools/libxc/xg_save_restore.h
@@ -24,7 +24,7 @@
 ** We process save/restore/migrate in batches of pages; the below
 ** determines how many pages we (at maximum) deal with in each batch.
 */
-#define MAX_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
+#define MAX_PRECOPY_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
 
 /* When pinning page tables at the end of restore, we also use batching. */
 #define MAX_PIN_BATCH  1024
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread
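[Editorial note] The patch above replaces the single MAX_BATCH_SIZE constant with per-batch-type lookup tables (batch_sizes[], batch_rec_types[], batch_includes_contents[]) indexed by ctx->save.batch_type. A minimal sketch of that dispatch pattern, with illustrative names and values rather than code taken verbatim from the patched tree:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the lookup-table dispatch the patch introduces:
 * each save batch type carries its own maximum size (and, in the real
 * patch, its own record type and contents flag), replacing the single
 * MAX_BATCH_SIZE constant.  Names and sizes here are illustrative.
 */
enum batch_type {
    BATCH_PRECOPY_PAGE,   /* PAGE_DATA batches: pfns + page contents */
    BATCH_POSTCOPY_PFN,   /* pfn-only batches: no trailing page data */
    BATCH_NR_TYPES,
};

static const unsigned int batch_sizes[BATCH_NR_TYPES] = {
    [BATCH_PRECOPY_PAGE] = 1024,  /* up to 1024 4k pages (4MB) per batch */
    [BATCH_POSTCOPY_PFN] = 4096,  /* pfn-only batches can afford to be larger */
};

/* Mirrors the reworked batch_full(): full when the per-type limit is hit. */
static bool batch_full(enum batch_type type, unsigned int nr_batch_pfns)
{
    return nr_batch_pfns == batch_sizes[type];
}
```

The table-driven form keeps flush_batch()/add_to_batch() generic: they consult the current batch type rather than hard-coding one limit.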

* [PATCH RFC v2 11/23] libxc/migration: correct hvm record ordering specification
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (9 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 10/23] libxc/xc_sr_save: introduce save batch types Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 12/23] libxc/migration: specify postcopy live migration Joshua Otto
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

The libxc migration stream specification document asserts that, within
an hvm migration stream, "HVM_PARAMS must precede HVM_CONTEXT, as
certain parameters can affect the validity of architectural state in the
context."  This sounds reasonable, but the in-tree implementation of hvm
domain save actually writes these records in the _reverse_ order, with
HVM_CONTEXT first and HVM_PARAMS next.  This has been the case for the
entire history of that implementation, seemingly to no ill effect, so
update the spec to reflect this.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 docs/specs/libxc-migration-stream.pandoc | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index 73421ff..8342d88 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -673,11 +673,8 @@ A typical save record for an x86 HVM guest image would look like:
 2. Domain header
 3. Many PAGE\_DATA records
 4. TSC\_INFO
-5. HVM\_PARAMS
-6. HVM\_CONTEXT
-
-HVM\_PARAMS must precede HVM\_CONTEXT, as certain parameters can affect
-the validity of architectural state in the context.
+5. HVM\_CONTEXT
+6. HVM\_PARAMS
 
 
 Legacy Images (x86 only)
-- 
2.7.4




* [PATCH RFC v2 12/23] libxc/migration: specify postcopy live migration
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (10 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 11/23] libxc/migration: correct hvm record ordering specification Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 13/23] libxc/migration: add try_read_record() Joshua Otto
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

- allocate the new postcopy record type numbers
- augment the stream format specification to include these new types and
  their role in the protocol

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 docs/specs/libxc-migration-stream.pandoc | 175 +++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common.c               |   7 ++
 tools/libxc/xc_sr_stream_format.h        |   9 +-
 3 files changed, 190 insertions(+), 1 deletion(-)

diff --git a/docs/specs/libxc-migration-stream.pandoc b/docs/specs/libxc-migration-stream.pandoc
index 8342d88..9f08615 100644
--- a/docs/specs/libxc-migration-stream.pandoc
+++ b/docs/specs/libxc-migration-stream.pandoc
@@ -3,6 +3,7 @@
   Andrew Cooper <<andrew.cooper3@citrix.com>>
   Wen Congyang <<wency@cn.fujitsu.com>>
   Yang Hongyang <<hongyang.yang@easystack.cn>>
+  Joshua Otto <<jtotto@uwaterloo.ca>>
 % Revision 2
 
 Introduction
@@ -231,6 +232,20 @@ type         0x00000000: END
 
              0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
 
+             0x00000010: POSTCOPY_BEGIN
+
+             0x00000011: POSTCOPY_PFNS_BEGIN
+
+             0x00000012: POSTCOPY_PFNS
+
+             0x00000013: POSTCOPY_TRANSITION
+
+             0x00000014: POSTCOPY_PAGE_DATA
+
+             0x00000015: POSTCOPY_FAULT
+
+             0x00000016: POSTCOPY_COMPLETE
+
              0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
@@ -624,6 +639,142 @@ The count of pfns is: record->length/sizeof(uint64_t).
 
 \clearpage
 
+POSTCOPY_BEGIN
+--------------
+
+This record must only appear in a truly _live_ migration stream, and is
+transmitted by the migration sender to signal to the destination that
+the migration will (as soon as possible) transition from the memory
+pre-copy phase to the post-copy phase, during which remaining unmigrated
+domain memory is paged over the network on-demand _after_ the guest has
+resumed.
+
+This record _must_ be followed immediately by the domain CPU context
+records (e.g. TSC_INFO, HVM_CONTEXT and HVM_PARAMS for HVM domains).
+This is for practical reasons: in the HVM case, the PAGING_RING_PFN
+parameter must be known at the destination before preparation for paging
+can begin.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PFNS_BEGIN
+-------------------
+
+During the initiation sequence of a postcopy live migration, this record
+immediately follows the final domain CPU context record and indicates
+the beginning of a sequence of 0 or more POSTCOPY_PFNS records.  The
+destination uses this record as a cue to prepare for postcopy paging.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PFNS
+-------------
+
+Each POSTCOPY_PFNS record contains an unordered list of 'postcopy pfns'
+- i.e. pfns that are dirty at the sender and require migration during
+the postcopy phase.  The structure of the record is identical to that
+of the PAGE_DATA record type, but omits any actual trailing page
+contents.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+
+\clearpage
+
+POSTCOPY_TRANSITION
+-------------------
+
+This record is transmitted by a postcopy live migration sender after the
+final POSTCOPY_PFNS record.  It indicates that the embedded libxc stream
+will be interrupted by content in the higher-layer stream necessary to
+permit resumption of the domain at the destination, and further that,
+once the higher-layer content is complete, the domain should be resumed
+in postcopy mode at the destination.
+
+This record contains no fields; its body_length is 0.
+
+\clearpage
+
+POSTCOPY_PAGE_DATA
+------------------
+
+This record is identical in meaning and format to the PAGE_DATA record
+type, and is transmitted during live migration by the sender during the
+postcopy phase to transfer batches of outstanding domain memory.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+    | page_data[0]...                                 |
+    ...
+    +-------------------------------------------------+
+    | page_data[C-1]...                               |
+    ...
+    +-------------------------------------------------+
+
+It is an error for an XTAB, BROKEN or XALLOC pfn to be transmitted in a
+record of this type, so all pfns must be accompanied by backing data.
+It is an error for a pfn not previously included in a POSTCOPY_PFNS
+record to be included in a record of this type.
+
+\clearpage
+
+POSTCOPY_FAULT
+--------------
+
+A POSTCOPY_FAULT record is transmitted by a postcopy live migration
+_destination_ to communicate an urgent need for a batch of pfns.  It is
+identical in format to the POSTCOPY_PFNS record type, _except_ that the
+type of each page is not encoded in the transmitted pfns.
+
+     0     1     2     3     4     5     6     7 octet
+    +-----------------------+-------------------------+
+    | count (C)             | (reserved)              |
+    +-----------------------+-------------------------+
+    | pfn[0]                                          |
+    +-------------------------------------------------+
+    ...
+    +-------------------------------------------------+
+    | pfn[C-1]                                        |
+    +-------------------------------------------------+
+
+\clearpage
+
+POSTCOPY_COMPLETE
+-----------------
+
+A postcopy live migration _destination_ transmits a POSTCOPY_COMPLETE
+record when the postcopy phase of a migration is complete, if one was
+entered.
+
+This record contains no fields; its body_length is 0.
+
+In addition to reporting the phase completion to the sender, this record
+also enables the migration sender to flush its receive stream of
+in-flight POSTCOPY_FAULT records before handing control of the stream
+back to a higher layer.
+
+\clearpage
+
 Layout
 ======
 
@@ -676,6 +827,30 @@ A typical save record for an x86 HVM guest image would look like:
 5. HVM\_CONTEXT
 6. HVM\_PARAMS
 
+x86 HVM Postcopy Live Migration
+-------------------------------
+
+The bi-directional migration stream for postcopy live migration of an
+x86 HVM guest image would look like:
+
+ 1. Image header
+ 2. Domain header
+ 3. Many (or few!) PAGE\_DATA records
+ 4. POSTCOPY\_BEGIN
+ 5. TSC\_INFO
+ 6. HVM\_CONTEXT
+ 7. HVM\_PARAMS
+ 8. POSTCOPY\_PFNS\_BEGIN
+ 9. Many POSTCOPY\_PFNS records
+10. POSTCOPY\_TRANSITION
+... higher layer stream content ...
+11. Many POSTCOPY\_PAGE\_DATA records
+
+During 11, the destination would reply with (hopefully not too) many
+POSTCOPY\_FAULT records.
+
+After 11, the destination would transmit a final POSTCOPY\_COMPLETE.
+
 
 Legacy Images (x86 only)
 ========================
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index f443974..090b5fd 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -38,6 +38,13 @@ static const char *mandatory_rec_types[] =
     [REC_TYPE_VERIFY]                       = "Verify",
     [REC_TYPE_CHECKPOINT]                   = "Checkpoint",
     [REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST]    = "Checkpoint dirty pfn list",
+    [REC_TYPE_POSTCOPY_BEGIN]               = "Postcopy begin",
+    [REC_TYPE_POSTCOPY_PFNS_BEGIN]          = "Postcopy pfns begin",
+    [REC_TYPE_POSTCOPY_PFNS]                = "Postcopy pfns",
+    [REC_TYPE_POSTCOPY_TRANSITION]          = "Postcopy transition",
+    [REC_TYPE_POSTCOPY_PAGE_DATA]           = "Postcopy page data",
+    [REC_TYPE_POSTCOPY_FAULT]               = "Postcopy fault",
+    [REC_TYPE_POSTCOPY_COMPLETE]            = "Postcopy complete",
 };
 
 const char *rec_type_to_str(uint32_t type)
diff --git a/tools/libxc/xc_sr_stream_format.h b/tools/libxc/xc_sr_stream_format.h
index 32400b2..d16d0c7 100644
--- a/tools/libxc/xc_sr_stream_format.h
+++ b/tools/libxc/xc_sr_stream_format.h
@@ -76,10 +76,17 @@ struct xc_sr_rhdr
 #define REC_TYPE_VERIFY                     0x0000000dU
 #define REC_TYPE_CHECKPOINT                 0x0000000eU
 #define REC_TYPE_CHECKPOINT_DIRTY_PFN_LIST  0x0000000fU
+#define REC_TYPE_POSTCOPY_BEGIN             0x00000010U
+#define REC_TYPE_POSTCOPY_PFNS_BEGIN        0x00000011U
+#define REC_TYPE_POSTCOPY_PFNS              0x00000012U
+#define REC_TYPE_POSTCOPY_TRANSITION        0x00000013U
+#define REC_TYPE_POSTCOPY_PAGE_DATA         0x00000014U
+#define REC_TYPE_POSTCOPY_FAULT             0x00000015U
+#define REC_TYPE_POSTCOPY_COMPLETE          0x00000016U
 
 #define REC_TYPE_OPTIONAL             0x80000000U
 
-/* PAGE_DATA */
+/* PAGE_DATA/POSTCOPY_PFNS/POSTCOPY_PAGE_DATA/POSTCOPY_FAULT */
 struct xc_sr_rec_pages_header
 {
     uint32_t count;
-- 
2.7.4



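[Editorial note] POSTCOPY_PFNS and POSTCOPY_FAULT reuse the xc_sr_rec_pages_header wire layout without trailing page data, so a receiver can validate them with a pure length check, much as validate_pages_record() does for PAGE_DATA. A hedged sketch of that check follows; the struct mirrors the layout in the spec above, but the helper name is invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Wire layout shared by PAGE_DATA/POSTCOPY_PFNS/POSTCOPY_FAULT records. */
struct rec_pages_header {
    uint32_t count;      /* number of pfn entries, C */
    uint32_t reserved;   /* must be ignored by receivers */
    uint64_t pfn[];      /* C packed 64-bit pfn entries follow */
};

/*
 * Minimal length check for a pfn-list record (POSTCOPY_PFNS or
 * POSTCOPY_FAULT): the record body must be exactly large enough for the
 * header plus 'count' pfns, with no page contents trailing.  Sketch only.
 */
static int pfn_list_length_ok(uint32_t body_length, uint32_t count)
{
    return body_length ==
           sizeof(struct rec_pages_header) + count * sizeof(uint64_t);
}
```

A POSTCOPY_PAGE_DATA record, by contrast, would additionally carry one page of data per non-empty pfn after the pfn array, exactly as PAGE_DATA does.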

* [PATCH RFC v2 13/23] libxc/migration: add try_read_record()
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (11 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 12/23] libxc/migration: specify postcopy live migration Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 14/23] libxc/migration: implement the sender side of postcopy live migration Joshua Otto
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Enable non-blocking migration record reads by adding a helper routine that
manages the context of a record read across multiple invocations as the record's
data becomes available over time.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_private.c   | 21 +++++++++++----
 tools/libxc/xc_private.h   |  2 ++
 tools/libxc/xc_sr_common.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_sr_common.h | 39 ++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/xc_private.c b/tools/libxc/xc_private.c
index f395594..b33d02f 100644
--- a/tools/libxc/xc_private.c
+++ b/tools/libxc/xc_private.c
@@ -633,26 +633,37 @@ void bitmap_byte_to_64(uint64_t *lp, const uint8_t *bp, int nbits)
     }
 }
 
-int read_exact(int fd, void *data, size_t size)
+int try_read_exact(int fd, void *data, size_t size, size_t *offset)
 {
-    size_t offset = 0;
     ssize_t len;
 
-    while ( offset < size )
+    assert(offset);
+    *offset = 0;
+    while ( *offset < size )
     {
-        len = read(fd, (char *)data + offset, size - offset);
+        len = read(fd, (char *)data + *offset, size - *offset);
         if ( (len == -1) && (errno == EINTR) )
             continue;
         if ( len == 0 )
             errno = 0;
         if ( len <= 0 )
             return -1;
-        offset += len;
+        *offset += len;
     }
 
     return 0;
 }
 
+int read_exact(int fd, void *data, size_t size)
+{
+    size_t offset;
+    int rc;
+
+    rc = try_read_exact(fd, data, size, &offset);
+    assert(rc == -1 || offset == size);
+    return rc;
+}
+
 int write_exact(int fd, const void *data, size_t size)
 {
     size_t offset = 0;
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 1c27b0f..aaae344 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -384,6 +384,8 @@ int xc_flush_mmu_updates(xc_interface *xch, struct xc_mmu *mmu);
 
 /* Return 0 on success; -1 on error setting errno. */
 int read_exact(int fd, void *data, size_t size); /* EOF => -1, errno=0 */
+/* Like read_exact(), but stores the number of bytes read so far in *offset. */
+int try_read_exact(int fd, void *data, size_t size, size_t *offset);
 int write_exact(int fd, const void *data, size_t size);
 int writev_exact(int fd, const struct iovec *iov, int iovcnt);
 
diff --git a/tools/libxc/xc_sr_common.c b/tools/libxc/xc_sr_common.c
index 090b5fd..c37fe1f 100644
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -147,6 +147,71 @@ int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec)
     return 0;
 };
 
+int try_read_record(struct xc_sr_read_record_context *rrctx, int fd,
+                    struct xc_sr_record *rec)
+{
+    int rc;
+    xc_interface *xch = rrctx->ctx->xch;
+    size_t offset_out, dataoff, datasz;
+
+    /* If the header isn't yet complete, attempt to finish it first. */
+    if ( rrctx->offset < sizeof(rrctx->rhdr) )
+    {
+        rc = try_read_exact(fd, (char *)&rrctx->rhdr + rrctx->offset,
+                            sizeof(rrctx->rhdr) - rrctx->offset, &offset_out);
+        rrctx->offset += offset_out;
+
+        if ( rc )
+            return rc;
+    }
+
+    datasz = ROUNDUP(rrctx->rhdr.length, REC_ALIGN_ORDER);
+
+    if ( datasz )
+    {
+        if ( !rrctx->data )
+        {
+            rrctx->data = malloc(datasz);
+
+            if ( !rrctx->data )
+            {
+                ERROR("Unable to allocate %zu bytes for record (0x%08x, %s)",
+                      datasz, rrctx->rhdr.type,
+                      rec_type_to_str(rrctx->rhdr.type));
+                return -1;
+            }
+        }
+
+        dataoff = rrctx->offset - sizeof(rrctx->rhdr);
+        rc = try_read_exact(fd, (char *)rrctx->data + dataoff, datasz - dataoff,
+                            &offset_out);
+        rrctx->offset += offset_out;
+
+        if ( rc == -1 )
+        {
+            /* Differentiate between expected and fatal errors. */
+            if ( (errno != EAGAIN) && (errno != EWOULDBLOCK) )
+            {
+                free(rrctx->data);
+                rrctx->data = NULL;
+                PERROR("Failed to read %zu bytes for record (0x%08x, %s)",
+                       datasz, rrctx->rhdr.type,
+                       rec_type_to_str(rrctx->rhdr.type));
+            }
+
+            return rc;
+        }
+    }
+
+    /* Success!  Fill in the output record structure. */
+    rec->type   = rrctx->rhdr.type;
+    rec->length = rrctx->rhdr.length;
+    rec->data   = rrctx->data;
+    rrctx->data = NULL;
+
+    return 0;
+}
+
 int validate_pages_record(struct xc_sr_context *ctx, struct xc_sr_record *rec,
                           uint32_t expected_type)
 {
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index fc82e71..ce72e0d 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -399,6 +399,45 @@ static inline int write_record(struct xc_sr_context *ctx, int fd,
 int read_record(struct xc_sr_context *ctx, int fd, struct xc_sr_record *rec);
 
 /*
+ * try_read_record() (prototype below) reads a record from a _non-blocking_
+ * stream over the course of one or more invocations.  Context for the record
+ * read is maintained in an xc_sr_read_record_context.
+ *
+ * The protocol is:
+ * - call read_record_init() on an uninitialized or previously-destroyed
+ *   read-record context prior to using it to read a record
+ * - call try_read_record() with this initialized context one or more times
+ *   - rc < 0 and errno == EAGAIN/EWOULDBLOCK => try again
+ *   - rc < 0 otherwise => failure
+ *   - rc == 0 => a complete record has been read, and is filled into
+ *     try_read_record()'s rec argument
+ * - after either failure or completion of a record, destroy the context with
+ *   read_record_destroy()
+ */
+struct xc_sr_read_record_context
+{
+    struct xc_sr_context *ctx;
+    size_t offset;
+    struct xc_sr_rhdr rhdr;
+    void *data;
+};
+
+static inline void read_record_init(struct xc_sr_read_record_context *rrctx,
+                                    struct xc_sr_context *ctx)
+{
+    *rrctx = (struct xc_sr_read_record_context) { .ctx = ctx };
+}
+
+int try_read_record(struct xc_sr_read_record_context *rrctx, int fd,
+                    struct xc_sr_record *rec);
+
+static inline void read_record_destroy(struct xc_sr_read_record_context *rrctx)
+{
+    free(rrctx->data);
+    rrctx->data = NULL;
+}
+
+/*
  * Given a record of one of the page data types, validate it by:
  * - checking its actual type against its specific expected type
  * - sanity checking its actual length against its claimed length
-- 
2.7.4



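[Editorial note] The resumable-read idea behind try_read_exact()/try_read_record() above - accumulate progress in a caller-held offset so that EAGAIN mid-record loses nothing - can be illustrated against an in-memory stream. Everything below is a sketch, not the libxc code: read_some() is a test double standing in for read(2) on a non-blocking fd, and 'avail' models how many bytes have arrived on the socket so far.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Test double for a non-blocking fd: only bytes up to 'avail' are ready. */
struct mock_stream { const char *buf; size_t avail, pos; };

static ssize_t read_some(struct mock_stream *s, char *out, size_t want)
{
    size_t n = s->avail - s->pos;
    if (n == 0) { errno = EAGAIN; return -1; }  /* no bytes ready yet */
    if (n > want) n = want;
    memcpy(out, s->buf + s->pos, n);
    s->pos += n;
    return (ssize_t)n;
}

/*
 * Shape of try_read_exact(): fill data[*offset..size), recording progress
 * in *offset so the caller can simply retry the same call after EAGAIN.
 */
static int try_read(struct mock_stream *s, char *data, size_t size,
                    size_t *offset)
{
    while ( *offset < size )
    {
        ssize_t len = read_some(s, data + *offset, size - *offset);
        if ( len < 0 )
            return -1;   /* caller checks errno; EAGAIN means retry later */
        *offset += (size_t)len;
    }
    return 0;
}
```

try_read_record() layers the same trick twice: first the fixed-size record header is completed across invocations, then the (now known-length) body, with the partially-filled buffer held in the xc_sr_read_record_context between calls.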

* [PATCH RFC v2 14/23] libxc/migration: implement the sender side of postcopy live migration
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (12 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 13/23] libxc/migration: add try_read_record() Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 15/23] libxc/migration: implement the receiver " Joshua Otto
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Add a new 'postcopy' phase to the live migration algorithm, during which
unmigrated domain memory is paged over the network on-demand _after_ the
guest has been resumed at the destination.

To do so:
- Add a new precopy policy option, XGS_POLICY_POSTCOPY, that policies
  can use to request a transition to the postcopy live migration phase
  rather than a stop-and-copy of the remaining dirty pages.
- Add support to xc_domain_save() for this policy option by breaking out
  of the precopy loop early, transmitting the final set of dirty pfns
  and all remaining domain state (including higher-layer state) except
  memory, and entering a postcopy loop during which the remaining page
  data is pushed in the background.  Remote requests for specific pages
  in response to faults in the domain are serviced with priority in this
  loop.

The new save callbacks required for this migration phase are stubbed in
libxl for now, to be replaced in a subsequent patch that adds libxl
support for this migration phase.  Support for this phase on the
migration receiver side follows immediately in the next patch.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h     |  84 ++++---
 tools/libxc/xc_sr_common.h         |   8 +-
 tools/libxc/xc_sr_save.c           | 488 ++++++++++++++++++++++++++++++++++---
 tools/libxc/xc_sr_save_x86_hvm.c   |  13 +
 tools/libxc/xg_save_restore.h      |  16 +-
 tools/libxl/libxl_dom_save.c       |  11 +-
 tools/libxl/libxl_save_msgs_gen.pl |   6 +-
 7 files changed, 558 insertions(+), 68 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 215abd0..a662273 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -56,41 +56,59 @@ struct save_callbacks {
 #define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
 #define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
                                         * remaining dirty pages. */
+#define XGS_POLICY_POSTCOPY         2  /* Suspend the guest and transition into
+                                        * the postcopy phase of the migration. */
     int (*precopy_policy)(struct precopy_stats stats, void *data);
 
-    /* Called after the guest's dirty pages have been
-     *  copied into an output buffer.
-     * Callback function resumes the guest & the device model,
-     *  returns to xc_domain_save.
-     * xc_domain_save then flushes the output buffer, while the
-     *  guest continues to run.
-     */
-    int (*aftercopy)(void* data);
-
-    /* Called after the memory checkpoint has been flushed
-     * out into the network. Typical actions performed in this
-     * callback include:
-     *   (a) send the saved device model state (for HVM guests),
-     *   (b) wait for checkpoint ack
-     *   (c) release the network output buffer pertaining to the acked checkpoint.
-     *   (c) sleep for the checkpoint interval.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint */
-    int (*checkpoint)(void* data);
-
-    /*
-     * Called after the checkpoint callback.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint
-     */
-    int (*wait_checkpoint)(void* data);
-
-    /* Enable qemu-dm logging dirty pages to xen */
-    int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        struct {
+            /*
+             * Called during a live migration's transition to the postcopy phase
+             * to yield control of the stream back to a higher layer so it can
+             * transmit records needed for resumption of the guest at the
+             * destination (e.g. device model state, xenstore context)
+             */
+            int (*postcopy_transition)(void *data);
+        };
+
+        struct {
+            /* Called after the guest's dirty pages have been
+             *  copied into an output buffer.
+             * Callback function resumes the guest & the device model,
+             *  returns to xc_domain_save.
+             * xc_domain_save then flushes the output buffer, while the
+             *  guest continues to run.
+             */
+            int (*aftercopy)(void* data);
+
+            /* Called after the memory checkpoint has been flushed
+             * out into the network. Typical actions performed in this
+             * callback include:
+             *   (a) send the saved device model state (for HVM guests),
+             *   (b) wait for checkpoint ack
+             *   (c) release the network output buffer pertaining to the acked
+             *       checkpoint.
+             *   (d) sleep for the checkpoint interval.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint */
+            int (*checkpoint)(void* data);
+
+            /*
+             * Called after the checkpoint callback.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint
+             */
+            int (*wait_checkpoint)(void* data);
+
+            /* Enable qemu-dm logging dirty pages to xen */
+            int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+        };
+    };
 
     /* to be provided as the last argument to each callback function */
     void* data;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index ce72e0d..244c536 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -202,20 +202,24 @@ struct xc_sr_context
 
             enum {
                 XC_SAVE_PHASE_PRECOPY,
-                XC_SAVE_PHASE_STOP_AND_COPY
+                XC_SAVE_PHASE_STOP_AND_COPY,
+                XC_SAVE_PHASE_POSTCOPY
             } phase;
 
             struct precopy_stats stats;
             int policy_decision;
 
             enum {
-                XC_SR_SAVE_BATCH_PRECOPY_PAGE
+                XC_SR_SAVE_BATCH_PRECOPY_PAGE,
+                XC_SR_SAVE_BATCH_POSTCOPY_PFN,
+                XC_SR_SAVE_BATCH_POSTCOPY_PAGE
             } batch_type;
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
+            unsigned long nr_final_dirty_pages;
         } save;
 
         struct /* Restore data. */
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 9f077a3..81b4755 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -3,21 +3,28 @@
 
 #include "xc_sr_common.h"
 
-#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE
+#define MAX_BATCH_SIZE \
+    max(max(MAX_PRECOPY_BATCH_SIZE, MAX_PFN_BATCH_SIZE), MAX_POSTCOPY_BATCH_SIZE)
 
 static const unsigned int batch_sizes[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = MAX_PFN_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = MAX_POSTCOPY_BATCH_SIZE
 };
 
 static const bool batch_includes_contents[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = true,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = false,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = true
 };
 
 static const uint32_t batch_rec_types[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = REC_TYPE_POSTCOPY_PFNS,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = REC_TYPE_POSTCOPY_PAGE_DATA
 };
 
 /*
@@ -84,6 +91,38 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 }
 
 /*
+ * Writes a POSTCOPY_BEGIN record into the stream.
+ */
+static int write_postcopy_begin_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_begin = { REC_TYPE_POSTCOPY_BEGIN, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_begin);
+}
+
+/*
+ * Writes a POSTCOPY_PFNS_BEGIN record into the stream.
+ */
+static int write_postcopy_pfns_begin_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_pfns_begin =
+        { REC_TYPE_POSTCOPY_PFNS_BEGIN, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_pfns_begin);
+}
+
+/*
+ * Writes a POSTCOPY_TRANSITION record into the stream.
+ */
+static int write_postcopy_transition_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_transition =
+        { REC_TYPE_POSTCOPY_TRANSITION, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_transition);
+}
+
+/*
  * This function:
  * - maps each pfn in the current batch to its gfn
  * - gets the type of each pfn in the batch.
@@ -388,6 +427,125 @@ static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 }
 
 /*
+ * This function:
+ * - flushes the current batch of postcopy pfns into the migration stream
+ * - clears the dirty bits of all pfns with no migratable backing data
+ * - counts the number of pfns that _do_ have migratable backing data, adding
+ *   it to nr_final_dirty_pages
+ */
+static int flush_postcopy_pfns_batch(struct xc_sr_context *ctx)
+{
+    int rc = 0;
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *pfns = ctx->save.batch_pfns, *gfns = NULL, *types = NULL;
+    unsigned int i, nr_pfns = ctx->save.nr_batch_pfns;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PFN);
+
+    if ( batch_empty(ctx) )
+        goto out;
+
+    gfns = malloc(nr_pfns * sizeof(*gfns));
+    types = malloc(nr_pfns * sizeof(*types));
+
+    if ( !gfns || !types )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        rc = -1;
+        goto out;
+    }
+
+    rc = get_batch_info(ctx, gfns, types);
+    if ( rc )
+        goto out;
+
+    /*
+     * Consider any pages not backed by a physical page of data to have been
+     * 'cleaned' at this point - there's no sense wasting room in a subsequent
+     * postcopy batch to duplicate the type information.
+     */
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            clear_bit(pfns[i], dirty_bitmap);
+            continue;
+        }
+
+        ++ctx->save.nr_final_dirty_pages;
+    }
+
+    rc = write_batch(ctx, gfns, types);
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE *
+                                    sizeof(*ctx->save.batch_pfns));
+    }
+
+ out:
+    free(gfns);
+    free(types);
+
+    return rc;
+}
+
+/*
+ * This function:
+ * - writes a POSTCOPY_PFNS_BEGIN record into the stream
+ * - writes 0 or more POSTCOPY_PFNS records specifying the subset of domain
+ *   memory that must be migrated during the upcoming postcopy phase of the
+ *   migration
+ * - counts the number of pfns in this subset, storing it in
+ *   nr_final_dirty_pages
+ */
+static int send_postcopy_pfns(struct xc_sr_context *ctx)
+{
+    xen_pfn_t p;
+    int rc;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    /*
+     * The true nr_final_dirty_pages is iteratively computed by
+     * flush_postcopy_pfns_batch(), which counts only pages actually backed by
+     * data we need to migrate.
+     */
+    ctx->save.nr_final_dirty_pages = 0;
+
+    rc = write_postcopy_pfns_begin_record(ctx);
+    if ( rc )
+        return rc;
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PFN;
+    for ( p = 0; p < ctx->save.p2m_size; ++p )
+    {
+        if ( !test_bit(p, dirty_bitmap) )
+            continue;
+
+        if ( batch_full(ctx) )
+        {
+            rc = flush_postcopy_pfns_batch(ctx);
+            if ( rc )
+                return rc;
+        }
+
+        add_to_batch(ctx, p);
+    }
+
+    return flush_postcopy_pfns_batch(ctx);
+}
+
+/*
  * Pause/suspend the domain, and refresh ctx->dominfo if required.
  */
 static int suspend_domain(struct xc_sr_context *ctx)
@@ -716,20 +874,19 @@ static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
 }
 
 /*
- * Suspend the domain and send dirty memory.
- * This is the last iteration of the live migration and the
- * heart of the checkpointed stream.
+ * Suspend the domain and determine the final set of dirty pages.
  */
-static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+static int suspend_and_check_dirty(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
-    char *progress_str = NULL;
     int rc;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
-    ctx->save.phase = XC_SAVE_PHASE_STOP_AND_COPY;
+    ctx->save.phase = (ctx->save.policy_decision == XGS_POLICY_POSTCOPY)
+        ? XC_SAVE_PHASE_POSTCOPY
+        : XC_SAVE_PHASE_STOP_AND_COPY;
 
     rc = suspend_domain(ctx);
     if ( rc )
@@ -746,16 +903,6 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         goto out;
     }
 
-    if ( ctx->save.live )
-    {
-        rc = update_progress_string(ctx, &progress_str,
-                                    ctx->save.stats.iteration);
-        if ( rc )
-            goto out;
-    }
-    else
-        xc_set_progress_prefix(xch, "Checkpointed save");
-
     bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
     if ( !ctx->save.live && ctx->save.checkpointed == XC_MIG_STREAM_COLO )
@@ -768,19 +915,37 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         }
     }
 
-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
-    if ( rc )
-        goto out;
+    if ( !ctx->save.live || ctx->save.policy_decision != XGS_POLICY_POSTCOPY )
+    {
+        /*
+         * If we aren't transitioning to a postcopy live migration, then rather
+         * than explicitly counting the number of final dirty pages, simply
+         * (somewhat crudely) estimate it as this sum to save time.  If we _are_
+         * about to begin postcopy then we don't bother, since our count must in
+         * that case be exact and we'll work it out later on.
+         */
+        ctx->save.nr_final_dirty_pages =
+            stats.dirty_count + ctx->save.nr_deferred_pages;
+    }
 
     bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
     ctx->save.nr_deferred_pages = 0;
 
  out:
-    xc_set_progress_prefix(xch, NULL);
-    free(progress_str);
     return rc;
 }
 
+static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = suspend_and_check_dirty(ctx);
+    if ( rc )
+        return rc;
+
+    return send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages);
+}
+
 static int verify_frames(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
@@ -821,11 +986,13 @@ static int verify_frames(struct xc_sr_context *ctx)
 }
 
 /*
- * Send all domain memory.  This is the heart of the live migration loop.
+ * Send all domain memory, modulo postcopy pages.  This is the heart of the live
+ * migration loop.
  */
 static int send_domain_memory_live(struct xc_sr_context *ctx)
 {
     int rc;
+    xc_interface *xch = ctx->xch;
 
     rc = enable_logdirty(ctx);
     if ( rc )
@@ -835,10 +1002,19 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     if ( rc )
         goto out;
 
-    rc = suspend_and_send_dirty(ctx);
+    rc = suspend_and_check_dirty(ctx);
     if ( rc )
         goto out;
 
+    if ( ctx->save.policy_decision == XGS_POLICY_STOP_AND_COPY )
+    {
+        xc_set_progress_prefix(xch, "Final precopy iteration");
+        rc = send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages);
+        xc_set_progress_prefix(xch, NULL);
+        if ( rc )
+            goto out;
+    }
+
     if ( ctx->save.debug && ctx->save.checkpointed != XC_MIG_STREAM_NONE )
     {
         rc = verify_frames(ctx);
@@ -850,12 +1026,223 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     return rc;
 }
 
+static int handle_postcopy_faults(struct xc_sr_context *ctx,
+                                  struct xc_sr_record *rec,
+                                  /* OUT */ unsigned long *nr_new_fault_pfns,
+                                  /* OUT */ xen_pfn_t *last_fault_pfn)
+{
+    int rc;
+    unsigned int i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *fault_pages = rec->data;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(nr_new_fault_pfns);
+    *nr_new_fault_pfns = 0;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_FAULT);
+    if ( rc )
+        return rc;
+
+    DBGPRINTF("Handling a batch of %"PRIu32" faults!", fault_pages->count);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PAGE);
+    for ( i = 0; i < fault_pages->count; ++i )
+    {
+        if ( test_and_clear_bit(fault_pages->pfn[i], dirty_bitmap) )
+        {
+            if ( batch_full(ctx) )
+            {
+                rc = flush_batch(ctx);
+                if ( rc )
+                    return rc;
+            }
+
+            add_to_batch(ctx, fault_pages->pfn[i]);
+            ++(*nr_new_fault_pfns);
+        }
+    }
+
+    /* _Don't_ flush yet - fill out the rest of the batch. */
+
+    assert(fault_pages->count);
+    *last_fault_pfn = fault_pages->pfn[fault_pages->count - 1];
+    return 0;
+}
+
+/*
+ * Now that the guest has resumed at the destination, send all of the remaining
+ * dirty pages.  Periodically check for pages needed by the destination to make
+ * progress.
+ */
+static int postcopy_domain_memory(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    int recv_fd = ctx->save.recv_fd;
+    int old_flags;
+    struct xc_sr_read_record_context rrctx;
+    struct xc_sr_record rec = { 0, 0, NULL };
+    unsigned long nr_new_fault_pfns;
+    unsigned long pages_remaining = ctx->save.nr_final_dirty_pages;
+    xen_pfn_t last_fault_pfn, p;
+    bool received_postcopy_complete = false;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    read_record_init(&rrctx, ctx);
+
+    /*
+     * First, configure the receive stream as non-blocking so we can
+     * periodically poll it for fault requests.
+     */
+    old_flags = fcntl(recv_fd, F_GETFL);
+    if ( old_flags == -1 )
+    {
+        rc = old_flags;
+        goto err;
+    }
+
+    assert(!(old_flags & O_NONBLOCK));
+
+    rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK);
+    if ( rc == -1 )
+    {
+        goto err;
+    }
+
+    xc_set_progress_prefix(xch, "Postcopy phase");
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PAGE;
+
+    p = 0;
+    while ( pages_remaining )
+    {
+        /*
+         * Between (small) batches, poll the receive stream for new
+         * POSTCOPY_FAULT messages.
+         */
+        for ( ; ; )
+        {
+            rc = try_read_record(&rrctx, recv_fd, &rec);
+            if ( rc )
+            {
+                if ( (errno == EAGAIN) || (errno == EWOULDBLOCK) )
+                {
+                    break;
+                }
+
+                goto err;
+            }
+            else
+            {
+                /*
+                 * Tear down and re-initialize the read record context for the
+                 * next request record.
+                 */
+                read_record_destroy(&rrctx);
+                read_record_init(&rrctx, ctx);
+
+                if ( rec.type == REC_TYPE_POSTCOPY_COMPLETE )
+                {
+                    /*
+                     * The restore side may ultimately not need all of the pages
+                     * we think it does - for example, the guest may release
+                     * some outstanding pages.  If this occurs, we'll receive
+                     * this record before we'd otherwise expect to.
+                     */
+                    received_postcopy_complete = true;
+                    goto done;
+                }
+
+                rc = handle_postcopy_faults(ctx, &rec, &nr_new_fault_pfns,
+                                            &last_fault_pfn);
+                if ( rc )
+                    goto err;
+
+                free(rec.data);
+                rec.data = NULL;
+
+                assert(pages_remaining >= nr_new_fault_pfns);
+                pages_remaining -= nr_new_fault_pfns;
+
+                /*
+                 * To take advantage of any locality present in the postcopy
+                 * faults, continue the background copy process from the newest
+                 * page in the fault batch.
+                 */
+                p = (last_fault_pfn + 1) % ctx->save.p2m_size;
+            }
+        }
+
+        /*
+         * Now that we've serviced all of the POSTCOPY_FAULT requests we know
+         * about for now, fill out the current batch with background pages.
+         */
+        for ( ;
+              pages_remaining && !batch_full(ctx);
+              p = (p + 1) % ctx->save.p2m_size )
+        {
+            if ( test_and_clear_bit(p, dirty_bitmap) )
+            {
+                add_to_batch(ctx, p);
+                --pages_remaining;
+            }
+        }
+
+        rc = flush_batch(ctx);
+        if ( rc )
+            goto err;
+
+        xc_report_progress_step(
+            xch, ctx->save.nr_final_dirty_pages - pages_remaining,
+            ctx->save.nr_final_dirty_pages);
+    }
+
+ done:
+    /* Revert the receive stream to the (blocking) state we found it in. */
+    rc = fcntl(recv_fd, F_SETFL, old_flags);
+    if ( rc == -1 )
+        goto err;
+
+    if ( !received_postcopy_complete )
+    {
+        /*
+         * Flush any outstanding POSTCOPY_FAULT requests from the migration
+         * stream by reading until a POSTCOPY_COMPLETE is received.
+         */
+        do
+        {
+            rc = read_record(ctx, recv_fd, &rec);
+            if ( rc )
+                goto err;
+        } while ( rec.type != REC_TYPE_POSTCOPY_COMPLETE );
+    }
+
+ err:
+    xc_set_progress_prefix(xch, NULL);
+    free(rec.data);
+    read_record_destroy(&rrctx);
+    return rc;
+}
+
 /*
  * Checkpointed save.
  */
 static int send_domain_memory_checkpointed(struct xc_sr_context *ctx)
 {
-    return suspend_and_send_dirty(ctx);
+    int rc;
+    xc_interface *xch = ctx->xch;
+
+    xc_set_progress_prefix(xch, "Checkpointed save");
+    rc = suspend_and_send_dirty(ctx);
+    xc_set_progress_prefix(xch, NULL);
+
+    return rc;
 }
 
 /*
@@ -987,11 +1374,54 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             goto err;
         }
 
+        /*
+         * End-of-checkpoint records are handled differently in the case of
+         * postcopy migration, so we need to alert the destination before
+         * sending them.
+         */
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            rc = write_postcopy_begin_record(ctx);
+            if ( rc )
+                goto err;
+        }
+
         rc = ctx->save.ops.end_of_checkpoint(ctx);
         if ( rc )
             goto err;
 
-        if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            xc_report_progress_single(xch, "Beginning postcopy transition");
+
+            rc = send_postcopy_pfns(ctx);
+            if ( rc )
+                goto err;
+
+            rc = write_postcopy_transition_record(ctx);
+            if ( rc )
+                goto err;
+
+            /*
+             * Yield control to libxl to finish the transition.  Note that this
+             * callback returns _non-zero_ upon success.
+             */
+            rc = ctx->save.callbacks->postcopy_transition(
+                ctx->save.callbacks->data);
+            if ( !rc )
+            {
+                rc = -1;
+                goto err;
+            }
+
+            /* When libxl is done, we can begin the postcopy loop. */
+            rc = postcopy_domain_memory(ctx);
+            if ( rc )
+                goto err;
+        }
+        else if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
         {
             /*
              * We have now completed the initial live portion of the checkpoint
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index 54ddbfe..b12f0dd 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -92,6 +92,9 @@ static int write_hvm_params(struct xc_sr_context *ctx)
     unsigned int i;
     int rc;
 
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
     for ( i = 0; i < ARRAY_SIZE(params); i++ )
     {
         uint32_t index = params[i];
@@ -106,6 +109,16 @@ static int write_hvm_params(struct xc_sr_context *ctx)
 
         if ( value != 0 )
         {
+            if ( ctx->save.live &&
+                 ctx->save.policy_decision == XGS_POLICY_POSTCOPY &&
+                 ( index == HVM_PARAM_CONSOLE_PFN ||
+                   index == HVM_PARAM_STORE_PFN ||
+                   index == HVM_PARAM_IOREQ_PFN ||
+                   index == HVM_PARAM_BUFIOREQ_PFN ||
+                   index == HVM_PARAM_PAGING_RING_PFN ) &&
+                 test_and_clear_bit(value, dirty_bitmap) )
+                --ctx->save.nr_final_dirty_pages;
+
             entries[hdr.count].index = index;
             entries[hdr.count].value = value;
             hdr.count++;
diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h
index 40debf6..9f5b223 100644
--- a/tools/libxc/xg_save_restore.h
+++ b/tools/libxc/xg_save_restore.h
@@ -24,7 +24,21 @@
 ** We process save/restore/migrate in batches of pages; the below
 ** determines how many pages we (at maximum) deal with in each batch.
 */
-#define MAX_PRECOPY_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
+#define MAX_PRECOPY_BATCH_SIZE ((size_t)1024U)   /* up to 1024 pages (4MB) */
+
+/*
+** We process the migration postcopy transition in batches of pfns to ensure
+** that we stay within the record size bound.  Because these records contain
+** only pfns (and _not_ their contents), we can accommodate many more of them
+** in a batch.
+*/
+#define MAX_PFN_BATCH_SIZE ((4U << 20) / sizeof(uint64_t)) /* up to 512k pfns */
+
+/*
+** The postcopy background copy uses a smaller batch size to ensure it can
+** quickly respond to remote faults.
+*/
+#define MAX_POSTCOPY_BATCH_SIZE ((size_t)64U)
 
 /* When pinning page tables at the end of restore, we also use batching. */
 #define MAX_PIN_BATCH  1024
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index b65135d..eb1271e 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -350,6 +350,12 @@ static int libxl__save_live_migration_precopy_policy(
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
+static void libxl__save_live_migration_postcopy_transition_callback(void *user)
+{
+    /* XXX we're not yet ready to deal with this */
+    assert(0);
+}
+
 /*----- main code for saving, in order of execution -----*/
 
 void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
@@ -409,8 +415,11 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
         goto out;
     }
 
-    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
+    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE) {
         callbacks->suspend = libxl__domain_suspend_callback;
+        callbacks->postcopy_transition =
+            libxl__save_live_migration_postcopy_transition_callback;
+    }
 
     callbacks->precopy_policy = libxl__save_live_migration_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 50c97b4..5647b97 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -33,7 +33,8 @@ our @msgs = (
                                               'xen_pfn_t', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
-    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ],
+    [ 11, 'scxA',   "postcopy_transition", [] ]
 );
 
 #----------------------------------------
@@ -225,6 +226,7 @@ foreach my $sr (qw(save restore)) {
 
     f_decl("${setcallbacks}_${sr}", 'helper', 'void',
            "(struct ${sr}_callbacks *cbs, unsigned cbflags)");
+    f_more("${setcallbacks}_${sr}", "    memset(cbs, 0, sizeof(*cbs));\n");
 
     f_more("${receiveds}_${sr}",
            <<END_ALWAYS.($debug ? <<END_DEBUG : '').<<END_ALWAYS);
@@ -335,7 +337,7 @@ END_ALWAYS
         my $c_v = "(1u<<$msgnum)";
         my $c_cb = "cbs->$name";
         $f_more_sr->("    if ($c_cb) cbflags |= $c_v;\n", $enumcallbacks);
-        $f_more_sr->("    $c_cb = (cbflags & $c_v) ? ${encode}_${name} : 0;\n",
+        $f_more_sr->("    if (cbflags & $c_v) $c_cb = ${encode}_${name};\n",
                      $setcallbacks);
     }
     $f_more_sr->("        return 1;\n    }\n\n");
-- 
2.7.4
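
The fault-servicing policy implemented by postcopy_domain_memory() above -
clear and transmit remotely-faulted pfns first, then resume the background
copy from just past the most recent fault to exploit access locality - can be
illustrated with a small toy model. This is only a sketch for exposition: the
names (dirty[], cursor, service_fault, background_copy_one) are invented here
and do not appear in the patch, and the real code operates on a hypercall
bitmap in batches rather than one pfn at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_PFNS 16

/* Toy stand-ins for the dirty bitmap and the background-copy position
 * 'p' maintained by postcopy_domain_memory(). */
static bool dirty[NR_PFNS];
static unsigned cursor;

/* A remote fault: send the page immediately (clear its dirty bit) and
 * restart the background copy just past it, on the assumption that
 * nearby pages are likely to be faulted on next. */
static void service_fault(unsigned pfn)
{
    dirty[pfn] = false;
    cursor = (pfn + 1) % NR_PFNS;
}

/* Background copy: send the next dirty pfn at or after the cursor,
 * wrapping around the address space.  Returns the pfn sent, or -1 if
 * no dirty pages remain. */
static int background_copy_one(void)
{
    for (unsigned i = 0; i < NR_PFNS; ++i) {
        unsigned p = (cursor + i) % NR_PFNS;
        if (dirty[p]) {
            dirty[p] = false;
            cursor = (p + 1) % NR_PFNS;
            return (int)p;
        }
    }
    return -1;
}
```

After a fault on pfn 10, the background copy proceeds from pfn 11 rather than
restarting from wherever it previously was, so the faulting vCPU's neighbours
arrive soonest.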


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

* [PATCH RFC v2 15/23] libxc/migration: implement the receiver side of postcopy live migration
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Add the receive-side logic for a new 'postcopy' phase in the live
migration algorithm.

To support this migration phase:
- Augment the main restore record-processing logic to recognize and
  handle the postcopy-initiation records.
- Add the core logic for the phase, postcopy_restore(), which marks as
  paged-out all pfns reported by the sender as outstanding at the
  beginning of the phase, and subsequently serves as a pager for this
  subset of memory by forwarding paging requests to the migration sender
  and filling the outstanding domain memory as it is received.

The new restore callbacks required for this migration phase are stubbed
in libxl for now, to be replaced in a subsequent patch that adds libxl
support for this migration phase.
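
The per-pfn lifecycle the restorer tracks during this phase (invalid ->
outstanding -> requested -> ready, with nr_pending_pfns counting the
outstanding and requested sets) can be sketched as a toy state machine.
The enum and function names below are illustrative only - the real
bookkeeping lives in struct xc_sr_restore_paging and is driven by
POSTCOPY_PFNS records and vm_event paging requests, not these helpers.

```c
#include <assert.h>

/* Toy model of the restorer's per-pfn postcopy states: INVALID pfns are
 * not part of the postcopy set; OUTSTANDING pfns must still arrive;
 * REQUESTED pfns have been demanded via a paging fault and forwarded to
 * the sender; READY pfns have received their contents. */
enum pc_state { PC_INVALID, PC_OUTSTANDING, PC_REQUESTED, PC_READY };

#define NR_PFNS 8
static enum pc_state state[NR_PFNS];
static unsigned nr_pending;   /* outstanding + requested, as in the patch */

/* POSTCOPY_PFNS record received: the pfn joins the postcopy set. */
static void mark_outstanding(unsigned pfn)
{
    if (state[pfn] == PC_INVALID) {
        state[pfn] = PC_OUTSTANDING;
        ++nr_pending;
    }
}

/* Paging-ring fault: the request is forwarded to the sender, but the
 * pfn remains pending until its data arrives. */
static void mark_requested(unsigned pfn)
{
    if (state[pfn] == PC_OUTSTANDING)
        state[pfn] = PC_REQUESTED;
}

/* Page data received (background copy or fault response): the postcopy
 * phase completes when nr_pending drops to zero. */
static void mark_ready(unsigned pfn)
{
    if (state[pfn] == PC_OUTSTANDING || state[pfn] == PC_REQUESTED) {
        state[pfn] = PC_READY;
        --nr_pending;
    }
}
```

Note that a fault does not change nr_pending - only the arrival of page
data does - which matches the patch's definition of nr_pending_pfns as
"the total count of outstanding and requested pfns".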

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenguest.h      |  65 ++-
 tools/libxc/xc_sr_common.h          |  90 +++-
 tools/libxc/xc_sr_restore.c         | 957 ++++++++++++++++++++++++++++++++++--
 tools/libxc/xc_sr_restore_x86_hvm.c |  41 +-
 tools/libxl/libxl_create.c          |  17 +
 tools/libxl/libxl_save_msgs_gen.pl  |   2 +-
 6 files changed, 1113 insertions(+), 59 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index a662273..0049723 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -146,35 +146,52 @@ struct restore_callbacks {
      */
     int (*suspend)(void* data);
 
-    /* Called after the secondary vm is ready to resume.
-     * Callback function resumes the guest & the device model,
-     * returns to xc_domain_restore.
-     */
-    int (*aftercopy)(void* data);
+    union {
+        struct {
+            /*
+             * Called upon receipt of the POSTCOPY_TRANSITION record in the
+             * stream to yield control of the stream to the higher layer so that
+             * the remaining data needed to resume the domain in the postcopy
+             * phase can be obtained.  Returns as soon as the higher layer is
+             * finished with the stream.
+             *
+             * Returns 1 on success, 0 on failure.
+             */
+            int (*postcopy_transition)(void *data);
+        };
 
-    /* A checkpoint record has been found in the stream.
-     * returns: */
+        struct {
+            /* Called after the secondary vm is ready to resume.
+             * Callback function resumes the guest & the device model,
+             * returns to xc_domain_restore.
+             */
+            int (*aftercopy)(void* data);
+
+            /* A checkpoint record has been found in the stream.
+             * returns: */
 #define XGR_CHECKPOINT_ERROR    0 /* Terminate processing */
 #define XGR_CHECKPOINT_SUCCESS  1 /* Continue reading more data from the stream */
 #define XGR_CHECKPOINT_FAILOVER 2 /* Failover and resume VM */
-    int (*checkpoint)(void* data);
-
-    /*
-     * Called after the checkpoint callback.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint
-     */
-    int (*wait_checkpoint)(void* data);
+            int (*checkpoint)(void* data);
 
-    /*
-     * callback to send store gfn and console gfn to xl
-     * if we want to resume vm before xc_domain_save()
-     * exits.
-     */
-    void (*restore_results)(xen_pfn_t store_gfn, xen_pfn_t console_gfn,
-                            void *data);
+            /*
+             * Called after the checkpoint callback.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint
+             */
+            int (*wait_checkpoint)(void* data);
+
+            /*
+             * callback to send store gfn and console gfn to xl
+             * if we want to resume vm before xc_domain_save()
+             * exits.
+             */
+            void (*restore_results)(xen_pfn_t store_gfn, xen_pfn_t console_gfn,
+                                    void *data);
+        };
+    };
 
     /* to be provided as the last argument to each callback function */
     void* data;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index 244c536..d382642 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -3,6 +3,10 @@
 
 #include <stdbool.h>
 
+#include <xenevtchn.h>
+
+#include <xen/vm_event.h>
+
 #include "xg_private.h"
 #include "xg_save_restore.h"
 #include "xc_dom.h"
@@ -238,6 +242,90 @@ struct xc_sr_context
             uint32_t guest_type;
             uint32_t guest_page_size;
 
+            /* Is this a postcopy live migration? */
+            bool postcopy;
+
+            struct xc_sr_restore_paging
+            {
+                xenevtchn_handle *xce_handle;
+                int port;
+                vm_event_back_ring_t back_ring;
+                uint32_t evtchn_port;
+                void *ring_page;
+                void *buffer;
+
+                struct xc_sr_pending_postcopy_request
+                {
+                    xen_pfn_t pfn; /* == INVALID_PFN when not in use */
+
+                    /* As from vm_event_request_t */
+                    uint32_t flags;
+                    uint32_t vcpu_id;
+                } *pending_requests;
+
+                /*
+                 * The total count of outstanding and requested pfns.  The
+                 * postcopy phase is complete when this reaches 0.
+                 */
+                unsigned int nr_pending_pfns;
+
+                /*
+                 * Prior to the receipt of the first POSTCOPY_PFNS record, all
+                 * pfns are 'invalid', meaning that we don't (yet) believe that
+                 * they need to be migrated as part of the postcopy phase.
+                 *
+                 * Pfns received in POSTCOPY_PFNS records become 'outstanding',
+                 * meaning that they must be migrated but haven't yet been
+                 * requested, received or dropped.
+                 *
+                 * A pfn transitions from outstanding to requested when we
+                 * receive a request for it on the paging ring and request it
+                 * from the sender, before having received it.  There is at
+                 * least one valid entry in pending_requests for each requested
+                 * pfn.
+                 *
+                 * A pfn transitions from either outstanding or requested to
+                 * ready when its contents are received.  Responses to all
+                 * previous pager requests for this pfn are pushed at this time,
+                 * and subsequent pager requests for this pfn can be responded
+                 * to immediately.
+                 *
+                 * A pfn transitions from outstanding to dropped if we're
+                 * notified on the ring of the drop.  We track this explicitly
+                 * so that we don't panic upon subsequently receiving the
+                 * contents of this page from the sender.
+                 *
+                 * In summary, the per-pfn postcopy state machine is:
+                 *
+                 * invalid -> outstanding -> requested -> ready
+                 *                |                        ^
+                 *                +------------------------+
+                 *                |
+                 *                +-------> dropped
+                 *
+                 * The state of each pfn is tracked using these four bitmaps.
+                 */
+                unsigned long *outstanding_pfns;
+                unsigned long *requested_pfns;
+                unsigned long *ready_pfns;
+                unsigned long *dropped_pfns;
+
+                /*
+                 * Used to accumulate batches of pfns for which we must forward
+                 * paging requests to the sender.
+                 */
+                uint64_t *request_batch;
+
+                /* For teardown. */
+                bool evtchn_bound, evtchn_opened, paging_enabled, buffer_locked;
+
+                /*
+                 * So we can sanity-check the sequence of postcopy records in
+                 * the stream.
+                 */
+                bool ready;
+            } paging;
+
             /* Plain VM, or checkpoints over time. */
             int checkpointed;
 
@@ -261,7 +349,7 @@ struct xc_sr_context
              * INPUT:  evtchn & domid
              * OUTPUT: gfn
              */
-            xen_pfn_t    xenstore_gfn,    console_gfn;
+            xen_pfn_t    xenstore_gfn,    console_gfn,    paging_ring_gfn;
             unsigned int xenstore_evtchn, console_evtchn;
             domid_t      xenstore_domid,  console_domid;
 
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 51532aa..3aac0f0 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -1,6 +1,7 @@
 #include <arpa/inet.h>
 
 #include <assert.h>
+#include <poll.h>
 
 #include "xc_sr_common.h"
 
@@ -78,6 +79,30 @@ static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
     return test_bit(pfn, ctx->restore.populated_pfns);
 }
 
+static int pfn_bitmap_realloc(struct xc_sr_context *ctx, unsigned long **bitmap,
+                              size_t old_sz, size_t new_sz)
+{
+    xc_interface *xch = ctx->xch;
+    unsigned long *p;
+
+    assert(bitmap);
+
+    /* A still-NULL bitmap hasn't been allocated yet, so needn't be grown. */
+    if ( *bitmap )
+    {
+        p = realloc(*bitmap, new_sz);
+        if ( !p )
+        {
+            ERROR("Failed to realloc restore bitmap");
+            errno = ENOMEM;
+            return -1;
+        }
+
+        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+        *bitmap = p;
+    }
+
+    return 0;
+}
+
 /*
  * Set a pfn as populated, expanding the tracking structures if needed. To
  * avoid realloc()ing too excessively, the size increased to the nearest power
@@ -85,13 +110,21 @@ static bool pfn_is_populated(const struct xc_sr_context *ctx, xen_pfn_t pfn)
  */
 static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
 {
-    xc_interface *xch = ctx->xch;
+    int rc = 0;
 
     if ( pfn > ctx->restore.max_populated_pfn )
     {
         xen_pfn_t new_max;
         size_t old_sz, new_sz;
-        unsigned long *p;
+        unsigned int i;
+        unsigned long **bitmaps[] =
+        {
+            &ctx->restore.populated_pfns,
+            &ctx->restore.paging.outstanding_pfns,
+            &ctx->restore.paging.requested_pfns,
+            &ctx->restore.paging.ready_pfns,
+            &ctx->restore.paging.dropped_pfns
+        };
 
         /* Round up to the nearest power of two larger than pfn, less 1. */
         new_max = pfn;
@@ -106,17 +139,13 @@ static int pfn_set_populated(struct xc_sr_context *ctx, xen_pfn_t pfn)
 
         old_sz = bitmap_size(ctx->restore.max_populated_pfn + 1);
         new_sz = bitmap_size(new_max + 1);
-        p = realloc(ctx->restore.populated_pfns, new_sz);
-        if ( !p )
-        {
-            ERROR("Failed to realloc populated bitmap");
-            errno = ENOMEM;
-            return -1;
-        }
 
-        memset((uint8_t *)p + old_sz, 0x00, new_sz - old_sz);
+        for ( i = 0; i < ARRAY_SIZE(bitmaps) && !rc; ++i )
+            rc = pfn_bitmap_realloc(ctx, bitmaps[i], old_sz, new_sz);
+
+        if ( rc )
+            return rc;
 
-        ctx->restore.populated_pfns    = p;
         ctx->restore.max_populated_pfn = new_max;
     }
 
@@ -230,25 +259,8 @@ static int process_page_data(struct xc_sr_context *ctx, unsigned count,
     {
         ctx->restore.ops.set_page_type(ctx, pfns[i], types[i]);
 
-        switch ( types[i] )
-        {
-        case XEN_DOMCTL_PFINFO_NOTAB:
-
-        case XEN_DOMCTL_PFINFO_L1TAB:
-        case XEN_DOMCTL_PFINFO_L1TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-        case XEN_DOMCTL_PFINFO_L2TAB:
-        case XEN_DOMCTL_PFINFO_L2TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-        case XEN_DOMCTL_PFINFO_L3TAB:
-        case XEN_DOMCTL_PFINFO_L3TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
-        case XEN_DOMCTL_PFINFO_L4TAB:
-        case XEN_DOMCTL_PFINFO_L4TAB | XEN_DOMCTL_PFINFO_LPINTAB:
-
+        if ( types[i] < XEN_DOMCTL_PFINFO_BROKEN )
             gfns[nr_pages++] = ctx->restore.ops.pfn_to_gfn(ctx, pfns[i]);
-            break;
-        }
     }
 
     /* Nothing to do? */
@@ -425,6 +437,859 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec)
 }
 
 /*
+ * To prepare for entry to the postcopy phase of live migration:
+ * - enable paging on the domain, and set up the paging ring and event channel
+ * - allocate a locked and aligned paging buffer
+ * - allocate the postcopy page bookkeeping structures
+ */
+static int postcopy_paging_setup(struct xc_sr_context *ctx)
+{
+    int rc;
+    unsigned int i;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    xc_interface *xch = ctx->xch;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.postcopy )
+    {
+        ERROR("Received POSTCOPY_PFNS_BEGIN before POSTCOPY_BEGIN");
+        return -1;
+    }
+
+    paging->ring_page = xc_vm_event_enable(xch, ctx->domid,
+                                           HVM_PARAM_PAGING_RING_PFN,
+                                           &paging->evtchn_port);
+    if ( !paging->ring_page )
+    {
+        PERROR("Failed to enable paging");
+        return -1;
+    }
+    paging->paging_enabled = true;
+
+    paging->xce_handle = xenevtchn_open(NULL, 0);
+    if ( !paging->xce_handle )
+    {
+        ERROR("Failed to open paging evtchn");
+        return -1;
+    }
+    paging->evtchn_opened = true;
+
+    rc = xenevtchn_bind_interdomain(paging->xce_handle, ctx->domid,
+                                    paging->evtchn_port);
+    if ( rc < 0 )
+    {
+        ERROR("Failed to bind paging evtchn");
+        return rc;
+    }
+    paging->evtchn_bound = true;
+    paging->port = rc;
+
+    SHARED_RING_INIT((vm_event_sring_t *)paging->ring_page);
+    BACK_RING_INIT(&paging->back_ring, (vm_event_sring_t *)paging->ring_page,
+                   PAGE_SIZE);
+
+    errno = posix_memalign(&paging->buffer, PAGE_SIZE, PAGE_SIZE);
+    if ( errno != 0 )
+    {
+        PERROR("Failed to allocate paging buffer");
+        return -1;
+    }
+
+    rc = mlock(paging->buffer, PAGE_SIZE);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to lock paging buffer");
+        return rc;
+    }
+    paging->buffer_locked = true;
+
+    paging->outstanding_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->requested_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->ready_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+    paging->dropped_pfns = bitmap_alloc(ctx->restore.max_populated_pfn + 1);
+
+    paging->pending_requests = malloc(RING_SIZE(&paging->back_ring) *
+                                      sizeof(*paging->pending_requests));
+    paging->request_batch = malloc(RING_SIZE(&paging->back_ring) *
+                                   sizeof(*paging->request_batch));
+    if ( !paging->outstanding_pfns ||
+         !paging->requested_pfns ||
+         !paging->ready_pfns ||
+         !paging->dropped_pfns ||
+         !paging->pending_requests ||
+         !paging->request_batch )
+    {
+        PERROR("Failed to allocate pfn state tracking buffers");
+        return -1;
+    }
+
+    /* All slots are initially empty. */
+    for ( i = 0; i < RING_SIZE(&paging->back_ring); ++i )
+        paging->pending_requests[i].pfn = INVALID_PFN;
+
+    paging->ready = true;
+
+    return 0;
+}
+
+static void postcopy_paging_cleanup(struct xc_sr_context *ctx)
+{
+    int rc;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    xc_interface *xch = ctx->xch;
+
+    if ( paging->ring_page )
+        munmap(paging->ring_page, PAGE_SIZE);
+
+    if ( paging->paging_enabled )
+    {
+        rc = xc_vm_event_control(xch, ctx->domid, XEN_VM_EVENT_DISABLE,
+                                 XEN_DOMCTL_VM_EVENT_OP_PAGING, NULL);
+        if ( rc != 0 )
+            ERROR("Failed to disable paging");
+    }
+
+    if ( paging->evtchn_bound )
+    {
+        rc = xenevtchn_unbind(paging->xce_handle, paging->port);
+        if ( rc != 0 )
+            ERROR("Failed to unbind event port");
+    }
+
+    if ( paging->evtchn_opened )
+    {
+        rc = xenevtchn_close(paging->xce_handle);
+        if ( rc != 0 )
+            ERROR("Failed to close event channel");
+    }
+
+    if ( paging->buffer )
+    {
+        if ( paging->buffer_locked )
+            munlock(paging->buffer, PAGE_SIZE);
+
+        free(paging->buffer);
+    }
+
+    free(paging->outstanding_pfns);
+    free(paging->requested_pfns);
+    free(paging->ready_pfns);
+    free(paging->dropped_pfns);
+    free(paging->pending_requests);
+    free(paging->request_batch);
+}
+
+/* Helpers to query and transition the state of postcopy pfns. */
+
+static inline bool postcopy_pfn_outstanding(struct xc_sr_context *ctx,
+                                            xen_pfn_t pfn)
+{
+    return (pfn <= ctx->restore.max_populated_pfn)
+        ? test_bit(pfn, ctx->restore.paging.outstanding_pfns)
+        : false;
+}
+
+static inline bool postcopy_pfn_requested(struct xc_sr_context *ctx,
+                                          xen_pfn_t pfn)
+{
+    return (pfn <= ctx->restore.max_populated_pfn)
+        ? test_bit(pfn, ctx->restore.paging.requested_pfns)
+        : false;
+}
+
+static inline bool postcopy_pfn_ready(struct xc_sr_context *ctx,
+                                      xen_pfn_t pfn)
+{
+    return (pfn <= ctx->restore.max_populated_pfn)
+        ? test_bit(pfn, ctx->restore.paging.ready_pfns)
+        : false;
+}
+
+static inline bool postcopy_pfn_dropped(struct xc_sr_context *ctx,
+                                        xen_pfn_t pfn)
+{
+    return (pfn <= ctx->restore.max_populated_pfn)
+        ? test_bit(pfn, ctx->restore.paging.dropped_pfns)
+        : false;
+}
+
+static inline bool postcopy_pfn_invalid(struct xc_sr_context *ctx,
+                                        xen_pfn_t pfn)
+{
+    if ( pfn > ctx->restore.max_populated_pfn )
+        return false;
+
+    return !postcopy_pfn_outstanding(ctx, pfn) &&
+           !postcopy_pfn_requested(ctx, pfn) &&
+           !postcopy_pfn_ready(ctx, pfn) &&
+           !postcopy_pfn_dropped(ctx, pfn);
+}
+
+static inline void mark_postcopy_pfn_outstanding(struct xc_sr_context *ctx,
+                                                 xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_invalid(ctx, pfn));
+
+    set_bit(pfn, ctx->restore.paging.outstanding_pfns);
+}
+
+static inline void mark_postcopy_pfn_requested(struct xc_sr_context *ctx,
+                                               xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    set_bit(pfn, ctx->restore.paging.requested_pfns);
+}
+
+static inline void mark_postcopy_pfn_ready(struct xc_sr_context *ctx,
+                                           xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn) ||
+           postcopy_pfn_requested(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    clear_bit(pfn, ctx->restore.paging.requested_pfns);
+    set_bit(pfn, ctx->restore.paging.ready_pfns);
+}
+
+static inline void mark_postcopy_pfn_dropped(struct xc_sr_context *ctx,
+                                             xen_pfn_t pfn)
+{
+    assert(pfn <= ctx->restore.max_populated_pfn);
+    assert(postcopy_pfn_outstanding(ctx, pfn));
+
+    clear_bit(pfn, ctx->restore.paging.outstanding_pfns);
+    set_bit(pfn, ctx->restore.paging.dropped_pfns);
+}
+
+static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned int count,
+                                 xen_pfn_t *pfns, uint32_t *types)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    int rc;
+    unsigned int i;
+    xen_pfn_t bpfn;
+
+    rc = populate_pfns(ctx, count, pfns, types);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", count);
+        goto out;
+    }
+
+    for ( i = 0; i < count; ++i )
+    {
+        if ( types[i] < XEN_DOMCTL_PFINFO_BROKEN )
+        {
+            bpfn = pfns[i];
+
+            /* We should never see the same pfn twice at this stage.  */
+            if ( !postcopy_pfn_invalid(ctx, bpfn) )
+            {
+                rc = -1;
+                ERROR("Duplicate postcopy pfn %"PRI_xen_pfn, bpfn);
+                goto out;
+            }
+
+            /*
+             * We now consider this pfn 'outstanding' - pending, and not yet
+             * requested.
+             */
+            mark_postcopy_pfn_outstanding(ctx, bpfn);
+            ++paging->nr_pending_pfns;
+
+            /*
+             * Neither nomination nor eviction can be permitted to fail - the
+             * guest isn't yet running, so a failure would imply a foreign or
+             * hypervisor mapping on the page, and that would be bogus because
+             * the migration isn't yet complete.
+             */
+            rc = xc_mem_paging_nominate(xch, ctx->domid, bpfn);
+            if ( rc < 0 )
+            {
+                PERROR("Error nominating postcopy pfn %"PRI_xen_pfn, bpfn);
+                goto out;
+            }
+
+            rc = xc_mem_paging_evict(xch, ctx->domid, bpfn);
+            if ( rc < 0 )
+            {
+                PERROR("Error evicting postcopy pfn %"PRI_xen_pfn, bpfn);
+                goto out;
+            }
+        }
+    }
+
+    rc = 0;
+
+ out:
+    return rc;
+}
+
+static int handle_postcopy_pfns(struct xc_sr_context *ctx,
+                                struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    int rc;
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.paging.ready )
+    {
+        ERROR("Received POSTCOPY_PFNS record before POSTCOPY_PFNS_BEGIN");
+        rc = -1;
+        goto err;
+    }
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_PFNS);
+    if ( rc )
+        goto err;
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        rc = -1;
+        ERROR("Unable to allocate enough memory for %u pfns",
+              pages->count);
+        goto err;
+    }
+
+    (void)decode_pages_record(ctx, pages, pfns, types);
+
+    if ( rec->length != (sizeof(*pages) + (sizeof(uint64_t) * pages->count)) )
+    {
+        ERROR("POSTCOPY_PFNS record wrong size: length %u, expected "
+              "%zu + %zu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count));
+        rc = -1;
+        goto err;
+    }
+
+    rc = process_postcopy_pfns(ctx, pages->count, pfns, types);
+
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+static int handle_postcopy_transition(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    void *data = ctx->restore.callbacks->data;
+
+    /* Sanity-check the migration stream. */
+    if ( !ctx->restore.paging.ready )
+    {
+        ERROR("Received POSTCOPY_TRANSITION record before POSTCOPY_PFNS_BEGIN");
+        return -1;
+    }
+
+    rc = ctx->restore.ops.stream_complete(ctx);
+    if ( rc )
+        return rc;
+
+    ctx->restore.callbacks->restore_results(ctx->restore.xenstore_gfn,
+                                            ctx->restore.console_gfn,
+                                            data);
+
+    /*
+     * Asynchronously resume the guest.  We'll return when we've been handed
+     * back control of the stream, so that we can begin filling in the
+     * outstanding postcopy page data and forwarding guest requests for specific
+     * pages.
+     */
+    IPRINTF("Postcopy transition: resuming guest");
+    return ctx->restore.callbacks->postcopy_transition(data) ? 0 : -1;
+}
+
+static int postcopy_load_page(struct xc_sr_context *ctx, xen_pfn_t pfn,
+                              void *page_data)
+{
+    int rc;
+    unsigned int i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_pending_postcopy_request *preq;
+    vm_event_response_t rsp;
+    vm_event_back_ring_t *back_ring = &paging->back_ring;
+
+    assert(postcopy_pfn_outstanding(ctx, pfn) ||
+           postcopy_pfn_requested(ctx, pfn));
+
+    memcpy(paging->buffer, page_data, PAGE_SIZE);
+    rc = xc_mem_paging_load(ctx->xch, ctx->domid, pfn, paging->buffer);
+    if ( rc < 0 )
+    {
+        PERROR("Failed to paging load pfn %"PRI_xen_pfn, pfn);
+        return rc;
+    }
+
+    if ( postcopy_pfn_requested(ctx, pfn) )
+    {
+        for ( i = 0; i < RING_SIZE(back_ring); ++i )
+        {
+            preq = &paging->pending_requests[i];
+            if ( preq->pfn != pfn )
+                continue;
+
+            /* Put the response on the ring. */
+            rsp = (vm_event_response_t)
+            {
+                .version = VM_EVENT_INTERFACE_VERSION,
+                .vcpu_id = preq->vcpu_id,
+                .flags   = (preq->flags & VM_EVENT_FLAG_VCPU_PAUSED),
+                .reason  = VM_EVENT_REASON_MEM_PAGING,
+                .u       = { .mem_paging = { .gfn = pfn } }
+            };
+
+            memcpy(RING_GET_RESPONSE(back_ring, back_ring->rsp_prod_pvt),
+                   &rsp, sizeof(rsp));
+            ++back_ring->rsp_prod_pvt;
+
+            /* And free the pending request slot. */
+            preq->pfn = INVALID_PFN;
+        }
+    }
+
+    --paging->nr_pending_pfns;
+    mark_postcopy_pfn_ready(ctx, pfn);
+    return 0;
+}
+
+static int process_postcopy_page_data(struct xc_sr_context *ctx,
+                                      unsigned int count, xen_pfn_t *pfns,
+                                      uint32_t *types, void *page_data)
+{
+    int rc;
+    unsigned int i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    bool push_responses = false;
+
+    for ( i = 0; i < count; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_XTAB:
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+            ERROR("Received postcopy pfn %"PRI_xen_pfn
+                  " with invalid type %"PRIu32, pfns[i], types[i]);
+            return -1;
+        default:
+            if ( postcopy_pfn_invalid(ctx, pfns[i]) )
+            {
+                ERROR("Expected pfn %"PRI_xen_pfn" to be invalid", pfns[i]);
+                return -1;
+            }
+            else if ( postcopy_pfn_ready(ctx, pfns[i]) )
+            {
+                ERROR("pfn %"PRI_xen_pfn" already received", pfns[i]);
+                return -1;
+            }
+            else if ( postcopy_pfn_dropped(ctx, pfns[i]) )
+            {
+                /* Nothing to do - move on to the next page. */
+                page_data += PAGE_SIZE;
+            }
+            else
+            {
+                if ( postcopy_pfn_requested(ctx, pfns[i]) )
+                {
+                    DBGPRINTF("Received requested pfn %"PRI_xen_pfn, pfns[i]);
+                    push_responses = true;
+                }
+
+                rc = postcopy_load_page(ctx, pfns[i], page_data);
+                if ( rc )
+                    return rc;
+
+                page_data += PAGE_SIZE;
+            }
+            break;
+        }
+    }
+
+    if ( push_responses )
+    {
+        /*
+         * We put at least one response on the ring as a result of processing
+         * this batch of pages, so we need to push them and kick the ring event
+         * channel.
+         */
+        RING_PUSH_RESPONSES(&paging->back_ring);
+
+        rc = xenevtchn_notify(paging->xce_handle, paging->port);
+        if ( rc )
+        {
+            ERROR("Failed to notify paging event channel");
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+static int handle_postcopy_page_data(struct xc_sr_context *ctx,
+                                     struct xc_sr_record *rec)
+{
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *pages = rec->data;
+    unsigned int pages_of_data;
+    int rc = -1;
+
+    xen_pfn_t *pfns = NULL;
+    uint32_t *types = NULL;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_PAGE_DATA);
+    if ( rc )
+        goto err;
+
+    pfns = malloc(pages->count * sizeof(*pfns));
+    types = malloc(pages->count * sizeof(*types));
+    if ( !pfns || !types )
+    {
+        rc = -1;
+        ERROR("Unable to allocate enough memory for %u pfns",
+              pages->count);
+        goto err;
+    }
+
+    pages_of_data = decode_pages_record(ctx, pages, pfns, types);
+
+    if ( rec->length != (sizeof(*pages) +
+                         (sizeof(uint64_t) * pages->count) +
+                         (PAGE_SIZE * pages_of_data)) )
+    {
+        ERROR("POSTCOPY_PAGE_DATA record wrong size: length %u, expected "
+              "%zu + %zu + %lu", rec->length, sizeof(*pages),
+              (sizeof(uint64_t) * pages->count), (PAGE_SIZE * pages_of_data));
+        rc = -1;
+        goto err;
+    }
+
+    rc = process_postcopy_page_data(ctx, pages->count, pfns, types,
+                                    &pages->pfn[pages->count]);
+ err:
+    free(types);
+    free(pfns);
+
+    return rc;
+}
+
+static int forward_postcopy_paging_requests(struct xc_sr_context *ctx,
+                                            unsigned int nr_batch_requests)
+{
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    size_t batchsz = nr_batch_requests * sizeof(*paging->request_batch);
+    struct xc_sr_rec_pages_header phdr =
+    {
+        .count = nr_batch_requests
+    };
+    struct xc_sr_record rec =
+    {
+        .type   = REC_TYPE_POSTCOPY_FAULT,
+        .length = sizeof(phdr),
+        .data   = &phdr
+    };
+
+    return write_split_record(ctx, ctx->restore.send_back_fd, &rec,
+                              paging->request_batch, batchsz);
+}
+
+static int handle_postcopy_paging_requests(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_pending_postcopy_request *preq;
+    vm_event_back_ring_t *back_ring = &paging->back_ring;
+    vm_event_request_t req;
+    vm_event_response_t rsp;
+    xen_pfn_t pfn;
+    bool put_responses = false, drop_requested;
+    unsigned int i, nr_batch_requests = 0;
+
+    while ( RING_HAS_UNCONSUMED_REQUESTS(back_ring) )
+    {
+        RING_COPY_REQUEST(back_ring, back_ring->req_cons, &req);
+        ++back_ring->req_cons;
+
+        drop_requested = !!(req.u.mem_paging.flags & MEM_PAGING_DROP_PAGE);
+        pfn = req.u.mem_paging.gfn;
+
+        DBGPRINTF("Postcopy page fault! %"PRI_xen_pfn, pfn);
+
+        if ( postcopy_pfn_invalid(ctx, pfn) )
+        {
+            ERROR("pfn %"PRI_xen_pfn" does not need to be migrated", pfn);
+            rc = -1;
+            goto err;
+        }
+        else if ( postcopy_pfn_ready(ctx, pfn) || drop_requested )
+        {
+            if ( drop_requested )
+            {
+                if ( postcopy_pfn_outstanding(ctx, pfn) )
+                {
+                    mark_postcopy_pfn_dropped(ctx, pfn);
+                    --paging->nr_pending_pfns;
+                }
+                else
+                {
+                    ERROR("Pager requesting we drop non-paged "
+                          "(or previously-requested) pfn %"PRI_xen_pfn, pfn);
+                    rc = -1;
+                    goto err;
+                }
+            }
+
+            /*
+             * This page has already been loaded (or has been dropped), so we
+             * can respond immediately.
+             */
+            rsp = (vm_event_response_t)
+            {
+                .version = VM_EVENT_INTERFACE_VERSION,
+                .vcpu_id = req.vcpu_id,
+                .flags   = (req.flags & VM_EVENT_FLAG_VCPU_PAUSED),
+                .reason  = VM_EVENT_REASON_MEM_PAGING,
+                .u       = { .mem_paging = { .gfn = pfn } }
+            };
+
+            memcpy(RING_GET_RESPONSE(back_ring, back_ring->rsp_prod_pvt),
+                   &rsp, sizeof(rsp));
+            ++back_ring->rsp_prod_pvt;
+
+            put_responses = true;
+        }
+        else /* implies not dropped AND either outstanding or requested */
+        {
+            if ( postcopy_pfn_outstanding(ctx, pfn) )
+            {
+                /* This is the first time this pfn has been requested. */
+                mark_postcopy_pfn_requested(ctx, pfn);
+
+                paging->request_batch[nr_batch_requests] = pfn;
+                ++nr_batch_requests;
+            }
+
+            /* Find a free pending_requests slot. */
+            for ( i = 0; i < RING_SIZE(back_ring); ++i )
+            {
+                preq = &paging->pending_requests[i];
+                if ( preq->pfn == INVALID_PFN )
+                {
+                    /* Claim this slot. */
+                    preq->pfn = pfn;
+
+                    preq->flags = req.flags;
+                    preq->vcpu_id = req.vcpu_id;
+                    break;
+                }
+            }
+
+            /*
+             * We _must_ find a free slot - there cannot be more outstanding
+             * requests than there are slots in the ring.
+             */
+            assert(i < RING_SIZE(back_ring));
+        }
+    }
+
+    if ( put_responses )
+    {
+        RING_PUSH_RESPONSES(back_ring);
+
+        rc = xenevtchn_notify(paging->xce_handle, paging->port);
+        if ( rc )
+        {
+            ERROR("Failed to notify paging event channel");
+            goto err;
+        }
+    }
+
+    if ( nr_batch_requests )
+    {
+        rc = forward_postcopy_paging_requests(ctx, nr_batch_requests);
+        if ( rc )
+        {
+            ERROR("Failed to forward postcopy paging requests");
+            goto err;
+        }
+    }
+
+    rc = 0;
+
+ err:
+    return rc;
+}
+
+static int write_postcopy_complete_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record rec = { REC_TYPE_POSTCOPY_COMPLETE };
+
+    return write_record(ctx, ctx->restore.send_back_fd, &rec);
+}
+
+static int postcopy_restore(struct xc_sr_context *ctx)
+{
+    int rc;
+    int recv_fd = ctx->fd;
+    int old_flags;
+    int port;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_restore_paging *paging = &ctx->restore.paging;
+    struct xc_sr_read_record_context rrctx;
+    struct xc_sr_record rec = { 0, 0, NULL };
+    struct pollfd pfds[] =
+    {
+        { .fd = xenevtchn_fd(paging->xce_handle), .events = POLLIN },
+        { .fd = recv_fd,                          .events = POLLIN }
+    };
+
+    assert(ctx->restore.postcopy);
+    assert(paging->xce_handle);
+
+    read_record_init(&rrctx, ctx);
+
+    /*
+     * Make the receive stream non-blocking for the duration of the postcopy
+     * loop.
+     */
+    old_flags = fcntl(recv_fd, F_GETFL);
+    if ( old_flags == -1 )
+    {
+        rc = old_flags;
+        goto err;
+    }
+
+    assert(!(old_flags & O_NONBLOCK));
+
+    rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK);
+    if ( rc == -1 )
+    {
+        goto err;
+    }
+
+    while ( paging->nr_pending_pfns )
+    {
+        rc = poll(pfds, ARRAY_SIZE(pfds), -1);
+        if ( rc < 0 )
+        {
+            if ( errno == EINTR )
+                continue;
+
+            PERROR("Failed to poll the pager event channel/restore stream");
+            goto err;
+        }
+
+        /*
+         * Fill in any newly received page data first, on the off chance that
+         * new pager requests are for that data.
+         */
+        if ( pfds[1].revents & POLLIN )
+        {
+            rc = try_read_record(&rrctx, recv_fd, &rec);
+            if ( rc && (errno != EAGAIN) && (errno != EWOULDBLOCK) )
+            {
+                goto err;
+            }
+            else if ( !rc )
+            {
+                read_record_destroy(&rrctx);
+                read_record_init(&rrctx, ctx);
+
+                rc = handle_postcopy_page_data(ctx, &rec);
+                if ( rc )
+                    goto err;
+
+                free(rec.data);
+                rec.data = NULL;
+            }
+        }
+
+        if ( pfds[0].revents & POLLIN )
+        {
+            port = xenevtchn_pending(paging->xce_handle);
+            if ( port == -1 )
+            {
+                ERROR("Failed to read port from pager event channel");
+                rc = -1;
+                goto err;
+            }
+
+            rc = xenevtchn_unmask(paging->xce_handle, port);
+            if ( rc != 0 )
+            {
+                ERROR("Failed to unmask pager event channel port");
+                goto err;
+            }
+
+            rc = handle_postcopy_paging_requests(ctx);
+            if ( rc )
+                goto err;
+        }
+    }
+
+    /*
+     * At this point, all outstanding postcopy pages have been loaded.  We now
+     * need only flush any outstanding requests that may have accumulated in the
+     * ring while we were processing the final POSTCOPY_PAGE_DATA records.
+     */
+    rc = handle_postcopy_paging_requests(ctx);
+    if ( rc )
+        goto err;
+
+    rc = write_postcopy_complete_record(ctx);
+    if ( rc )
+        goto err;
+
+    /*
+     * End-of-stream synchronization: make the receive stream blocking again,
+     * and wait to receive what must be the END record.
+     */
+    rc = fcntl(recv_fd, F_SETFL, old_flags);
+    if ( rc == -1 )
+        goto err;
+
+    rc = read_record(ctx, recv_fd, &rec);
+    if ( rc )
+    {
+        goto err;
+    }
+    else if ( rec.type != REC_TYPE_END )
+    {
+        ERROR("Expected end of stream, received %s", rec_type_to_str(rec.type));
+        rc = -1;
+        goto err;
+    }
+
+ err:
+    /*
+     * If _we_ fail here, we can't safely synchronize with the completion of
+     * domain resumption because it might be waiting for us (to fulfill a pager
+     * request).  Since we therefore can't know whether or not the domain was
+     * unpaused, just abruptly bail and let the sender assume the worst.
+     */
+    free(rec.data);
+    read_record_destroy(&rrctx);
+
+    return rc;
+}
+
+/*
  * Send checkpoint dirty pfn list to primary.
  */
 static int send_checkpoint_dirty_pfn_list(struct xc_sr_context *ctx)
@@ -643,6 +1508,25 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec)
         rc = handle_checkpoint(ctx);
         break;
 
+    case REC_TYPE_POSTCOPY_BEGIN:
+        if ( ctx->restore.postcopy )
+            rc = -1;
+        else
+            ctx->restore.postcopy = true;
+        break;
+
+    case REC_TYPE_POSTCOPY_PFNS_BEGIN:
+        rc = postcopy_paging_setup(ctx);
+        break;
+
+    case REC_TYPE_POSTCOPY_PFNS:
+        rc = handle_postcopy_pfns(ctx, rec);
+        break;
+
+    case REC_TYPE_POSTCOPY_TRANSITION:
+        rc = handle_postcopy_transition(ctx);
+        break;
+
     default:
         rc = ctx->restore.ops.process_record(ctx, rec);
         break;
@@ -715,6 +1599,10 @@ static void cleanup(struct xc_sr_context *ctx)
     if ( ctx->restore.checkpointed == XC_MIG_STREAM_COLO )
         xc_hypercall_buffer_free_pages(xch, dirty_bitmap,
                                    NRPAGES(bitmap_size(ctx->restore.p2m_size)));
+
+    if ( ctx->restore.postcopy )
+        postcopy_paging_cleanup(ctx);
+
     free(ctx->restore.buffered_records);
     free(ctx->restore.populated_pfns);
     if ( ctx->restore.ops.cleanup(ctx) )
@@ -777,7 +1665,8 @@ static int restore(struct xc_sr_context *ctx)
                 goto err;
         }
 
-    } while ( rec.type != REC_TYPE_END );
+    } while ( rec.type != REC_TYPE_END &&
+              rec.type != REC_TYPE_POSTCOPY_TRANSITION );
 
  remus_failover:
 
@@ -788,6 +1677,14 @@ static int restore(struct xc_sr_context *ctx)
         IPRINTF("COLO Failover");
         goto done;
     }
+    else if ( ctx->restore.postcopy )
+    {
+        rc = postcopy_restore(ctx);
+        if ( rc )
+            goto err;
+
+        goto done;
+    }
 
     /*
      * With Remus, if we reach here, there must be some error on primary,
diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c
index 1dca853..5a6b5bf 100644
--- a/tools/libxc/xc_sr_restore_x86_hvm.c
+++ b/tools/libxc/xc_sr_restore_x86_hvm.c
@@ -27,6 +27,27 @@ static int handle_hvm_context(struct xc_sr_context *ctx,
     return 0;
 }
 
+static int handle_hvm_magic_page(struct xc_sr_context *ctx,
+                                 struct xc_sr_rec_hvm_params_entry *entry)
+{
+    int rc;
+    xen_pfn_t pfn = entry->value;
+
+    if ( ctx->restore.postcopy )
+    {
+        rc = populate_pfns(ctx, 1, &pfn, NULL);
+        if ( rc )
+            return rc;
+    }
+
+    if ( entry->index != HVM_PARAM_PAGING_RING_PFN )
+    {
+        xc_clear_domain_page(ctx->xch, ctx->domid, pfn);
+    }
+
+    return 0;
+}
+
 /*
  * Process an HVM_PARAMS record from the stream.
  */
@@ -71,18 +92,32 @@ static int handle_hvm_params(struct xc_sr_context *ctx,
         {
         case HVM_PARAM_CONSOLE_PFN:
             ctx->restore.console_gfn = entry->value;
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
             break;
         case HVM_PARAM_STORE_PFN:
             ctx->restore.xenstore_gfn = entry->value;
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
+            break;
+        case HVM_PARAM_PAGING_RING_PFN:
+            ctx->restore.paging_ring_gfn = entry->value;
+            rc = handle_hvm_magic_page(ctx, entry);
             break;
         case HVM_PARAM_IOREQ_PFN:
         case HVM_PARAM_BUFIOREQ_PFN:
-            xc_clear_domain_page(xch, ctx->domid, entry->value);
+            rc = handle_hvm_magic_page(ctx, entry);
+            break;
+        default:
+            rc = 0;
             break;
         }
 
+        if ( rc )
+        {
+            PERROR("populate/clear magic HVM page %"PRId64" = 0x%016"PRIx64,
+                   entry->index, entry->value);
+            return rc;
+        }
+
         rc = xc_hvm_param_set(xch, ctx->domid, entry->index, entry->value);
         if ( rc < 0 )
         {
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index bffbc45..1354689 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -748,6 +748,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc);
 
+static void domcreate_postcopy_transition_callback(void *user);
+
 static void domcreate_launch_dm(libxl__egc *egc, libxl__multidev *aodevs,
                                 int ret);
 
@@ -1102,6 +1104,13 @@ static void domcreate_bootloader_done(libxl__egc *egc,
             libxl__remus_restore_setup(egc, dcs);
             /* fall through */
         case LIBXL_CHECKPOINTED_STREAM_NONE:
+            /*
+             * When the restore helper initiates the postcopy transition, pick
+             * up in domcreate_postcopy_transition_callback().
+             */
+            callbacks->postcopy_transition =
+                domcreate_postcopy_transition_callback;
+
             libxl__stream_read_start(egc, &dcs->srs);
         }
         return;
@@ -1111,6 +1120,14 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     domcreate_stream_done(egc, &dcs->srs, rc);
 }
 
+/* ----- postcopy live migration ----- */
+
+static void domcreate_postcopy_transition_callback(void *user)
+{
+    /* XXX we're not ready to deal with this yet */
+    assert(0);
+}
+
 void libxl__srm_callout_callback_restore_results(xen_pfn_t store_mfn,
           xen_pfn_t console_mfn, void *user)
 {
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 5647b97..7f59e03 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -34,7 +34,7 @@ our @msgs = (
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
     [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ],
-    [ 11, 'scxA',   "postcopy_transition", [] ]
+    [ 11, 'srcxA',  "postcopy_transition", [] ]
 );
 
 #----------------------------------------
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH RFC v2 16/23] libxl/libxl_stream_write.c: track callback chains with an explicit phase
@ 2018-06-17 10:18 ` Joshua Otto
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

There are three callback chains through libxl_stream_write: the 'normal'
straight-through save path initiated by libxl__stream_write_start(), the
iterated checkpoint path initiated each time by
libxl__stream_write_start_checkpoint(), and the (short) back-channel
checkpoint path initiated by libxl__stream_write_checkpoint_state().
These paths share significant common code but handle failure and
completion slightly differently, so it is necessary to keep track of
the callback chain currently in progress and act accordingly at various
points.

Until now, a collection of booleans in the stream write state has been
used to indicate the current callback chain.  However, the set of
callback chains is better described by an enum, since only one callback
chain can be active at a time.  In anticipation of
the addition of a new chain for postcopy live migration, refactor the
existing logic to use an enum rather than booleans for callback chain
tracking.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl_internal.h     |  7 ++-
 tools/libxl/libxl_stream_write.c | 96 ++++++++++++++++++----------------------
 2 files changed, 48 insertions(+), 55 deletions(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 89de86b..cef2f39 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3211,9 +3211,12 @@ struct libxl__stream_write_state {
     /* Private */
     int rc;
     bool running;
-    bool in_checkpoint;
+    enum {
+        SWS_PHASE_NORMAL,
+        SWS_PHASE_CHECKPOINT,
+        SWS_PHASE_CHECKPOINT_STATE
+    } phase;
     bool sync_teardown;  /* Only used to coordinate shutdown on error path. */
-    bool in_checkpoint_state;
     libxl__save_helper_state shs;
 
     /* Main stream-writing data. */
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index c96a6a2..8f2a1c9 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -89,12 +89,9 @@ static void emulator_context_read_done(libxl__egc *egc,
                                        int rc, int onwrite, int errnoval);
 static void emulator_context_record_done(libxl__egc *egc,
                                          libxl__stream_write_state *stream);
-static void write_end_record(libxl__egc *egc,
-                             libxl__stream_write_state *stream);
+static void write_phase_end_record(libxl__egc *egc,
+                                   libxl__stream_write_state *stream);
 
-/* Event chain unique to checkpointed streams. */
-static void write_checkpoint_end_record(libxl__egc *egc,
-                                        libxl__stream_write_state *stream);
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 
@@ -213,7 +210,7 @@ void libxl__stream_write_init(libxl__stream_write_state *stream)
 
     stream->rc = 0;
     stream->running = false;
-    stream->in_checkpoint = false;
+    stream->phase = SWS_PHASE_NORMAL;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     stream->record_done_callback = NULL;
@@ -294,9 +291,9 @@ void libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                           libxl__stream_write_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
+    assert(stream->phase == SWS_PHASE_NORMAL);
     assert(!stream->back_channel);
-    stream->in_checkpoint = true;
+    stream->phase = SWS_PHASE_CHECKPOINT;
 
     write_emulator_xenstore_record(egc, stream);
 }
@@ -431,12 +428,8 @@ static void emulator_xenstore_record_done(libxl__egc *egc,
 
     if (dss->type == LIBXL_DOMAIN_TYPE_HVM)
         write_emulator_context_record(egc, stream);
-    else {
-        if (stream->in_checkpoint)
-            write_checkpoint_end_record(egc, stream);
-        else
-            write_end_record(egc, stream);
-    }
+    else
+        write_phase_end_record(egc, stream);
 }
 
 static void write_emulator_context_record(libxl__egc *egc,
@@ -534,34 +527,35 @@ static void emulator_context_record_done(libxl__egc *egc,
     free(stream->emu_body);
     stream->emu_body = NULL;
 
-    if (stream->in_checkpoint)
-        write_checkpoint_end_record(egc, stream);
-    else
-        write_end_record(egc, stream);
+    write_phase_end_record(egc, stream);
 }
 
-static void write_end_record(libxl__egc *egc,
-                             libxl__stream_write_state *stream)
+static void write_phase_end_record(libxl__egc *egc,
+                                   libxl__stream_write_state *stream)
 {
     struct libxl__sr_rec_hdr rec;
+    sws_record_done_cb cb;
+    const char *what;
 
     FILLZERO(rec);
-    rec.type = REC_TYPE_END;
-
-    setup_write(egc, stream, "end record",
-                &rec, NULL, stream_success);
-}
-
-static void write_checkpoint_end_record(libxl__egc *egc,
-                                        libxl__stream_write_state *stream)
-{
-    struct libxl__sr_rec_hdr rec;
 
-    FILLZERO(rec);
-    rec.type = REC_TYPE_CHECKPOINT_END;
+    switch (stream->phase) {
+    case SWS_PHASE_NORMAL:
+        rec.type = REC_TYPE_END;
+        what     = "end record";
+        cb       = stream_success;
+        break;
+    case SWS_PHASE_CHECKPOINT:
+        rec.type = REC_TYPE_CHECKPOINT_END;
+        what     = "checkpoint end record";
+        cb       = checkpoint_end_record_done;
+        break;
+    default:
+        /* SWS_PHASE_CHECKPOINT_STATE has no end record */
+        assert(false);
+    }
 
-    setup_write(egc, stream, "checkpoint end record",
-                &rec, NULL, checkpoint_end_record_done);
+    setup_write(egc, stream, what, &rec, NULL, cb);
 }
 
 static void checkpoint_end_record_done(libxl__egc *egc,
@@ -582,21 +576,20 @@ static void stream_complete(libxl__egc *egc,
 {
     assert(stream->running);
 
-    if (stream->in_checkpoint) {
+    switch (stream->phase) {
+    case SWS_PHASE_NORMAL:
+        stream_done(egc, stream, rc);
+        break;
+    case SWS_PHASE_CHECKPOINT:
         assert(rc);
-
         /*
          * If an error is encountered while in a checkpoint, pass it
          * back to libxc.  The failure will come back around to us via
          * libxl__xc_domain_save_done()
          */
         checkpoint_done(egc, stream, rc);
-        return;
-    }
-
-    if (stream->in_checkpoint_state) {
-        assert(rc);
-
+        break;
+    case SWS_PHASE_CHECKPOINT_STATE:
         /*
          * If an error is encountered while in a checkpoint, pass it
          * back to libxc.  The failure will come back around to us via
@@ -606,17 +599,15 @@ static void stream_complete(libxl__egc *egc,
          *    libxl__stream_write_abort()
          */
         checkpoint_state_done(egc, stream, rc);
-        return;
+        break;
     }
-
-    stream_done(egc, stream, rc);
 }
 
 static void stream_done(libxl__egc *egc,
                         libxl__stream_write_state *stream, int rc)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint_state);
+    assert(stream->phase != SWS_PHASE_CHECKPOINT_STATE);
     stream->running = false;
 
     if (stream->emu_carefd)
@@ -640,9 +631,9 @@ static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc)
 {
-    assert(stream->in_checkpoint);
+    assert(stream->phase == SWS_PHASE_CHECKPOINT);
 
-    stream->in_checkpoint = false;
+    stream->phase = SWS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
@@ -699,9 +690,8 @@ void libxl__stream_write_checkpoint_state(libxl__egc *egc,
     struct libxl__sr_rec_hdr rec;
 
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
-    stream->in_checkpoint_state = true;
+    assert(stream->phase == SWS_PHASE_NORMAL);
+    stream->phase = SWS_PHASE_CHECKPOINT_STATE;
 
     FILLZERO(rec);
     rec.type = REC_TYPE_CHECKPOINT_STATE;
@@ -720,8 +710,8 @@ static void write_checkpoint_state_done(libxl__egc *egc,
 static void checkpoint_state_done(libxl__egc *egc,
                                   libxl__stream_write_state *stream, int rc)
 {
-    assert(stream->in_checkpoint_state);
-    stream->in_checkpoint_state = false;
+    assert(stream->phase == SWS_PHASE_CHECKPOINT_STATE);
+    stream->phase = SWS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
-- 
2.7.4



* [PATCH RFC v2 17/23] libxl/libxl_stream_read.c: track callback chains with an explicit phase
@ 2018-06-17 10:18 ` Joshua Otto
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

Apply the same refactoring to libxl_stream_read that the previous patch
applied to libxl_stream_write.  libxl_stream_read already has a notion
of phase for its record-buffering behaviour; this is now combined with
the callback-chain phase.  Again, this is done to support the addition of a new
callback chain for postcopy live migration.

No functional change.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl_internal.h    |  7 ++--
 tools/libxl/libxl_stream_read.c | 83 +++++++++++++++++++++--------------------
 2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index cef2f39..30d5492 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3133,9 +3133,7 @@ struct libxl__stream_read_state {
     /* Private */
     int rc;
     bool running;
-    bool in_checkpoint;
     bool sync_teardown; /* Only used to coordinate shutdown on error path. */
-    bool in_checkpoint_state;
     libxl__save_helper_state shs;
     libxl__conversion_helper_state chs;
 
@@ -3145,8 +3143,9 @@ struct libxl__stream_read_state {
     LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
     enum {
         SRS_PHASE_NORMAL,
-        SRS_PHASE_BUFFERING,
-        SRS_PHASE_UNBUFFERING,
+        SRS_PHASE_CHECKPOINT_BUFFERING,
+        SRS_PHASE_CHECKPOINT_UNBUFFERING,
+        SRS_PHASE_CHECKPOINT_STATE
     } phase;
     bool recursion_guard;
 
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 89c2f21..4cb553e 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -29,14 +29,15 @@
  * processed, and all records will be processed in queue order.
  *
  * Internal states:
- *           running  phase       in_         record   incoming
- *                                checkpoint  _queue   _record
+ *           running  phase                   record   incoming
+ *                                            _queue   _record
  *
- * Undefined    undef  undef        undef       undef    undef
- * Idle         false  undef        false       0        0
- * Active       true   NORMAL       false       0/1      0/partial
- * Active       true   BUFFERING    true        any      0/partial
- * Active       true   UNBUFFERING  true        any      0
+ * Undefined    undef  undef                    undef    undef
+ * Idle         false  undef                    0        0
+ * Active       true   NORMAL                   0/1      0/partial
+ * Active       true   CHECKPOINT_BUFFERING     any      0/partial
+ * Active       true   CHECKPOINT_UNBUFFERING   any      0
+ * Active       true   CHECKPOINT_STATE         0/1      0/partial
  *
  * While reading data from the stream, 'dc' is active and a callback
  * is expected.  Most actions in process_record() start a callback of
@@ -48,12 +49,12 @@
  *   Records are read one at time and immediately processed.  (The
  *   record queue will not contain more than a single record.)
  *
- * PHASE_BUFFERING:
+ * PHASE_CHECKPOINT_BUFFERING:
  *   This phase is used in checkpointed streams, when libxc signals
  *   the presence of a checkpoint in the stream.  Records are read and
  *   buffered until a CHECKPOINT_END record has been read.
  *
- * PHASE_UNBUFFERING:
+ * PHASE_CHECKPOINT_UNBUFFERING:
  *   Once a CHECKPOINT_END record has been read, all buffered records
  *   are processed.
  *
@@ -172,6 +173,12 @@ static void checkpoint_state_done(libxl__egc *egc,
 
 /*----- Helpers -----*/
 
+static inline bool stream_in_checkpoint(libxl__stream_read_state *stream)
+{
+    return stream->phase == SRS_PHASE_CHECKPOINT_BUFFERING ||
+           stream->phase == SRS_PHASE_CHECKPOINT_UNBUFFERING;
+}
+
 /* Helper to set up reading some data from the stream. */
 static int setup_read(libxl__stream_read_state *stream,
                       const char *what, void *ptr, size_t nr_bytes,
@@ -210,7 +217,6 @@ void libxl__stream_read_init(libxl__stream_read_state *stream)
 
     stream->rc = 0;
     stream->running = false;
-    stream->in_checkpoint = false;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     FILLZERO(stream->hdr);
@@ -297,10 +303,9 @@ void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
+    assert(stream->phase == SRS_PHASE_NORMAL);
 
-    stream->in_checkpoint = true;
-    stream->phase = SRS_PHASE_BUFFERING;
+    stream->phase = SRS_PHASE_CHECKPOINT_BUFFERING;
 
     /*
      * Libxc has handed control of the fd to us.  Start reading some
@@ -392,6 +397,7 @@ static void stream_continue(libxl__egc *egc,
 
     switch (stream->phase) {
     case SRS_PHASE_NORMAL:
+    case SRS_PHASE_CHECKPOINT_STATE:
         /*
          * Normal phase (regular migration or restore from file):
          *
@@ -416,9 +422,9 @@ static void stream_continue(libxl__egc *egc,
         }
         break;
 
-    case SRS_PHASE_BUFFERING: {
+    case SRS_PHASE_CHECKPOINT_BUFFERING: {
         /*
-         * Buffering phase (checkpointed streams only):
+         * Buffering phase:
          *
          * logically:
          *   do { read_record(); } while ( not CHECKPOINT_END );
@@ -431,8 +437,6 @@ static void stream_continue(libxl__egc *egc,
         libxl__sr_record_buf *rec = LIBXL_STAILQ_LAST(
             &stream->record_queue, libxl__sr_record_buf, entry);
 
-        assert(stream->in_checkpoint);
-
         if (!rec || (rec->hdr.type != REC_TYPE_CHECKPOINT_END)) {
             setup_read_record(egc, stream);
             break;
@@ -442,19 +446,18 @@ static void stream_continue(libxl__egc *egc,
          * There are now some number of buffered records, with a
          * CHECKPOINT_END at the end. Start processing them all.
          */
-        stream->phase = SRS_PHASE_UNBUFFERING;
+        stream->phase = SRS_PHASE_CHECKPOINT_UNBUFFERING;
     }
         /* FALLTHROUGH */
-    case SRS_PHASE_UNBUFFERING:
+    case SRS_PHASE_CHECKPOINT_UNBUFFERING:
         /*
-         * Unbuffering phase (checkpointed streams only):
+         * Unbuffering phase:
          *
          * logically:
          *   do { process_record(); } while ( not CHECKPOINT_END );
          *
          * Process all records collected during the buffering phase.
          */
-        assert(stream->in_checkpoint);
 
         while (process_record(egc, stream))
             ; /*
@@ -625,7 +628,7 @@ static bool process_record(libxl__egc *egc,
         break;
 
     case REC_TYPE_CHECKPOINT_END:
-        if (!stream->in_checkpoint) {
+        if (!stream_in_checkpoint(stream)) {
             LOG(ERROR, "Unexpected CHECKPOINT_END record in stream");
             rc = ERROR_FAIL;
             goto err;
@@ -634,7 +637,7 @@ static bool process_record(libxl__egc *egc,
         break;
 
     case REC_TYPE_CHECKPOINT_STATE:
-        if (!stream->in_checkpoint_state) {
+        if (stream->phase != SRS_PHASE_CHECKPOINT_STATE) {
             LOG(ERROR, "Unexpected CHECKPOINT_STATE record in stream");
             rc = ERROR_FAIL;
             goto err;
@@ -743,7 +746,12 @@ static void stream_complete(libxl__egc *egc,
 {
     assert(stream->running);
 
-    if (stream->in_checkpoint) {
+    switch (stream->phase) {
+    case SRS_PHASE_NORMAL:
+        stream_done(egc, stream, rc);
+        break;
+    case SRS_PHASE_CHECKPOINT_BUFFERING:
+    case SRS_PHASE_CHECKPOINT_UNBUFFERING:
         assert(rc);
 
         /*
@@ -752,10 +760,8 @@ static void stream_complete(libxl__egc *egc,
          * libxl__xc_domain_restore_done()
          */
         checkpoint_done(egc, stream, rc);
-        return;
-    }
-
-    if (stream->in_checkpoint_state) {
+        break;
+    case SRS_PHASE_CHECKPOINT_STATE:
         assert(rc);
 
         /*
@@ -767,10 +773,8 @@ static void stream_complete(libxl__egc *egc,
          *    libxl__stream_read_abort()
          */
         checkpoint_state_done(egc, stream, rc);
-        return;
+        break;
     }
-
-    stream_done(egc, stream, rc);
 }
 
 static void checkpoint_done(libxl__egc *egc,
@@ -778,18 +782,17 @@ static void checkpoint_done(libxl__egc *egc,
 {
     int ret;
 
-    assert(stream->in_checkpoint);
+    assert(stream_in_checkpoint(stream));
 
     if (rc == 0)
         ret = XGR_CHECKPOINT_SUCCESS;
-    else if (stream->phase == SRS_PHASE_BUFFERING)
+    else if (stream->phase == SRS_PHASE_CHECKPOINT_BUFFERING)
         ret = XGR_CHECKPOINT_FAILOVER;
     else
         ret = XGR_CHECKPOINT_ERROR;
 
     stream->checkpoint_callback(egc, stream, ret);
 
-    stream->in_checkpoint = false;
     stream->phase = SRS_PHASE_NORMAL;
 }
 
@@ -799,8 +802,7 @@ static void stream_done(libxl__egc *egc,
     libxl__sr_record_buf *rec, *trec;
 
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
+    assert(stream->phase == SRS_PHASE_NORMAL);
     stream->running = false;
 
     if (stream->incoming_record)
@@ -955,9 +957,8 @@ void libxl__stream_read_checkpoint_state(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
     assert(stream->running);
-    assert(!stream->in_checkpoint);
-    assert(!stream->in_checkpoint_state);
-    stream->in_checkpoint_state = true;
+    assert(stream->phase == SRS_PHASE_NORMAL);
+    stream->phase = SRS_PHASE_CHECKPOINT_STATE;
 
     setup_read_record(egc, stream);
 }
@@ -965,8 +966,8 @@ void libxl__stream_read_checkpoint_state(libxl__egc *egc,
 static void checkpoint_state_done(libxl__egc *egc,
                                   libxl__stream_read_state *stream, int rc)
 {
-    assert(stream->in_checkpoint_state);
-    stream->in_checkpoint_state = false;
+    assert(stream->phase == SRS_PHASE_CHECKPOINT_STATE);
+    stream->phase = SRS_PHASE_NORMAL;
     stream->checkpoint_callback(egc, stream, rc);
 }
 
-- 
2.7.4



* [PATCH RFC v2 18/23] libxl/migration: implement the sender side of postcopy live migration
@ 2018-06-17 10:18 ` Joshua Otto
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

To make the libxl sender capable of supporting postcopy live migration:
- Add a postcopy transition callback chain through the stream writer (this
  callback chain is nearly identical to the checkpoint callback chain, and
  differs meaningfully only in its failure/completion behaviour)
- Wire this callback chain up to the xc postcopy callback entries in the domain
  save logic.
- Introduce a new libxl API function, libxl_domain_live_migrate(),
  taking the same parameters as libxl_domain_suspend(), plus a recv_fd
  to enable bi-directional communication between the sender and
  receiver, and a boolean out-parameter that lets the caller reason
  about the safety of recovery from a postcopy failure.  (The
  live_migrate() and domain_suspend() parameter lists will likely only
  continue to diverge over time, so it makes sense to split them now.)

No mechanism is introduced yet to enable library clients to induce a postcopy
live migration - this will follow after the libxl postcopy receiver logic.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 docs/specs/libxl-migration-stream.pandoc | 19 ++++++++-
 tools/libxl/libxl.h                      |  7 ++++
 tools/libxl/libxl_dom_save.c             | 25 +++++++++++-
 tools/libxl/libxl_domain.c               | 29 +++++++++++++-
 tools/libxl/libxl_internal.h             | 21 ++++++++--
 tools/libxl/libxl_sr_stream_format.h     | 13 +++---
 tools/libxl/libxl_stream_write.c         | 69 ++++++++++++++++++++++++++++++--
 tools/xl/xl_migrate.c                    |  6 ++-
 8 files changed, 169 insertions(+), 20 deletions(-)

diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc
index a1ba1ac..8d00cd7 100644
--- a/docs/specs/libxl-migration-stream.pandoc
+++ b/docs/specs/libxl-migration-stream.pandoc
@@ -2,7 +2,8 @@
 % Andrew Cooper <<andrew.cooper3@citrix.com>>
   Wen Congyang <<wency@cn.fujitsu.com>>
   Yang Hongyang <<hongyang.yang@easystack.cn>>
-% Revision 2
+  Joshua Otto <<jtotto@uwaterloo.ca>>
+% Revision 3
 
 Introduction
 ============
@@ -123,7 +124,9 @@ type         0x00000000: END
 
              0x00000005: CHECKPOINT_STATE
 
-             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
+             0x00000006: POSTCOPY_TRANSITION_END
+
+             0x00000007 - 0x7FFFFFFF: Reserved for future _mandatory_
              records.
 
              0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
@@ -304,6 +307,18 @@ While Secondary is running in below loop:
     b. Send _CHECKPOINT\_SVM\_SUSPENDED_ to primary
 4. Checkpoint
 
+POSTCOPY\_TRANSITION\_END
+-------------------------
+
+A postcopy transition end record marks the end of a postcopy transition in a
+libxl live migration stream.  It indicates that control of the stream should be
+returned to libxc for the postcopy memory migration phase.
+
+     0     1     2     3     4     5     6     7 octet
+    +-------------------------------------------------+
+
+The postcopy transition end record contains no fields; its body_length is 0.
+
 Future Extensions
 =================
 
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index cf8687a..5e48862 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1387,6 +1387,13 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd,
 #define LIBXL_SUSPEND_DEBUG 1
 #define LIBXL_SUSPEND_LIVE 2
 
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
+                              int flags, /* LIBXL_SUSPEND_* */
+                              int recv_fd,
+                              bool *postcopy_transitioned, /* OUT */
+                              const libxl_asyncop_how *ao_how)
+                              LIBXL_EXTERNAL_CALLERS_ONLY;
+
 /* @param suspend_cancel [from xenctrl.h:xc_domain_resume( @param fast )]
  *   If this parameter is true, use co-operative resume. The guest
  *   must support this.
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index eb1271e..75ab523 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -350,10 +350,31 @@ static int libxl__save_live_migration_precopy_policy(
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *sws, int rc);
+
 static void libxl__save_live_migration_postcopy_transition_callback(void *user)
 {
-    /* XXX we're not yet ready to deal with this */
-    assert(0);
+    libxl__save_helper_state *shs = user;
+    libxl__stream_write_state *sws = CONTAINER_OF(shs, *sws, shs);
+    sws->postcopy_transition_callback = postcopy_transition_done;
+    libxl__stream_write_start_postcopy_transition(shs->egc, sws);
+}
+
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *sws,
+                                     int rc)
+{
+    libxl__domain_save_state *dss = sws->dss;
+
+    /* Past here, it's _possible_ that the domain may execute at the
+     * destination, so - unless we're given positive confirmation by the
+     * destination that it failed to resume there - we must assume it has. */
+    assert(dss->postcopy_transitioned);
+    *dss->postcopy_transitioned = !rc;
+
+    /* Return control to libxc. */
+    libxl__xc_domain_saverestore_async_callback_done(egc, &sws->shs, !rc);
 }
 
 /*----- main code for saving, in order of execution -----*/
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index 08eccd0..fc37f47 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -486,8 +486,9 @@ static void domain_suspend_cb(libxl__egc *egc,
 
 }
 
-int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
-                         const libxl_asyncop_how *ao_how)
+static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                             int recv_fd, bool *postcopy_transitioned,
+                             const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
     int rc;
@@ -506,6 +507,8 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
 
     dss->domid = domid;
     dss->fd = fd;
+    dss->recv_fd = recv_fd;
+    dss->postcopy_transitioned = postcopy_transitioned;
     dss->type = type;
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
@@ -523,6 +526,28 @@ int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     return AO_CREATE_FAIL(rc);
 }
 
+int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+                         const libxl_asyncop_how *ao_how)
+{
+    return do_domain_suspend(ctx, domid, fd, flags, -1, NULL, ao_how);
+}
+
+int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
+                              int flags, int recv_fd,
+                              bool *postcopy_transitioned,
+                              const libxl_asyncop_how *ao_how)
+{
+    if (!postcopy_transitioned) {
+        errno = EINVAL;
+        return -1;
+    }
+
+    flags |= LIBXL_SUSPEND_LIVE;
+
+    return do_domain_suspend(ctx, domid, send_fd, flags, recv_fd,
+                             postcopy_transitioned, ao_how);
+}
+
 int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid)
 {
     int ret;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 30d5492..c8ea3ba 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3204,17 +3204,25 @@ struct libxl__stream_write_state {
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_write_state *sws,
                                 int rc);
-    void (*checkpoint_callback)(libxl__egc *egc,
-                                libxl__stream_write_state *sws,
-                                int rc);
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        void (*checkpoint_callback)(libxl__egc *egc,
+                                    libxl__stream_write_state *sws,
+                                    int rc);
+        void (*postcopy_transition_callback)(libxl__egc *egc,
+                                             libxl__stream_write_state *sws,
+                                             int rc);
+    };
     /* Private */
     int rc;
     bool running;
     enum {
         SWS_PHASE_NORMAL,
         SWS_PHASE_CHECKPOINT,
-        SWS_PHASE_CHECKPOINT_STATE
+        SWS_PHASE_CHECKPOINT_STATE,
+        SWS_PHASE_POSTCOPY_TRANSITION
     } phase;
+    bool postcopy_transitioned;
     bool sync_teardown;  /* Only used to coordinate shutdown on error path. */
     libxl__save_helper_state shs;
 
@@ -3237,6 +3245,10 @@ _hidden void libxl__stream_write_init(libxl__stream_write_state *stream);
 _hidden void libxl__stream_write_start(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 _hidden void
+libxl__stream_write_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream);
+_hidden void
 libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                      libxl__stream_write_state *stream);
 _hidden void
@@ -3300,6 +3312,7 @@ struct libxl__domain_save_state {
     int fd;
     int fdfl; /* original flags on fd */
     int recv_fd;
+    bool *postcopy_transitioned;
     libxl_domain_type type;
     int live;
     int debug;
diff --git a/tools/libxl/libxl_sr_stream_format.h b/tools/libxl/libxl_sr_stream_format.h
index 75f5190..a789126 100644
--- a/tools/libxl/libxl_sr_stream_format.h
+++ b/tools/libxl/libxl_sr_stream_format.h
@@ -31,12 +31,13 @@ typedef struct libxl__sr_rec_hdr
 /* All records must be aligned up to an 8 octet boundary */
 #define REC_ALIGN_ORDER              3U
 
-#define REC_TYPE_END                    0x00000000U
-#define REC_TYPE_LIBXC_CONTEXT          0x00000001U
-#define REC_TYPE_EMULATOR_XENSTORE_DATA 0x00000002U
-#define REC_TYPE_EMULATOR_CONTEXT       0x00000003U
-#define REC_TYPE_CHECKPOINT_END         0x00000004U
-#define REC_TYPE_CHECKPOINT_STATE       0x00000005U
+#define REC_TYPE_END                     0x00000000U
+#define REC_TYPE_LIBXC_CONTEXT           0x00000001U
+#define REC_TYPE_EMULATOR_XENSTORE_DATA  0x00000002U
+#define REC_TYPE_EMULATOR_CONTEXT        0x00000003U
+#define REC_TYPE_CHECKPOINT_END          0x00000004U
+#define REC_TYPE_CHECKPOINT_STATE        0x00000005U
+#define REC_TYPE_POSTCOPY_TRANSITION_END 0x00000006U
 
 typedef struct libxl__sr_emulator_hdr
 {
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index 8f2a1c9..1c4b1f1 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -22,6 +22,9 @@
  * Entry points from outside:
  *  - libxl__stream_write_start()
  *     - Start writing a stream from the start.
+ *  - libxl__stream_write_postcopy_transition()
+ *     - Write the records required to permit postcopy resumption at the
+ *       migration target.
  *  - libxl__stream_write_start_checkpoint()
  *     - Write the records which form a checkpoint into a stream.
  *
@@ -65,6 +68,9 @@ static void stream_complete(libxl__egc *egc,
                             libxl__stream_write_state *stream, int rc);
 static void stream_done(libxl__egc *egc,
                         libxl__stream_write_state *stream, int rc);
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *stream,
+                                     int rc);
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc);
@@ -91,7 +97,9 @@ static void emulator_context_record_done(libxl__egc *egc,
                                          libxl__stream_write_state *stream);
 static void write_phase_end_record(libxl__egc *egc,
                                    libxl__stream_write_state *stream);
-
+static void postcopy_transition_end_record_done(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream);
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream);
 
@@ -211,6 +219,7 @@ void libxl__stream_write_init(libxl__stream_write_state *stream)
     stream->rc = 0;
     stream->running = false;
     stream->phase = SWS_PHASE_NORMAL;
+    stream->postcopy_transitioned = false;
     stream->sync_teardown = false;
     FILLZERO(stream->dc);
     stream->record_done_callback = NULL;
@@ -287,6 +296,22 @@ void libxl__stream_write_start(libxl__egc *egc,
     stream_complete(egc, stream, rc);
 }
 
+void libxl__stream_write_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream)
+{
+    libxl__domain_save_state *dss = stream->dss;
+
+    assert(stream->running);
+    assert(dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE);
+    assert(stream->phase == SWS_PHASE_NORMAL);
+    assert(!stream->postcopy_transitioned);
+
+    stream->phase = SWS_PHASE_POSTCOPY_TRANSITION;
+
+    write_emulator_xenstore_record(egc, stream);
+}
+
 void libxl__stream_write_start_checkpoint(libxl__egc *egc,
                                           libxl__stream_write_state *stream)
 {
@@ -369,7 +394,7 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
      * If the stream is not still alive, we must not continue any work.
      */
     if (libxl__stream_write_inuse(stream)) {
-        if (dss->checkpointed_stream != LIBXL_CHECKPOINTED_STREAM_NONE)
+        if (dss->checkpointed_stream != LIBXL_CHECKPOINTED_STREAM_NONE) {
             /*
              * For remus, if libxl__xc_domain_save_done() completes,
              * there was an error sending data to the secondary.
@@ -377,8 +402,17 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
              * return value (Please refer to libxl__remus_teardown())
              */
             stream_complete(egc, stream, 0);
-        else
+        } else if (stream->postcopy_transitioned) {
+            /*
+             * If, on the other hand, this is a normal migration that had a
+             * postcopy migration stage, we're completely done at this point and
+             * want to report any error received here to our caller.
+             */
+            assert(stream->phase == SWS_PHASE_NORMAL);
+            write_phase_end_record(egc, stream);
+        } else {
             write_emulator_xenstore_record(egc, stream);
+        }
     }
 }
 
@@ -550,6 +584,11 @@ static void write_phase_end_record(libxl__egc *egc,
         what     = "checkpoint end record";
         cb       = checkpoint_end_record_done;
         break;
+    case SWS_PHASE_POSTCOPY_TRANSITION:
+        rec.type = REC_TYPE_POSTCOPY_TRANSITION_END;
+        what     = "postcopy transition end record";
+        cb       = postcopy_transition_end_record_done;
+        break;
     default:
         /* SWS_PHASE_CHECKPOINT_STATE has no end record */
         assert(false);
@@ -558,6 +597,13 @@ static void write_phase_end_record(libxl__egc *egc,
     setup_write(egc, stream, what, &rec, NULL, cb);
 }
 
+static void postcopy_transition_end_record_done(
+    libxl__egc *egc,
+    libxl__stream_write_state *stream)
+{
+    postcopy_transition_done(egc, stream, 0);
+}
+
 static void checkpoint_end_record_done(libxl__egc *egc,
                                        libxl__stream_write_state *stream)
 {
@@ -600,6 +646,13 @@ static void stream_complete(libxl__egc *egc,
          */
         checkpoint_state_done(egc, stream, rc);
         break;
+    case SWS_PHASE_POSTCOPY_TRANSITION:
+        /*
+         * To deal with errors during the postcopy transition, we use the same
+         * strategy as during checkpoints.
+         */
+        postcopy_transition_done(egc, stream, rc);
+        break;
     }
 }
 
@@ -627,6 +680,16 @@ static void stream_done(libxl__egc *egc,
     }
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_write_state *stream,
+                                     int rc)
+{
+    assert(stream->phase == SWS_PHASE_POSTCOPY_TRANSITION);
+    stream->postcopy_transitioned = true;
+    stream->phase = SWS_PHASE_NORMAL;
+    stream->postcopy_transition_callback(egc, stream, rc);
+}
+
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_write_state *stream,
                             int rc)
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 1f0e87d..9656204 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -186,6 +186,7 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
     char rc_buf;
     uint8_t *config_data;
     int config_len, flags = LIBXL_SUSPEND_LIVE;
+    bool postcopy_transitioned;
 
     save_domain_core_begin(domid, override_config_file,
                            &config_data, &config_len);
@@ -205,7 +206,10 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
 
     if (debug)
         flags |= LIBXL_SUSPEND_DEBUG;
-    rc = libxl_domain_suspend(ctx, domid, send_fd, flags, NULL);
+    rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
+                                   recv_fd, &postcopy_transitioned, NULL);
+    assert(!postcopy_transitioned);
+
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH RFC v2 19/23] libxl/migration: implement the receiver side of postcopy live migration
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (17 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 18/23] libxl/migration: implement the sender side of postcopy live migration Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 20/23] tools: expose postcopy live migration support in libxl and xl Joshua Otto
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

To make the libxl receiver capable of supporting postcopy live
migration:
- As was done for the libxl stream writer, add a symmetric callback
  chain through the stream reader that reads the sequence of libxl records
  necessary to resume the guest and enter the postcopy phase.  This
  chain is very similar to the checkpoint chain.
- Add a new postcopy path through the domain creation sequence that
  permits the xc memory postcopy phase to proceed in parallel to the
  libxl domain creation and resumption sequence.
- Add an out-parameter to libxl_domain_create_restore(),
  'postcopy_resumed', that callers can test to determine whether or not
  further action is required on their part post-migration to get the
  guest running.

A subsequent patch will introduce a mechanism by which library clients
can _induce_ a postcopy live migration.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h                  |  28 +++++-
 tools/libxl/libxl_create.c           | 178 +++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_internal.h         |  45 ++++++++-
 tools/libxl/libxl_stream_read.c      |  57 +++++++++++
 tools/ocaml/libs/xl/xenlight_stubs.c |   2 +-
 tools/xl/xl_vmcontrol.c              |   2 +-
 6 files changed, 297 insertions(+), 15 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 5e48862..70441cf 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1308,6 +1308,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
 int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 uint32_t *domid, int restore_fd,
                                 int send_back_fd,
+                                bool *postcopy_resumed, /* OUT */
                                 const libxl_domain_restore_params *params,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
@@ -1327,8 +1328,9 @@ static inline int libxl_domain_create_restore_0x040200(
 
     libxl_domain_restore_params_init(&params);
 
-    ret = libxl_domain_create_restore(
-        ctx, d_config, domid, restore_fd, -1, &params, ao_how, aop_console_how);
+    ret = libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
+                                      -1, NULL, &params, ao_how,
+                                      aop_console_how);
 
     libxl_domain_restore_params_dispose(&params);
     return ret;
@@ -1348,11 +1350,31 @@ static inline int libxl_domain_create_restore_0x040400(
     LIBXL_EXTERNAL_CALLERS_ONLY
 {
     return libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
-                                       -1, params, ao_how, aop_console_how);
+                                       -1, NULL, params, ao_how,
+                                       aop_console_how);
 }
 
 #define libxl_domain_create_restore libxl_domain_create_restore_0x040400
 
+#elif defined(LIBXL_API_VERSION) && LIBXL_API_VERSION >= 0x040700 \
+                                 && LIBXL_API_VERSION < 0x040900
+
+static inline int libxl_domain_create_restore_0x040700(
+    libxl_ctx *ctx, libxl_domain_config *d_config,
+    uint32_t *domid, int restore_fd,
+    int send_back_fd,
+    const libxl_domain_restore_params *params,
+    const libxl_asyncop_how *ao_how,
+    const libxl_asyncprogress_how *aop_console_how)
+    LIBXL_EXTERNAL_CALLERS_ONLY
+{
+    return libxl_domain_create_restore(ctx, d_config, domid, restore_fd,
+                                       -1, NULL, params, ao_how,
+                                       aop_console_how);
+}
+
+#define libxl_domain_create_restore libxl_domain_create_restore_0x040700
+
 #endif
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 1354689..227fdfb 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -748,8 +748,24 @@ static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc);
 
+/*
+ * If a postcopy migration is initiated by the sending side during a live
+ * migration, this function returns control of the stream to the stream reader
+ * so it can finish the libxl stream.
+ */
 static void domcreate_postcopy_transition_callback(void *user);
 
+/*
+ * When the stream reader postcopy transition completes, this callback is
+ * invoked.  It transfers control of the restore stream back to the helper.
+ */
+void domcreate_postcopy_transition_complete_callback(
+    libxl__egc *egc, libxl__stream_read_state *srs, int rc);
+
+static void domcreate_postcopy_stream_done(libxl__egc *egc,
+                                           libxl__stream_read_state *srs,
+                                           int ret);
+
 static void domcreate_launch_dm(libxl__egc *egc, libxl__multidev *aodevs,
                                 int ret);
 
@@ -776,6 +792,10 @@ static void domcreate_destruction_cb(libxl__egc *egc,
                                      libxl__domain_destroy_state *dds,
                                      int rc);
 
+static void domcreate_report_result(libxl__egc *egc,
+                                    libxl__domain_create_state *dcs,
+                                    int rc);
+
 static void initiate_domain_create(libxl__egc *egc,
                                    libxl__domain_create_state *dcs)
 {
@@ -1111,6 +1131,15 @@ static void domcreate_bootloader_done(libxl__egc *egc,
             callbacks->postcopy_transition =
                 domcreate_postcopy_transition_callback;
 
+            /*
+             * When the stream reader is finished reading the postcopy
+             * transition, we'll find out in the
+             * domcreate_postcopy_transition_complete_callback(), where we'll
+             * hand control of the stream back to the libxc helper.
+             */
+            dcs->srs.postcopy_transition_callback =
+                domcreate_postcopy_transition_complete_callback;
+
             libxl__stream_read_start(egc, &dcs->srs);
         }
         return;
@@ -1124,8 +1153,81 @@ static void domcreate_bootloader_done(libxl__egc *egc,
 
 static void domcreate_postcopy_transition_callback(void *user)
 {
-    /* XXX we're not ready to deal with this yet */
-    assert(0);
+    libxl__save_helper_state *shs = user;
+    libxl__domain_create_state *dcs = shs->caller_state;
+    libxl__stream_read_state *srs = &dcs->srs;
+
+    libxl__stream_read_start_postcopy_transition(shs->egc, srs);
+}
+
+void domcreate_postcopy_transition_complete_callback(
+    libxl__egc *egc, libxl__stream_read_state *srs, int rc)
+{
+    libxl__domain_create_state *dcs = srs->dcs;
+
+    if (!rc)
+        srs->completion_callback = domcreate_postcopy_stream_done;
+
+     /*
+      * If all is well (for now) we'll find out about the eventual termination
+      * of the restore helper/stream through domcreate_postcopy_stream_done().
+      * Otherwise, we'll find out sooner through domcreate_stream_done().
+      */
+    libxl__xc_domain_saverestore_async_callback_done(egc, &srs->shs, !rc);
+
+    if (!rc) {
+        /* In parallel, resume the guest. */
+        dcs->postcopy.active = true;
+        dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_INPROGRESS;
+        dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_INPROGRESS;
+        domcreate_stream_done(egc, srs, 0);
+    }
+}
+
+static void domcreate_postcopy_stream_done(libxl__egc *egc,
+                                           libxl__stream_read_state *srs,
+                                           int ret)
+{
+    libxl__domain_create_state *dcs = srs->dcs;
+
+    EGC_GC;
+
+    assert(dcs->postcopy.stream.state == DCS_POSTCOPY_STREAM_INPROGRESS);
+
+    switch (dcs->postcopy.resume.state) {
+    case DCS_POSTCOPY_RESUME_INPROGRESS:
+        if (ret) {
+            /*
+             * The stream failed, and the resumption is still in progress.
+             * Stash our return code for resumption to find later.
+             */
+            dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_FAILED;
+            dcs->postcopy.stream.rc = ret;
+        } else {
+            /*
+             * We've successfully completed, but the resumption is still humming
+             * away.
+             */
+            dcs->postcopy.stream.state = DCS_POSTCOPY_STREAM_SUCCESS;
+
+            /* Just let it finish.  Nothing to do for now. */
+            LOG(INFO, "Postcopy stream completed _before_ domain unpaused");
+        }
+        break;
+    case DCS_POSTCOPY_RESUME_FAILED:
+        /* The resumption failed first, so report its result. */
+        dcs->callback(egc, dcs, dcs->postcopy.resume.rc, dcs->guest_domid);
+        break;
+    case DCS_POSTCOPY_RESUME_SUCCESS:
+        /*
+         * This is the expected case - resumption completed, and some time later
+         * the final postcopy pages were migrated and the stream wrapped up.
+         * We're now totally done!
+         */
+        LOG(INFO, "Postcopy stream completed after domain unpaused");
+        dcs->callback(egc, dcs, ret, dcs->guest_domid);
+        break;
+    }
 }
 
 void libxl__srm_callout_callback_restore_results(xen_pfn_t store_mfn,
@@ -1582,7 +1684,8 @@ static void domcreate_complete(libxl__egc *egc,
         }
         dcs->guest_domid = -1;
     }
-    dcs->callback(egc, dcs, rc, dcs->guest_domid);
+
+    domcreate_report_result(egc, dcs, rc);
 }
 
 static void domcreate_destruction_cb(libxl__egc *egc,
@@ -1595,7 +1698,63 @@ static void domcreate_destruction_cb(libxl__egc *egc,
     if (rc)
         LOGD(ERROR, dds->domid, "unable to destroy domain following failed creation");
 
-    dcs->callback(egc, dcs, ERROR_FAIL, dcs->guest_domid);
+    domcreate_report_result(egc, dcs, ERROR_FAIL);
+}
+
+static void domcreate_report_result(libxl__egc *egc,
+                                    libxl__domain_create_state *dcs,
+                                    int rc)
+{
+    EGC_GC;
+
+    if (!dcs->postcopy.active) {
+        /*
+         * If we aren't presently in the process of completing a postcopy
+         * resumption (the norm), everything is all cleaned up and we can report
+         * our result directly.
+         */
+        LOG(INFO, "No postcopy at all");
+        dcs->callback(egc, dcs, rc, dcs->guest_domid);
+    } else {
+        switch (dcs->postcopy.stream.state) {
+        case DCS_POSTCOPY_STREAM_INPROGRESS:
+        case DCS_POSTCOPY_STREAM_SUCCESS:
+            /* If we haven't yet failed, try to unpause the guest. */
+            rc = rc ?: libxl_domain_unpause(CTX, dcs->guest_domid);
+            if (dcs->postcopy_resumed)
+                *dcs->postcopy_resumed = !rc;
+
+            if (dcs->postcopy.stream.state == DCS_POSTCOPY_STREAM_SUCCESS) {
+                /*
+                 * The stream finished successfully, so we can report our local
+                 * result as the overall result.
+                 */
+                dcs->callback(egc, dcs, rc, dcs->guest_domid);
+                LOG(INFO, "Postcopy domain unpaused after stream completed");
+            } else if (rc) {
+                /*
+                 * The stream isn't done yet, but we failed.  Tell it to bail,
+                 * and stash our return code for the postcopy stream completion
+                 * callback to find.
+                 */
+                dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_FAILED;
+                dcs->postcopy.resume.rc = rc;
+
+                libxl__stream_read_abort(egc, &dcs->srs, -1);
+            } else {
+                dcs->postcopy.resume.state = DCS_POSTCOPY_RESUME_SUCCESS;
+                LOG(INFO, "Postcopy domain unpaused before stream completed");
+            }
+            break;
+        case DCS_POSTCOPY_STREAM_FAILED:
+            /*
+             * The stream failed.  Now that we're done, tie things up by
+             * reporting the stream's result.
+             */
+            dcs->callback(egc, dcs, dcs->postcopy.stream.rc, dcs->guest_domid);
+            break;
+        }
+    }
 }
 
 /*----- application-facing domain creation interface -----*/
@@ -1619,6 +1778,7 @@ static void domain_create_cb(libxl__egc *egc,
 
 static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid, int restore_fd, int send_back_fd,
+                            bool *postcopy_resumed, /* OUT */
                             const libxl_domain_restore_params *params,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
@@ -1627,6 +1787,9 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     libxl__app_domain_create_state *cdcs;
     int rc;
 
+    if (postcopy_resumed)
+        *postcopy_resumed = false;
+
     GCNEW(cdcs);
     cdcs->dcs.ao = ao;
     cdcs->dcs.guest_config = d_config;
@@ -1641,6 +1804,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
                                          &cdcs->dcs.restore_fdfl);
         if (rc < 0) goto out_err;
     }
+    cdcs->dcs.postcopy_resumed = postcopy_resumed;
     cdcs->dcs.callback = domain_create_cb;
     cdcs->dcs.domid_soft_reset = INVALID_DOMID;
 
@@ -1862,13 +2026,13 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncprogress_how *aop_console_how)
 {
     unset_disk_colo_restore(d_config);
-    return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
+    return do_domain_create(ctx, d_config, domid, -1, -1, NULL, NULL,
                             ao_how, aop_console_how);
 }
 
 int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 uint32_t *domid, int restore_fd,
-                                int send_back_fd,
+                                int send_back_fd, bool *postcopy_resumed,
                                 const libxl_domain_restore_params *params,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
@@ -1880,7 +2044,7 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
     }
 
     return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
-                            params, ao_how, aop_console_how);
+                            postcopy_resumed, params, ao_how, aop_console_how);
 }
 
 int libxl_domain_soft_reset(libxl_ctx *ctx,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index c8ea3ba..54ad16a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3127,9 +3127,15 @@ struct libxl__stream_read_state {
     void (*completion_callback)(libxl__egc *egc,
                                 libxl__stream_read_state *srs,
                                 int rc);
-    void (*checkpoint_callback)(libxl__egc *egc,
-                                libxl__stream_read_state *srs,
-                                int rc);
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        void (*checkpoint_callback)(libxl__egc *egc,
+                                    libxl__stream_read_state *srs,
+                                    int rc);
+        void (*postcopy_transition_callback)(libxl__egc *egc,
+                                             libxl__stream_read_state *srs,
+                                             int rc);
+    };
     /* Private */
     int rc;
     bool running;
@@ -3143,10 +3149,12 @@ struct libxl__stream_read_state {
     LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; /* NOGC */
     enum {
         SRS_PHASE_NORMAL,
+        SRS_PHASE_POSTCOPY_TRANSITION,
         SRS_PHASE_CHECKPOINT_BUFFERING,
         SRS_PHASE_CHECKPOINT_UNBUFFERING,
         SRS_PHASE_CHECKPOINT_STATE
     } phase;
+    bool postcopy_transitioned;
     bool recursion_guard;
 
     /* Only used while actively reading a record from the stream. */
@@ -3160,6 +3168,9 @@ struct libxl__stream_read_state {
 _hidden void libxl__stream_read_init(libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_start(libxl__egc *egc,
                                       libxl__stream_read_state *stream);
+_hidden void libxl__stream_read_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                                  libxl__stream_read_state *stream);
 _hidden void libxl__stream_read_checkpoint_state(libxl__egc *egc,
@@ -3709,8 +3720,36 @@ struct libxl__domain_create_state {
     int restore_fd, libxc_fd;
     int restore_fdfl; /* original flags of restore_fd */
     int send_back_fd;
+    bool *postcopy_resumed;
     libxl_domain_restore_params restore_params;
     uint32_t domid_soft_reset;
+    struct {
+        /*
+         * Is a postcopy resumption in progress? (i.e. does the rest of this
+         * state have any meaning?)
+         */
+        bool active;
+
+        struct {
+            enum {
+                DCS_POSTCOPY_RESUME_INPROGRESS,
+                DCS_POSTCOPY_RESUME_FAILED,
+                DCS_POSTCOPY_RESUME_SUCCESS
+            } state;
+
+            int rc;
+        } resume;
+
+        struct {
+            enum {
+                DCS_POSTCOPY_STREAM_INPROGRESS,
+                DCS_POSTCOPY_STREAM_FAILED,
+                DCS_POSTCOPY_STREAM_SUCCESS
+            } state;
+
+            int rc;
+        } stream;
+    } postcopy;
     libxl__domain_create_cb *callback;
     libxl_asyncprogress_how aop_console_how;
     /* private to domain_create */
diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 4cb553e..8e9b720 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -35,6 +35,7 @@
  * Undefined    undef  undef                    undef    undef
  * Idle         false  undef                    0        0
  * Active       true   NORMAL                   0/1      0/partial
+ * Active       true   POSTCOPY_TRANSITION      0/1      0/partial
  * Active       true   CHECKPOINT_BUFFERING     any      0/partial
  * Active       true   CHECKPOINT_UNBUFFERING   any      0
  * Active       true   CHECKPOINT_STATE         0/1      0/partial
@@ -133,6 +134,8 @@
 /* Success/error/cleanup handling. */
 static void stream_complete(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc);
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_read_state *stream, int rc);
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc);
 static void stream_done(libxl__egc *egc,
@@ -222,6 +225,7 @@ void libxl__stream_read_init(libxl__stream_read_state *stream)
     FILLZERO(stream->hdr);
     LIBXL_STAILQ_INIT(&stream->record_queue);
     stream->phase = SRS_PHASE_NORMAL;
+    stream->postcopy_transitioned = false;
     stream->recursion_guard = false;
     stream->incoming_record = NULL;
     FILLZERO(stream->emu_dc);
@@ -299,6 +303,26 @@ void libxl__stream_read_start(libxl__egc *egc,
     stream_complete(egc, stream, rc);
 }
 
+void libxl__stream_read_start_postcopy_transition(
+    libxl__egc *egc,
+    libxl__stream_read_state *stream)
+{
+    int checkpointed_stream = stream->dcs->restore_params.checkpointed_stream;
+
+    assert(stream->running);
+    assert(checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE);
+    assert(stream->phase == SRS_PHASE_NORMAL);
+    assert(!stream->postcopy_transitioned);
+
+    stream->phase = SRS_PHASE_POSTCOPY_TRANSITION;
+
+    /*
+     * Libxc has handed control of the fd to us.  Start reading some
+     * libxl records out of it.
+     */
+    stream_continue(egc, stream);
+}
+
 void libxl__stream_read_start_checkpoint(libxl__egc *egc,
                                          libxl__stream_read_state *stream)
 {
@@ -397,6 +421,7 @@ static void stream_continue(libxl__egc *egc,
 
     switch (stream->phase) {
     case SRS_PHASE_NORMAL:
+    case SRS_PHASE_POSTCOPY_TRANSITION:
     case SRS_PHASE_CHECKPOINT_STATE:
         /*
          * Normal phase (regular migration or restore from file):
@@ -576,6 +601,13 @@ static bool process_record(libxl__egc *egc,
 
     LOG(DEBUG, "Record: %u, length %u", rec->hdr.type, rec->hdr.length);
 
+    if (stream->postcopy_transitioned &&
+        rec->hdr.type != REC_TYPE_END) {
+        rc = ERROR_FAIL;
+        LOG(ERROR, "Received non-end record after postcopy transition");
+        goto err;
+    }
+
     switch (rec->hdr.type) {
 
     case REC_TYPE_END:
@@ -627,6 +659,15 @@ static bool process_record(libxl__egc *egc,
         write_emulator_blob(egc, stream, rec);
         break;
 
+    case REC_TYPE_POSTCOPY_TRANSITION_END:
+        if (stream->phase != SRS_PHASE_POSTCOPY_TRANSITION) {
+            LOG(ERROR, "Unexpected POSTCOPY_TRANSITION_END record in stream");
+            rc = ERROR_FAIL;
+            goto err;
+        }
+        postcopy_transition_done(egc, stream, 0);
+        break;
+
     case REC_TYPE_CHECKPOINT_END:
         if (!stream_in_checkpoint(stream)) {
             LOG(ERROR, "Unexpected CHECKPOINT_END record in stream");
@@ -761,6 +802,13 @@ static void stream_complete(libxl__egc *egc,
          */
         checkpoint_done(egc, stream, rc);
         break;
+    case SRS_PHASE_POSTCOPY_TRANSITION:
+        assert(rc);
+
+        /*
+         * To deal with errors during the postcopy transition, we use the same
+         * strategy as during checkpoints.
+         */
     case SRS_PHASE_CHECKPOINT_STATE:
         assert(rc);
 
@@ -777,6 +825,15 @@ static void stream_complete(libxl__egc *egc,
     }
 }
 
+static void postcopy_transition_done(libxl__egc *egc,
+                                     libxl__stream_read_state *stream, int rc)
+{
+    assert(stream->phase == SRS_PHASE_POSTCOPY_TRANSITION);
+    stream->postcopy_transitioned = true;
+    stream->phase = SRS_PHASE_NORMAL;
+    stream->postcopy_transition_callback(egc, stream, rc);
+}
+
 static void checkpoint_done(libxl__egc *egc,
                             libxl__stream_read_state *stream, int rc)
 {
diff --git a/tools/ocaml/libs/xl/xenlight_stubs.c b/tools/ocaml/libs/xl/xenlight_stubs.c
index 98b52b9..3ef5a1e 100644
--- a/tools/ocaml/libs/xl/xenlight_stubs.c
+++ b/tools/ocaml/libs/xl/xenlight_stubs.c
@@ -538,7 +538,7 @@ value stub_libxl_domain_create_restore(value ctx, value domain_config, value par
 
 	caml_enter_blocking_section();
 	ret = libxl_domain_create_restore(CTX, &c_dconfig, &c_domid, restore_fd,
-		-1, &c_params, ao_how, NULL);
+		-1, NULL, &c_params, ao_how, NULL);
 	caml_leave_blocking_section();
 
 	free(ao_how);
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 89c2b25..47ba9f3 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -882,7 +882,7 @@ start:
 
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
-                                          send_back_fd, &params,
+                                          send_back_fd, NULL, &params,
                                           0, autoconnect_console_how);
 
         libxl_domain_restore_params_dispose(&params);
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH RFC v2 20/23] tools: expose postcopy live migration support in libxl and xl
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (18 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 19/23] libxl/migration: implement the receiver " Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 21/23] xen/mem_paging: move paging op arguments into a union Joshua Otto
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

- Add a 'memory_strategy' parameter to libxl_domain_live_migrate(),
  which specifies how the remainder of the memory migration should be
  approached after the iterative precopy phase is completed.
- Plug this parameter into the libxl migration precopy policy
  implementation.
- Add --postcopy to xl migrate, and skip the xl-level handshaking at
  both sides when postcopy migration occurs.
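
For illustration, the precopy policy decision that the new parameter feeds
into can be restated as a stand-alone function.  This is a sketch only: the
thresholds and the postcopy/stop-and-copy split mirror the patch below, but
the names here are redefined so the snippet compiles on its own and are not
libxl identifiers.

```c
#include <assert.h>

/* Illustrative stand-ins for the LIBXL_LM_MEMORY_* values in the patch. */
#define LM_MEMORY_STOP_AND_COPY 0
#define LM_MEMORY_POSTCOPY      1

enum policy_decision {
    POLICY_CONTINUE_PRECOPY,
    POLICY_STOP_AND_COPY,
    POLICY_POSTCOPY
};

/*
 * Decide what to do after a precopy iteration: keep copying while the
 * guest is still dirtying many pages, otherwise terminate the precopy
 * phase using the caller-selected memory strategy.
 */
static enum policy_decision precopy_policy(long dirty_count, int iteration,
                                           int memory_strategy)
{
    if ((dirty_count >= 0 && dirty_count <= 50) || iteration >= 5)
        return (memory_strategy == LM_MEMORY_POSTCOPY)
            ? POLICY_POSTCOPY
            : POLICY_STOP_AND_COPY;

    return POLICY_CONTINUE_PRECOPY;
}
```

The only behavioural addition relative to the old policy is the final
branch on memory_strategy; callers that never pass the postcopy strategy
see the original stop-and-copy behaviour.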

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxl/libxl.h          |  5 ++++
 tools/libxl/libxl_dom_save.c | 17 ++++++++----
 tools/libxl/libxl_domain.c   |  8 ++++--
 tools/libxl/libxl_internal.h |  1 +
 tools/xl/xl.h                |  7 ++++-
 tools/xl/xl_cmdtable.c       |  3 ++
 tools/xl/xl_migrate.c        | 65 ++++++++++++++++++++++++++++++++++++++++----
 tools/xl/xl_vmcontrol.c      |  8 ++++--
 8 files changed, 97 insertions(+), 17 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 70441cf..b569734 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1413,9 +1413,14 @@ int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
                               int flags, /* LIBXL_SUSPEND_* */
                               int recv_fd,
                               bool *postcopy_transitioned, /* OUT */
+                              int memory_strategy,
                               const libxl_asyncop_how *ao_how)
                               LIBXL_EXTERNAL_CALLERS_ONLY;
 
+#define LIBXL_LM_MEMORY_STOP_AND_COPY 0
+#define LIBXL_LM_MEMORY_POSTCOPY 1
+#define LIBXL_LM_MEMORY_DEFAULT LIBXL_LM_MEMORY_STOP_AND_COPY
+
 /* @param suspend_cancel [from xenctrl.h:xc_domain_resume( @param fast )]
  *   If this parameter is true, use co-operative resume. The guest
  *   must support this.
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index 75ab523..c54f728 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -338,14 +338,19 @@ int libxl__save_emulator_xenstore_data(libxl__domain_save_state *dss,
  * the live migration when there are either fewer than 50 dirty pages, or more
  * than 5 precopy rounds have completed.
  */
-static int libxl__save_live_migration_precopy_policy(
-    struct precopy_stats stats, void *user)
+static int libxl__save_live_migration_precopy_policy(struct precopy_stats stats,
+                                                     void *user)
 {
-    if (stats.dirty_count >= 0 && stats.dirty_count < 50)
-        return XGS_POLICY_STOP_AND_COPY;
+    libxl__save_helper_state *shs = user;
+    libxl__domain_save_state *dss = shs->caller_state;
 
-    if (stats.iteration >= 5)
-        return XGS_POLICY_STOP_AND_COPY;
+    if ((stats.dirty_count >= 0 &&
+         stats.dirty_count <= 50) ||
+        (stats.iteration >= 5)) {
+        return (dss->memory_strategy == LIBXL_LM_MEMORY_POSTCOPY)
+            ? XGS_POLICY_POSTCOPY
+            : XGS_POLICY_STOP_AND_COPY;
+    }
 
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
diff --git a/tools/libxl/libxl_domain.c b/tools/libxl/libxl_domain.c
index fc37f47..e211b88 100644
--- a/tools/libxl/libxl_domain.c
+++ b/tools/libxl/libxl_domain.c
@@ -488,6 +488,7 @@ static void domain_suspend_cb(libxl__egc *egc,
 
 static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
                              int recv_fd, bool *postcopy_transitioned,
+                             int memory_strategy,
                              const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
@@ -509,6 +510,7 @@ static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
     dss->fd = fd;
     dss->recv_fd = recv_fd;
     dss->postcopy_transitioned = postcopy_transitioned;
+    dss->memory_strategy = memory_strategy;
     dss->type = type;
     dss->live = flags & LIBXL_SUSPEND_LIVE;
     dss->debug = flags & LIBXL_SUSPEND_DEBUG;
@@ -529,12 +531,14 @@ static int do_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
 int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
                          const libxl_asyncop_how *ao_how)
 {
-    return do_domain_suspend(ctx, domid, fd, flags, -1, NULL, ao_how);
+    return do_domain_suspend(ctx, domid, fd, flags, -1, NULL,
+                             LIBXL_LM_MEMORY_DEFAULT, ao_how);
 }
 
 int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
                               int flags, int recv_fd,
                               bool *postcopy_transitioned,
+                              int memory_strategy,
                               const libxl_asyncop_how *ao_how)
 {
     if (!postcopy_transitioned) {
@@ -545,7 +549,7 @@ int libxl_domain_live_migrate(libxl_ctx *ctx, uint32_t domid, int send_fd,
     flags |= LIBXL_SUSPEND_LIVE;
 
     return do_domain_suspend(ctx, domid, send_fd, flags, recv_fd,
-                             postcopy_transitioned, ao_how);
+                             postcopy_transitioned, memory_strategy, ao_how);
 }
 
 int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid)
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 54ad16a..5c4f139 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3324,6 +3324,7 @@ struct libxl__domain_save_state {
     int fdfl; /* original flags on fd */
     int recv_fd;
     bool *postcopy_transitioned;
+    int memory_strategy;
     libxl_domain_type type;
     int live;
     int debug;
diff --git a/tools/xl/xl.h b/tools/xl/xl.h
index aa95b77..279c716 100644
--- a/tools/xl/xl.h
+++ b/tools/xl/xl.h
@@ -48,6 +48,7 @@ struct domain_create {
     bool userspace_colo_proxy;
     int migrate_fd; /* -1 means none */
     int send_back_fd; /* -1 means none */
+    bool *postcopy_resumed;
     char **migration_domname_r; /* from malloc */
 };
 
@@ -66,7 +67,6 @@ static const char migrate_permission_to_go[]=
     "domain is yours, you are cleared to unpause";
 static const char migrate_report[]=
     "my copy unpause results are as follows";
-#endif
 
   /* followed by one byte:
    *     0: everything went well, domain is running
@@ -76,6 +76,11 @@ static const char migrate_report[]=
    *            from target to source
    */
 
+static const char migrate_postcopy_sync[]=
+    "postcopy migration completed successfully";
+
+#endif
+
 #define XL_MANDATORY_FLAG_JSON (1U << 0) /* config data is in JSON format */
 #define XL_MANDATORY_FLAG_STREAMv2 (1U << 1) /* stream is v2 */
 #define XL_MANDATORY_FLAG_ALL  (XL_MANDATORY_FLAG_JSON |        \
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 30eb93c..9e7ec83 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -166,6 +166,9 @@ struct cmd_spec cmd_table[] = {
       "                of the domain.\n"
       "--debug         Print huge (!) amount of debug during the migration process.\n"
       "-p              Do not unpause domain after migrating it."
+      "--postcopy      At the end of the iterative precopy phase, transition to a\n"
+      "                postcopy memory migration rather than performing a stop-and-copy\n"
+      "                migration of the outstanding dirty pages.\n"
     },
     { "restore",
       &main_restore, 0, 1,
diff --git a/tools/xl/xl_migrate.c b/tools/xl/xl_migrate.c
index 9656204..80d7321 100644
--- a/tools/xl/xl_migrate.c
+++ b/tools/xl/xl_migrate.c
@@ -177,7 +177,8 @@ static void migrate_do_preamble(int send_fd, int recv_fd, pid_t child,
 }
 
 static void migrate_domain(uint32_t domid, const char *rune, int debug,
-                           const char *override_config_file)
+                           const char *override_config_file,
+                           int memory_strategy)
 {
     pid_t child = -1;
     int rc;
@@ -207,18 +208,34 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
     if (debug)
         flags |= LIBXL_SUSPEND_DEBUG;
     rc = libxl_domain_live_migrate(ctx, domid, send_fd, flags,
-                                   recv_fd, &postcopy_transitioned, NULL);
-    assert(!postcopy_transitioned);
-
+                                   recv_fd, &postcopy_transitioned,
+                                   memory_strategy, NULL);
     if (rc) {
         fprintf(stderr, "migration sender: libxl_domain_suspend failed"
                 " (rc=%d)\n", rc);
-        if (rc == ERROR_GUEST_TIMEDOUT)
+        if (postcopy_transitioned)
+            goto failed_postcopy;
+        else if (rc == ERROR_GUEST_TIMEDOUT)
             goto failed_suspend;
         else
             goto failed_resume;
     }
 
+    /*
+     * No need for additional ceremony if we already resumed the guest as part
+     * of a postcopy live migration.
+     */
+    if (postcopy_transitioned) {
+        /* It doesn't matter if something happens to the pipe after we get to
+         * this point - we only bother to synchronize here for tidiness. */
+        migrate_read_fixedmessage(recv_fd, migrate_postcopy_sync,
+                                  sizeof(migrate_postcopy_sync),
+                                  "postcopy sync", rune);
+        libxl_domain_destroy(ctx, domid, 0);
+        fprintf(stderr, "Migration successful.\n");
+        exit(EXIT_SUCCESS);
+    }
+
     //fprintf(stderr, "migration sender: Transfer complete.\n");
     // Should only be printed when debugging as it's a bit messy with
     // progress indication.
@@ -317,6 +334,21 @@ static void migrate_domain(uint32_t domid, const char *rune, int debug,
     close(send_fd);
     migration_child_report(recv_fd);
     exit(EXIT_FAILURE);
+
+ failed_postcopy:
+    if (common_domname) {
+        xasprintf(&away_domname, "%s--postcopy-inconsistent", common_domname);
+        libxl_domain_rename(ctx, domid, common_domname, away_domname);
+    }
+
+    fprintf(stderr,
+ "** Migration failed during memory postcopy **\n"
+ "It's possible that the guest has executed/is executing at the destination,\n"
+ " so resuming it here now may be unsafe.\n");
+
+    close(send_fd);
+    migration_child_report(recv_fd);
+    exit(EXIT_FAILURE);
 }
 
 static void migrate_receive(int debug, int daemonize, int monitor,
@@ -330,6 +362,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     int rc, rc2;
     char rc_buf;
     char *migration_domname;
+    bool postcopy_resumed;
     struct domain_create dom_info;
 
     signal(SIGPIPE, SIG_IGN);
@@ -349,6 +382,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.paused = 1;
     dom_info.migrate_fd = recv_fd;
     dom_info.send_back_fd = send_fd;
+    dom_info.postcopy_resumed = &postcopy_resumed;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = checkpointed;
     dom_info.colo_proxy_script = colo_proxy_script;
@@ -411,6 +445,20 @@ static void migrate_receive(int debug, int daemonize, int monitor,
         break;
     }
 
+    /*
+     * No need for additional ceremony if we already resumed the guest as part
+     * of a postcopy live migration.
+     */
+    if (postcopy_resumed) {
+        libxl_write_exactly(ctx, send_fd, migrate_postcopy_sync,
+                            sizeof(migrate_postcopy_sync),
+                            "migration ack stream", "postcopy sync");
+        fprintf(stderr, "migration target: Domain started successsfully.\n");
+        libxl_domain_rename(ctx, domid, migration_domname, common_domname);
+        exit(EXIT_SUCCESS);
+    }
+
+
     fprintf(stderr, "migration target: Transfer complete,"
             " requesting permission to start domain.\n");
 
@@ -541,9 +589,11 @@ int main_migrate(int argc, char **argv)
     char *rune = NULL;
     char *host;
     int opt, daemonize = 1, monitor = 1, debug = 0, pause_after_migration = 0;
+    int memory_strategy = LIBXL_LM_MEMORY_DEFAULT;
     static struct option opts[] = {
         {"debug", 0, 0, 0x100},
         {"live", 0, 0, 0x200},
+        {"postcopy", 0, 0, 0x400},
         COMMON_LONG_OPTS
     };
 
@@ -570,6 +620,9 @@ int main_migrate(int argc, char **argv)
     case 0x200: /* --live */
         /* ignored for compatibility with xm */
         break;
+    case 0x400: /* --postcopy */
+        memory_strategy = LIBXL_LM_MEMORY_POSTCOPY;
+        break;
     }
 
     domid = find_domain(argv[optind]);
@@ -600,7 +653,7 @@ int main_migrate(int argc, char **argv)
                   pause_after_migration ? " -p" : "");
     }
 
-    migrate_domain(domid, rune, debug, config_filename);
+    migrate_domain(domid, rune, debug, config_filename, memory_strategy);
     return EXIT_SUCCESS;
 }
 
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index 47ba9f3..62e09c1 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -655,6 +655,7 @@ int create_domain(struct domain_create *dom_info)
     const char *config_source = NULL;
     const char *restore_source = NULL;
     int migrate_fd = dom_info->migrate_fd;
+    bool *postcopy_resumed = dom_info->postcopy_resumed;
     bool config_in_json;
 
     int i;
@@ -675,6 +676,9 @@ int create_domain(struct domain_create *dom_info)
 
     int restoring = (restore_file || (migrate_fd >= 0));
 
+    if (postcopy_resumed)
+        *postcopy_resumed = false;
+
     libxl_domain_config_init(&d_config);
 
     if (restoring) {
@@ -882,8 +886,8 @@ start:
 
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
-                                          send_back_fd, NULL, &params,
-                                          0, autoconnect_console_how);
+                                          send_back_fd, postcopy_resumed,
+                                          &params, 0, autoconnect_console_how);
 
         libxl_domain_restore_params_dispose(&params);
 
-- 
2.7.4




* [PATCH RFC v2 21/23] xen/mem_paging: move paging op arguments into a union
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (19 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 20/23] tools: expose postcopy live migration support in libxl and xl Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 22/23] xen/mem_paging: add a populate_evicted paging op Joshua Otto
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

In preparation for the addition of a mem paging op with different
arguments than the existing ops, move the op-specific arguments into a
union.

No functional change.
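
As a sketch of the resulting layout (field types approximated here with
fixed-width integers; the authoritative definition lives in
xen/include/public/memory.h), the point of the union is that a later
batch-oriented arm, like the one patch 22 adds, can reuse the same storage
without growing the ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative restatement of the reworked argument struct; uint64_aligned_t
 * is approximated as uint64_t and the guest handle as a plain uint64_t. */
struct mem_paging_op_sketch {
    uint8_t  op;       /* XENMEM_paging_op_* */
    uint16_t domain;   /* domid_t */
    union {
        struct {       /* nominate/evict/prep: one gfn at a time */
            uint64_t buffer;
            uint64_t gfn;
        } single;
        struct {       /* hypothetical batch arm, as added later in the series */
            uint64_t gfns;  /* guest handle to an array of gfns */
            uint32_t nr;
        } batch;
    } u;
};
```

Both arms occupy the same 16 bytes, so existing callers are unaffected and
the struct size is unchanged by the addition of the batch variant.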

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_mem_paging.c  |  8 ++++----
 xen/arch/x86/mm/mem_paging.c |  6 +++---
 xen/include/public/memory.h  | 12 ++++++++----
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/tools/libxc/xc_mem_paging.c b/tools/libxc/xc_mem_paging.c
index 28611f4..f314b08 100644
--- a/tools/libxc/xc_mem_paging.c
+++ b/tools/libxc/xc_mem_paging.c
@@ -29,10 +29,10 @@ static int xc_mem_paging_memop(xc_interface *xch, domid_t domain_id,
 
     memset(&mpo, 0, sizeof(mpo));
 
-    mpo.op      = op;
-    mpo.domain  = domain_id;
-    mpo.gfn     = gfn;
-    mpo.buffer  = (unsigned long) buffer;
+    mpo.op               = op;
+    mpo.domain           = domain_id;
+    mpo.u.single.gfn     = gfn;
+    mpo.u.single.buffer  = (unsigned long) buffer;
 
     return do_memory_op(xch, XENMEM_paging_op, &mpo, sizeof(mpo));
 }
diff --git a/xen/arch/x86/mm/mem_paging.c b/xen/arch/x86/mm/mem_paging.c
index a049e0d..e23e26c 100644
--- a/xen/arch/x86/mm/mem_paging.c
+++ b/xen/arch/x86/mm/mem_paging.c
@@ -49,15 +49,15 @@ int mem_paging_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg)
     switch( mpo.op )
     {
     case XENMEM_paging_op_nominate:
-        rc = p2m_mem_paging_nominate(d, mpo.gfn);
+        rc = p2m_mem_paging_nominate(d, mpo.u.single.gfn);
         break;
 
     case XENMEM_paging_op_evict:
-        rc = p2m_mem_paging_evict(d, mpo.gfn);
+        rc = p2m_mem_paging_evict(d, mpo.u.single.gfn);
         break;
 
     case XENMEM_paging_op_prep:
-        rc = p2m_mem_paging_prep(d, mpo.gfn, mpo.buffer);
+        rc = p2m_mem_paging_prep(d, mpo.u.single.gfn, mpo.u.single.buffer);
         if ( !rc )
             copyback = 1;
         break;
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 6eee0c8..49ef162 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -394,10 +394,14 @@ struct xen_mem_paging_op {
     uint8_t     op;         /* XENMEM_paging_op_* */
     domid_t     domain;
 
-    /* PAGING_PREP IN: buffer to immediately fill page in */
-    uint64_aligned_t    buffer;
-    /* Other OPs */
-    uint64_aligned_t    gfn;           /* IN:  gfn of page being operated on */
+    union {
+        struct {
+            /* PAGING_PREP IN: buffer to immediately fill page in */
+            uint64_aligned_t    buffer;
+            /* Other OPs */
+            uint64_aligned_t    gfn;   /* IN:  gfn of page being operated on */
+        } single;
+    } u;
 };
 typedef struct xen_mem_paging_op xen_mem_paging_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_paging_op_t);
-- 
2.7.4




* [PATCH RFC v2 22/23] xen/mem_paging: add a populate_evicted paging op
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (20 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 21/23] xen/mem_paging: move paging op arguments into a union Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2018-06-17 10:18 ` [PATCH RFC v2 23/23] libxc/xc_sr_restore.c: use populate_evicted() Joshua Otto
  2021-06-02 11:20 ` [Xen-devel] [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Olaf Hering
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

The paging API presently permits only individual, populated pages to be
evicted, and even then only after a previous nomination op on the
candidate page.  This works well at steady-state, but is somewhat
awkward and inefficient for pagers attempting to implement startup
demand-paging for guests: in this case it is necessary to populate all
of the holes in the physmap to be demand-paged, only to then nominate
and immediately evict each page one-by-one.

To permit more efficient startup demand-paging, introduce a new
populate_evicted paging op.  Given a batch of gfns, it:
- marks gfns corresponding to physmap holes as paged-out directly
- frees the backing frames of previously-populated gfns, and then marks
  them as paged-out directly (skipping the nomination step)

The latter behaviour is needed to fully support postcopy live migration:
a page may be populated only to have its contents subsequently
invalidated by a write at the sender, requiring it to ultimately be
demand-paged anyway.

I measured a reduction in time required to evict a batch of 512k
previously-unpopulated pfns from 8.535s to 1.590s (~5.4x speedup).

Note: as a long-running batching memory op, populate_evicted takes
advantage of the existing pre-emption/continuation hack (encoding the
starting offset into the batch in the bits of the op argument above the
low 6 MEMOP_CMD bits).  To
make this work, plumb the cmd argument all the way down through
do_memory_op() -> arch_memory_op() -> subarch_memory_op() ->
mem_paging_memop(), fixing up each switch statement along the way to
use only the MEMOP_CMD bits.
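
The continuation encoding described above can be sketched stand-alone.
MEMOP_EXTENT_SHIFT and the mask derivation match Xen's public memory.h
convention; the XENMEM_paging_op value here is a placeholder, not the real
op number.

```c
#include <assert.h>

/* The low 6 bits of a memory_op cmd carry the op; the remaining high bits
 * are free to carry a resumption offset for preempted batch operations. */
#define MEMOP_EXTENT_SHIFT 6
#define MEMOP_CMD_MASK     ((1UL << MEMOP_EXTENT_SHIFT) - 1)
#define XENMEM_paging_op   1UL  /* placeholder op number for illustration */

/* Pack the batch offset into the bits above the command bits, as the
 * hypercall continuation does before re-issuing the memory_op. */
static unsigned long encode_continuation(unsigned long op,
                                         unsigned long start_gfn)
{
    return op | (start_gfn << MEMOP_EXTENT_SHIFT);
}

/* What each switch statement along the plumbed path must do: look only at
 * the command bits, ignoring any continuation offset. */
static unsigned long decode_op(unsigned long cmd)
{
    return cmd & MEMOP_CMD_MASK;
}

/* What mem_paging_memop() does on re-entry: recover the batch offset. */
static unsigned long decode_start(unsigned long cmd)
{
    return cmd >> MEMOP_EXTENT_SHIFT;
}
```

This is why the patch changes each dispatcher from switching on cmd to
switching on cmd & MEMOP_CMD_MASK: after a continuation, cmd is no longer
numerically equal to the op.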

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/include/xenctrl.h    |   2 +
 tools/libxc/xc_mem_paging.c      |  31 ++++++++++++
 xen/arch/x86/mm.c                |   5 +-
 xen/arch/x86/mm/mem_paging.c     |  34 ++++++++++++-
 xen/arch/x86/mm/p2m.c            | 101 +++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/x86_64/compat/mm.c  |   6 ++-
 xen/arch/x86/x86_64/mm.c         |   6 ++-
 xen/include/asm-x86/mem_paging.h |   3 +-
 xen/include/asm-x86/p2m.h        |   2 +
 xen/include/public/memory.h      |  13 +++--
 10 files changed, 190 insertions(+), 13 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 1629f41..22992b9 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1945,6 +1945,8 @@ int xc_mem_paging_resume(xc_interface *xch, domid_t domain_id);
 int xc_mem_paging_nominate(xc_interface *xch, domid_t domain_id,
                            uint64_t gfn);
 int xc_mem_paging_evict(xc_interface *xch, domid_t domain_id, uint64_t gfn);
+int xc_mem_paging_populate_evicted(xc_interface *xch, domid_t domain_id,
+                                   xen_pfn_t *gfns, uint32_t nr);
 int xc_mem_paging_prep(xc_interface *xch, domid_t domain_id, uint64_t gfn);
 int xc_mem_paging_load(xc_interface *xch, domid_t domain_id,
                        uint64_t gfn, void *buffer);
diff --git a/tools/libxc/xc_mem_paging.c b/tools/libxc/xc_mem_paging.c
index f314b08..b0416b6 100644
--- a/tools/libxc/xc_mem_paging.c
+++ b/tools/libxc/xc_mem_paging.c
@@ -116,6 +116,37 @@ int xc_mem_paging_load(xc_interface *xch, domid_t domain_id,
     return rc;
 }
 
+int xc_mem_paging_populate_evicted(xc_interface *xch,
+                                   domid_t domain_id,
+                                   xen_pfn_t *gfns,
+                                   uint32_t nr)
+{
+    DECLARE_HYPERCALL_BOUNCE(gfns, nr * sizeof(*gfns),
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+    int rc;
+
+    xen_mem_paging_op_t mpo =
+    {
+        .op       = XENMEM_paging_op_populate_evicted,
+        .domain   = domain_id,
+        .u        = { .batch = { .nr = nr } }
+    };
+
+    if ( xc_hypercall_bounce_pre(xch, gfns) )
+    {
+        PERROR("Could not bounce memory for XENMEM_paging_op_populate_evicted");
+        return -1;
+    }
+
+    set_xen_guest_handle(mpo.u.batch.gfns, gfns);
+
+    rc = do_memory_op(xch, XENMEM_paging_op, &mpo, sizeof(mpo));
+
+    xc_hypercall_bounce_post(xch, gfns);
+
+    return rc;
+}
+
 
 /*
  * Local variables:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 77b0af1..bc41bde 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4955,9 +4955,10 @@ int xenmem_add_to_physmap_one(
 
 long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
-    int rc;
+    long rc;
+    int op = cmd & MEMOP_CMD_MASK;
 
-    switch ( cmd )
+    switch ( op )
     {
     case XENMEM_set_memory_map:
     {
diff --git a/xen/arch/x86/mm/mem_paging.c b/xen/arch/x86/mm/mem_paging.c
index e23e26c..8f62f58 100644
--- a/xen/arch/x86/mm/mem_paging.c
+++ b/xen/arch/x86/mm/mem_paging.c
@@ -21,12 +21,17 @@
 
 
 #include <asm/p2m.h>
+#include <xen/event.h>
 #include <xen/guest_access.h>
+#include <xen/hypercall.h>
 #include <xsm/xsm.h>
 
-int mem_paging_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg)
+long mem_paging_memop(unsigned long cmd,
+                      XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg)
 {
-    int rc;
+    long rc;
+    unsigned long start_gfn = cmd >> MEMOP_EXTENT_SHIFT;
+    xen_pfn_t gfn;
     xen_mem_paging_op_t mpo;
     struct domain *d;
     bool_t copyback = 0;
@@ -56,6 +61,31 @@ int mem_paging_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg)
         rc = p2m_mem_paging_evict(d, mpo.u.single.gfn);
         break;
 
+    case XENMEM_paging_op_populate_evicted:
+        while ( start_gfn < mpo.u.batch.nr )
+        {
+            if ( copy_from_guest_offset(&gfn, mpo.u.batch.gfns, start_gfn, 1) )
+            {
+                rc = -EFAULT;
+                goto out;
+            }
+
+            rc = p2m_mem_paging_populate_evicted(d, gfn);
+            if ( rc )
+                goto out;
+
+            if ( mpo.u.batch.nr > ++start_gfn && hypercall_preempt_check() )
+            {
+                cmd = XENMEM_paging_op | (start_gfn << MEMOP_EXTENT_SHIFT);
+                rc = hypercall_create_continuation(__HYPERVISOR_memory_op, "lh",
+                                                   cmd, arg);
+                goto out;
+            }
+        }
+
+        rc = 0;
+        break;
+
     case XENMEM_paging_op_prep:
         rc = p2m_mem_paging_prep(d, mpo.u.single.gfn, mpo.u.single.buffer);
         if ( !rc )
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 9eb6dc8..2ad46f6 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1449,6 +1449,107 @@ int p2m_mem_paging_evict(struct domain *d, unsigned long gfn)
 }
 
 /**
+ * p2m_mem_paging_populate_evicted - 'populate' a guest page as paged-out
+ * @d: guest domain
+ * @gfn: guest page to populate
+ *
+ * Returns 0 for success, or a negative errno value if the page cannot be
+ * marked as evicted.
+ *
+ * p2m_mem_paging_populate_evicted() is most commonly called by a pager
+ * during guest restoration to mark a page as evicted so that the guest can be
+ * resumed before memory restoration is complete.
+ *
+ * Ideally, the page has never previously been populated, and it is only
+ * necessary to mark the existing hole in the physmap as an evicted page.
+ * However, to accommodate the common live migration scenario in which a page is
+ * populated but subsequently has its contents invalidated by a write at the
+ * sender, permit @gfn to have already been populated and free its current
+ * backing frame if so.
+ */
+int p2m_mem_paging_populate_evicted(struct domain *d, unsigned long gfn)
+{
+    struct page_info *page = NULL;
+    p2m_type_t p2mt;
+    p2m_access_t a;
+    mfn_t mfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int rc = -EBUSY;
+
+    gfn_lock(p2m, gfn, 0);
+
+    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
+
+    if ( mfn_valid(mfn) )
+    {
+        /*
+         * This is the first case we know how to deal with: the page has
+         * previously been populated, but the caller wants it in the evicted
+         * state anyway (e.g. because it was dirtied during live migration and
+         * is now being postcopy migrated).
+         *
+         * Double-check that it's pageable according to the union of the
+         * normal nominate() and evict() criteria, and free its backing page if
+         * so.
+         */
+
+        if ( !p2m_is_pageable(p2mt) )
+            goto out;
+
+        page = mfn_to_page(mfn);
+        if ( !get_page(page, d) )
+            goto out;
+
+        if ( is_iomem_page(mfn) )
+            goto err_put;
+
+        if ( (page->count_info & (PGC_count_mask | PGC_allocated)) !=
+             (2 | PGC_allocated) )
+            goto err_put;
+
+        if ( (page->u.inuse.type_info & PGT_count_mask) != 0 )
+            goto err_put;
+
+        /* Decrement guest domain's ref count of the page. */
+        if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
+            put_page(page);
+
+        /* Clear content before returning the page to Xen. */
+        scrub_one_page(page);
+
+        /* Finally, drop the ref _we_ took on the page, freeing it fully. */
+        put_page(page);
+    }
+    else if ( p2m_is_hole(p2mt) && !p2m_is_paging(p2mt) )
+    {
+        /*
+         * This is the second case we know how to deal with: the pfn isn't
+         * currently populated, and can transition directly to paged_out.  All
+         * we need to do is adjust its p2m entry, which we share with the first
+         * case, so there's nothing further to do along this branch.
+         */
+    }
+    else
+    {
+        /* We can't handle this - error out. */
+        goto out;
+    }
+
+    rc = p2m_set_entry(p2m, gfn, INVALID_MFN, PAGE_ORDER_4K, p2m_ram_paged, a);
+    if ( !rc )
+        atomic_inc(&d->paged_pages);
+
+    /* Hop over the inapplicable put_page(). */
+    goto out;
+
+ err_put:
+    put_page(page);
+
+ out:
+    gfn_unlock(p2m, gfn, 0);
+    return rc;
+}
+
+/**
  * p2m_mem_paging_drop_page - Tell pager to drop its reference to a paged page
  * @d: guest domain
  * @gfn: guest page to drop
diff --git a/xen/arch/x86/x86_64/compat/mm.c b/xen/arch/x86/x86_64/compat/mm.c
index b737af1..f4aff90 100644
--- a/xen/arch/x86/x86_64/compat/mm.c
+++ b/xen/arch/x86/x86_64/compat/mm.c
@@ -53,8 +53,9 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     compat_pfn_t mfn;
     unsigned int i;
     int rc = 0;
+    int op = cmd & MEMOP_CMD_MASK;
 
-    switch ( cmd )
+    switch ( op )
     {
     case XENMEM_set_memory_map:
     {
@@ -187,7 +188,8 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         return mem_sharing_get_nr_shared_mfns();
 
     case XENMEM_paging_op:
-        return mem_paging_memop(guest_handle_cast(arg, xen_mem_paging_op_t));
+        return mem_paging_memop(cmd,
+                                guest_handle_cast(arg, xen_mem_paging_op_t));
 
     case XENMEM_sharing_op:
         return mem_sharing_memop(guest_handle_cast(arg, xen_mem_sharing_op_t));
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index aa1b94f..7394d92 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -926,8 +926,9 @@ long subarch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     xen_pfn_t mfn, last_mfn;
     unsigned int i;
     long rc = 0;
+    int op = cmd & MEMOP_CMD_MASK;
 
-    switch ( cmd )
+    switch ( op )
     {
     case XENMEM_machphys_mfn_list:
         if ( copy_from_guest(&xmml, arg, 1) )
@@ -1004,7 +1005,8 @@ long subarch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         return mem_sharing_get_nr_shared_mfns();
 
     case XENMEM_paging_op:
-        return mem_paging_memop(guest_handle_cast(arg, xen_mem_paging_op_t));
+        return mem_paging_memop(cmd,
+                                guest_handle_cast(arg, xen_mem_paging_op_t));
 
     case XENMEM_sharing_op:
         return mem_sharing_memop(guest_handle_cast(arg, xen_mem_sharing_op_t));
diff --git a/xen/include/asm-x86/mem_paging.h b/xen/include/asm-x86/mem_paging.h
index 176acaf..7b9a4f6 100644
--- a/xen/include/asm-x86/mem_paging.h
+++ b/xen/include/asm-x86/mem_paging.h
@@ -22,7 +22,8 @@
 #ifndef __ASM_X86_MEM_PAGING_H__
 #define __ASM_X86_MEM_PAGING_H__
 
-int mem_paging_memop(XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg);
+long mem_paging_memop(unsigned long cmd,
+                      XEN_GUEST_HANDLE_PARAM(xen_mem_paging_op_t) arg);
 
 #endif /*__ASM_X86_MEM_PAGING_H__ */
 
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 408f7da..653d413 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -676,6 +676,8 @@ int set_shared_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
 int p2m_mem_paging_nominate(struct domain *d, unsigned long gfn);
 /* Evict a frame */
 int p2m_mem_paging_evict(struct domain *d, unsigned long gfn);
+/* If @gfn is populated, evict it.  If not, mark it as paged-out directly. */
+int p2m_mem_paging_populate_evicted(struct domain *d, unsigned long gfn);
 /* Tell xenpaging to drop a paged out frame */
 void p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn, 
                                 p2m_type_t p2mt);
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 49ef162..5196803 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -385,10 +385,11 @@ typedef struct xen_pod_target xen_pod_target_t;
 #define XENMEM_get_sharing_freed_pages    18
 #define XENMEM_get_sharing_shared_pages   19
 
-#define XENMEM_paging_op                    20
-#define XENMEM_paging_op_nominate           0
-#define XENMEM_paging_op_evict              1
-#define XENMEM_paging_op_prep               2
+#define XENMEM_paging_op                     20
+#define XENMEM_paging_op_nominate            0
+#define XENMEM_paging_op_evict               1
+#define XENMEM_paging_op_prep                2
+#define XENMEM_paging_op_populate_evicted    3
 
 struct xen_mem_paging_op {
     uint8_t     op;         /* XENMEM_paging_op_* */
@@ -401,6 +402,10 @@ struct xen_mem_paging_op {
             /* Other OPs */
             uint64_aligned_t    gfn;   /* IN:  gfn of page being operated on */
         } single;
+        struct {
+            XEN_GUEST_HANDLE(xen_pfn_t) gfns;
+            uint32_t                    nr;
+        } batch;
     } u;
 };
 typedef struct xen_mem_paging_op xen_mem_paging_op_t;
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH RFC v2 23/23] libxc/xc_sr_restore.c: use populate_evicted()
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (21 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 22/23] xen/mem_paging: add a populate_evicted paging op Joshua Otto
@ 2018-06-17 10:18 ` Joshua Otto
  2021-06-02 11:20 ` [Xen-devel] [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Olaf Hering
  23 siblings, 0 replies; 26+ messages in thread
From: Joshua Otto @ 2018-06-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: andrew.cooper3, wei.liu2, Joshua Otto

From: Joshua Otto <jtotto@uwaterloo.ca>

During the transition downtime phase of postcopy live migration, mark
batches of dirty pfns as paged-out using the new populate_evicted()
paging op rather than populating, nominating and evicting each dirty pfn
individually.  This significantly reduces downtime during transitions
with many dirty pfns.

Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
---
 tools/libxc/xc_sr_restore.c | 71 +++++++++++++++++++++++++++++----------------
 1 file changed, 46 insertions(+), 25 deletions(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 3aac0f0..950bbf0 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -672,13 +672,15 @@ static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned int count,
     xc_interface *xch = ctx->xch;
     struct xc_sr_restore_paging *paging = &ctx->restore.paging;
     int rc;
-    unsigned int i;
+    unsigned int i, nr_bpfns = 0, nr_xapfns = 0;
     xen_pfn_t bpfn;
+    xen_pfn_t *bpfns = malloc(count * sizeof(*bpfns)),
+              *xapfns = malloc(count * sizeof(*xapfns));
 
-    rc = populate_pfns(ctx, count, pfns, types);
-    if ( rc )
+    if ( !bpfns || !xapfns )
     {
-        ERROR("Failed to populate pfns for batch of %u pages", count);
+        rc = -1;
+        ERROR("Failed to allocate %zu bytes for pfn arrays",
+              2 * count * sizeof(*bpfns));
         goto out;
     }
 
@@ -686,7 +688,7 @@ static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned int count,
     {
         if ( types[i] < XEN_DOMCTL_PFINFO_BROKEN )
         {
-            bpfn = pfns[i];
+            bpfn = bpfns[nr_bpfns++] = pfns[i];
 
             /* We should never see the same pfn twice at this stage.  */
             if ( !postcopy_pfn_invalid(ctx, bpfn) )
@@ -695,6 +697,42 @@ static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned int count,
                 ERROR("Duplicate postcopy pfn %"PRI_xen_pfn, bpfn);
                 goto out;
             }
+        }
+        else if ( types[i] == XEN_DOMCTL_PFINFO_XALLOC )
+        {
+            xapfns[nr_xapfns++] = pfns[i];
+        }
+    }
+
+    /* Follow the normal path to populate XALLOC pfns... */
+    rc = populate_pfns(ctx, nr_xapfns, xapfns, NULL);
+    if ( rc )
+    {
+        ERROR("Failed to populate pfns for batch of %u pages", nr_xapfns);
+        goto out;
+    }
+
+    /* ... and 'populate' the backed pfns directly into the evicted state. */
+    if ( nr_bpfns )
+    {
+        rc = xc_mem_paging_populate_evicted(xch, ctx->domid, bpfns, nr_bpfns);
+        if ( rc )
+        {
+            ERROR("Failed to evict batch of %u pfns", nr_bpfns);
+            goto out;
+        }
+
+        for ( i = 0; i < nr_bpfns; ++i )
+        {
+            bpfn = bpfns[i];
+
+            /* If it hasn't yet been populated, mark it as so now. */
+            if ( !pfn_is_populated(ctx, bpfn) )
+            {
+                rc = pfn_set_populated(ctx, bpfn);
+                if ( rc )
+                    goto out;
+            }
 
             /*
              * We now consider this pfn 'outstanding' - pending, and not yet
@@ -702,32 +740,15 @@ static int process_postcopy_pfns(struct xc_sr_context *ctx, unsigned int count,
              */
             mark_postcopy_pfn_outstanding(ctx, bpfn);
             ++paging->nr_pending_pfns;
-
-            /*
-             * Neither nomination nor eviction can be permitted to fail - the
-             * guest isn't yet running, so a failure would imply a foreign or
-             * hypervisor mapping on the page, and that would be bogus because
-             * the migration isn't yet complete.
-             */
-            rc = xc_mem_paging_nominate(xch, ctx->domid, bpfn);
-            if ( rc < 0 )
-            {
-                PERROR("Error nominating postcopy pfn %"PRI_xen_pfn, bpfn);
-                goto out;
-            }
-
-            rc = xc_mem_paging_evict(xch, ctx->domid, bpfn);
-            if ( rc < 0 )
-            {
-                PERROR("Error evicting postcopy pfn %"PRI_xen_pfn, bpfn);
-                goto out;
-            }
         }
     }
 
     rc = 0;
 
  out:
+    free(bpfns);
+    free(xapfns);
+
     return rc;
 }
 
-- 
2.7.4




* Re: [PATCH RFC v2 04/23] libxc/xc_sr: naming correction: mfns -> gfns
  2018-06-17 10:18 ` [PATCH RFC v2 04/23] libxc/xc_sr: naming correction: mfns -> gfns Joshua Otto
@ 2018-07-05 15:12   ` Wei Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Wei Liu @ 2018-07-05 15:12 UTC (permalink / raw)
  To: Joshua Otto; +Cc: xen-devel, wei.liu2, Joshua Otto, andrew.cooper3

On Sun, Jun 17, 2018 at 03:18:15AM -0700, Joshua Otto wrote:
> From: Joshua Otto <jtotto@uwaterloo.ca>
> 
> In write_batch() on the migration save side and in process_page_data()
> on the corresponding path on the restore side, a local variable named
> 'mfns' is used to refer to an array of what are actually gfns.  Rename
> both to 'gfns' to address this.
> 
> No functional change.
> 
> Signed-off-by: Joshua Otto <jtotto@uwaterloo.ca>
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Wei Liu <wei.liu2@citrix.com>



* Re: [Xen-devel] [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration
  2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
                   ` (22 preceding siblings ...)
  2018-06-17 10:18 ` [PATCH RFC v2 23/23] libxc/xc_sr_restore.c: use populate_evicted() Joshua Otto
@ 2021-06-02 11:20 ` Olaf Hering
  23 siblings, 0 replies; 26+ messages in thread
From: Olaf Hering @ 2021-06-02 11:20 UTC (permalink / raw)
  To: Joshua Otto; +Cc: xen-devel, andrew.cooper3, wei.liu2, Joshua Otto


On Sun, 17 Jun 2018 03:18:11 -0700,
Joshua Otto <joshua.t.otto@gmail.com> wrote:

> A little over a year ago, I posted a patch series implementing support for
> post-copy live migration via xenpaging [1].

Did that ever go anywhere, or was this a wasted effort?


Olaf



end of thread, other threads:[~2021-06-02 11:21 UTC | newest]

Thread overview: 26+ messages
2018-06-17 10:18 [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 01/23] tools: rename COLO 'postcopy' to 'aftercopy' Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 02/23] libxc/xc_sr: parameterise write_record() on fd Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 03/23] libxc/xc_sr_restore.c: use write_record() in send_checkpoint_dirty_pfn_list() Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 04/23] libxc/xc_sr: naming correction: mfns -> gfns Joshua Otto
2018-07-05 15:12   ` Wei Liu
2018-06-17 10:18 ` [PATCH RFC v2 05/23] libxc/xc_sr_restore: introduce generic 'pages' records Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 06/23] libxc/xc_sr_restore: factor helpers out of handle_page_data() Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 07/23] libxc/migration: tidy the xc_domain_save()/restore() interface Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 08/23] libxc/migration: defer precopy policy to a callback Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 09/23] libxl/migration: wire up the precopy policy RPC callback Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 10/23] libxc/xc_sr_save: introduce save batch types Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 11/23] libxc/migration: correct hvm record ordering specification Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 12/23] libxc/migration: specify postcopy live migration Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 13/23] libxc/migration: add try_read_record() Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 14/23] libxc/migration: implement the sender side of postcopy live migration Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 15/23] libxc/migration: implement the receiver " Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 16/23] libxl/libxl_stream_write.c: track callback chains with an explicit phase Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 17/23] libxl/libxl_stream_read.c: " Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 18/23] libxl/migration: implement the sender side of postcopy live migration Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 19/23] libxl/migration: implement the receiver " Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 20/23] tools: expose postcopy live migration support in libxl and xl Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 21/23] xen/mem_paging: move paging op arguments into a union Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 22/23] xen/mem_paging: add a populate_evicted paging op Joshua Otto
2018-06-17 10:18 ` [PATCH RFC v2 23/23] libxc/xc_sr_restore.c: use populate_evicted() Joshua Otto
2021-06-02 11:20 ` [Xen-devel] [PATCH RFC v2 00/23] Design document and performance evaluation for post-copy live migration Olaf Hering
