* [Qemu-devel] [PATCH v5 00/45] Postcopy implementation
@ 2015-02-25 16:51 Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works Dr. David Alan Gilbert (git)
                   ` (44 more replies)
  0 siblings, 45 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Hi,
  This is the 5th cut of my version of postcopy; it is designed for use with
the Linux kernel additions posted by Andrea Arcangeli here:

git clone --reference linux -b userfault16 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

(Note this is a different API from the last version)

This qemu series can be found at:

https://github.com/orbitfp7/qemu.git
on the wp3-postcopy-v5 tag.

I believe I've fixed most of the issues from the reviews, but there are
probably a few more to cover before the final version:
   * I've got a rare (1/1000 migration) bug I'm fighting
   * The kernel API is still being discussed
   
v5
  New kernel API
     Uses an atomic copy rather than remap to avoid IPIs
  Fix builds on older machines
  Fix segfault if block migration is enabled (thanks to Gary Hook for reporting)
  Fix for mlock (thanks to zhanghailiang for reporting)
  Send postcopy-notify before sending the 'begin' part of device state
  Disable ballooning during postcopy
  Don't allocate zero pages as they migrate
  Stop using sysconf to get pagesizes
    - using getpagesize() and qemu_host_page_size in target dependent parts
  Fix for worst case page size ratio (untested)
     - ARM is 64k host, 1k target; with 32bit longs
  Drop the assertion check on migration_dirty_pages
      I don't like killing source VMs anyway, and I've not seen it trigger in 9+ months
  Free up migration_incoming-state at end of postcopy
  Calculate downtime on source differently for postcopy
  Use trace rather than dprintf
  Paolo's comments: 
    Rename postcopy_ram_sensitise_area
    Tidy up of loadvm loop flags
    Remove one layer of socket_shutdown abstraction
    Remove _RAM_ from many of the postcopy symbols as appropriate
      - most messages/states are no longer RAM-specific and could be
        used for postcopying other state.
    Used a qemu_event to wait for the end of the main thread finishing rather
       than spinning on postcopy state
    Rename inward/outward cmd messages for consistency
    Fixup old-vm-running check for postcopy

  Dave Gibson's comments:
    Lots of cleanup

  Eric's comments
    80col in qapi-schema


Dr. David Alan Gilbert (45):
  Start documenting how postcopy works.
  Split header writing out of qemu_save_state_begin
  qemu_ram_foreach_block: pass up error value, and down the ramblock
    name
  Add qemu_get_counted_string to read a string prefixed by a count byte
  Create MigrationIncomingState
  Provide runtime Target page information
  Return path: Open a return path on QEMUFile for sockets
  Return path: socket_writev_buffer: Block even on non-blocking fd's
  Migration commands
  Return path: Control commands
  Return path: Send responses from destination to source
  Return path: Source handling of return path
  ram_debug_dump_bitmap: Dump a migration bitmap as text
  Move loadvm_handlers into MigrationIncomingState
  Rework loadvm path for subloops
  Add migration-capability boolean for postcopy-ram.
  Add wrappers and handlers for sending/receiving the postcopy-ram
    migration messages.
  MIG_CMD_PACKAGED: Send a packaged chunk of migration stream
  migrate_init: Call from savevm
  Modify savevm handlers for postcopy
  Add Linux userfaultfd header
  postcopy: OS support test
  migrate_start_postcopy: Command to trigger transition to postcopy
  MIG_STATE_POSTCOPY_ACTIVE: Add new migration state
  qemu_savevm_state_complete: Postcopy changes
  Postcopy page-map-incoming (PMI) structure
  Postcopy: Maintain sentmap and calculate discard
  postcopy: Incoming initialisation
  postcopy: ram_enable_notify to switch on userfault
  Postcopy: Postcopy startup in migration thread
  Postcopy end in migration_thread
  Page request:  Add MIG_RP_CMD_REQ_PAGES reverse command
  Page request: Process incoming page request
  Page request: Consume pages off the post-copy queue
  postcopy_ram.c: place_page and helpers
  Postcopy: Use helpers to map pages during migration
  qemu_ram_block_from_host
  Don't sync dirty bitmaps in postcopy
  Host page!=target page: Cleanup bitmaps
  Postcopy; Handle userfault requests
  Start up a postcopy/listener thread ready for incoming page data
  postcopy: Wire up loadvm_postcopy_handle_{run,end} commands
  End of migration for postcopy
  Disable mlock around incoming postcopy
  Inhibit ballooning during postcopy

 arch_init.c                       |  908 +++++++++++++++++++++++++++++++--
 balloon.c                         |   11 +
 docs/migration.txt                |  189 +++++++
 exec.c                            |   76 ++-
 hmp-commands.hx                   |   15 +
 hmp.c                             |    7 +
 hmp.h                             |    1 +
 hw/virtio/virtio-balloon.c        |    4 +-
 include/exec/cpu-all.h            |    2 -
 include/exec/cpu-common.h         |    8 +-
 include/migration/migration.h     |  145 +++++-
 include/migration/postcopy-ram.h  |   99 ++++
 include/migration/qemu-file.h     |   10 +
 include/migration/vmstate.h       |   12 +-
 include/qemu/typedefs.h           |    6 +
 include/sysemu/balloon.h          |    2 +
 include/sysemu/sysemu.h           |   46 +-
 linux-headers/linux/userfaultfd.h |  150 ++++++
 migration/Makefile.objs           |    2 +-
 migration/block.c                 |    7 +-
 migration/migration.c             |  739 +++++++++++++++++++++++++--
 migration/postcopy-ram.c          | 1018 +++++++++++++++++++++++++++++++++++++
 migration/qemu-file-internal.h    |    2 +
 migration/qemu-file-unix.c        |   99 +++-
 migration/qemu-file.c             |   28 +
 migration/rdma.c                  |    4 +-
 qapi-schema.json                  |   15 +-
 qmp-commands.hx                   |   19 +
 savevm.c                          |  866 ++++++++++++++++++++++++++++---
 trace-events                      |   80 ++-
 30 files changed, 4381 insertions(+), 189 deletions(-)
 create mode 100644 include/migration/postcopy-ram.h
 create mode 100644 linux-headers/linux/userfaultfd.h
 create mode 100644 migration/postcopy-ram.c

-- 
2.1.0
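
The note above that the new kernel API "uses an atomic copy rather than remap"
refers to resolving a faulted page by copying its contents into place in one
step.  Purely as a point of reference, here is a minimal sketch of such a copy
using the userfaultfd API as it was later merged into mainline Linux; the
userfault16 branch referenced above is an earlier revision of that interface,
so ioctl numbers and field names may differ, and place_one_page() is an
illustrative name, not a function from this series.

/*
 * Sketch only: install one page of data at a faulting guest address with an
 * atomic UFFDIO_COPY, which also wakes any thread blocked on that address.
 */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int place_one_page(int uffd, void *guest_addr, void *src, size_t pagesize)
{
    struct uffdio_copy copy = {
        .dst  = (uint64_t)(uintptr_t)guest_addr,
        .src  = (uint64_t)(uintptr_t)src,
        .len  = pagesize,
        .mode = 0,
    };

    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0) {
        return -errno;      /* the copy either happened completely or not at all */
    }
    return 0;
}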


* [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-05  3:21   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin Dr. David Alan Gilbert (git)
                   ` (43 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/migration.txt | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)

diff --git a/docs/migration.txt b/docs/migration.txt
index 0492a45..c6c3798 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -294,3 +294,192 @@ save/send this state when we are in the middle of a pio operation
 (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
 not enabled, the values on that fields are garbage and don't need to
 be sent.
+
+= Return path =
+
+In most migration scenarios there is only a single data path that runs
+from the source VM to the destination, typically along a single fd (although
+possibly with another fd or similar for some fast way of throwing pages across).
+
+However, some uses need two-way communication; in particular the Postcopy destination
+needs to be able to request pages on demand from the source.
+
+For these scenarios there is a 'return path' from the destination to the source;
+qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
+path.
+
+  Source side
+     Forward path - written by migration thread
+     Return path  - opened by main thread, read by return-path thread
+
+  Destination side
+     Forward path - read by main thread
+     Return path  - opened by main thread, written by main thread AND postcopy
+                    thread (protected by rp_mutex)
+
+= Postcopy =
+'Postcopy' migration is a way to deal with migrations that refuse to converge;
+its plus side is that there is an upper bound on the amount of migration traffic
+and the time it takes; the downside is that during the postcopy phase, a failure of
+*either* side or of the network connection causes the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+=== Enabling postcopy ===
+
+To enable postcopy (prior to the start of migration):
+
+migrate_set_capability x-postcopy-ram on
+
+The migration will still start in precopy mode, however issuing:
+
+migrate_start_postcopy
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on.  Issuing it after the end of a migration is harmless.
+
+=== Postcopy device transfer ===
+
+Loading of device data may cause the device emulation to access guest RAM,
+which may in turn trigger faults that have to be resolved by the source; as such,
+the migration stream has to be able to respond with page data *during* the
+device load.  Hence the device data has to be read from the stream completely
+before the device load begins, to free the stream up.  This is achieved by
+'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+   Command: 'postcopy ram listen'
+   The device state
+      A series of sections, identical to the precopy stream's device state stream,
+      containing everything except postcopiable devices (i.e. RAM)
+   Command: 'postcopy ram run'
+
+The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
+contents are formatted in the same way as the main migration stream.
+
+Destination behaviour
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+------------------------------------------------------------------------------
+                        1      2   3     4 5                      6   7
+main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
+thread                             |       |
+                                   |     (page request)
+                                   |        \___
+                                   v            \
+listen thread:                     --- page -- page -- page -- page -- page --
+
+                                   a   b        c
+------------------------------------------------------------------------------
+
+On receipt of CMD_PACKAGED (1)
+   All the data associated with the package - the ( ... ) section in the
+diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
+recurses into qemu_loadvm_state_main to process the contents of the package (2),
+which contains commands (3,6) and devices (4...).
+
+On receipt of 'postcopy ram listen' - 3 - (i.e. the 1st command in the package)
+a new thread (a) is started that takes over servicing the migration stream,
+while the main thread carries on loading the package.   It loads normal
+background page data (b), but if a fault happens during a device load (5), the
+returned page (c) is loaded by the listen thread, allowing the main thread's
+device load to carry on.
+
+The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
+CPUs start running.
+At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
+and is no longer used by migration, while the listen thread carries
+on servicing page data until the end of migration.
+
+=== Postcopy states ===
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->LISTEN->RUNNING->END
+
+  Advise: Set at the start of migration if postcopy is enabled, even
+          if the start command hasn't been issued; here the destination
+          checks that its OS has the support needed for postcopy, and performs
+          setup to ensure the RAM mappings are suitable for later postcopy.
+          (Triggered by reception of POSTCOPY_ADVISE command)
+
+  Listen: The first command in the package, POSTCOPY_LISTEN, switches
+          the destination state to Listen, and starts a new thread
+          (the 'listen thread') which takes over the job of receiving
+          pages off the migration stream, while the main thread carries
+          on processing the blob.  With this thread able to process page
+          reception, the destination now 'sensitises' the RAM to detect
+          any access to missing pages (on Linux using the 'userfault'
+          system).
+
+  Running: POSTCOPY_RUN causes the destination to synchronise all
+          state and start the CPUs and IO devices running.  The main
+          thread now finishes processing the migration package and
+          now carries on as it would for normal precopy migration
+          carries on as it would for normal precopy migration
+          finishes a normal migration).
+
+  End: The listen thread can now quit and perform the cleanup of migration
+          state; the migration is now complete.
+
+=== Source side page maps ===
+
+The source side keeps two bitmaps during postcopy: the 'migration bitmap'
+and the 'sent map'.  The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that a page is 'dirty' -
+i.e. needs sending.  During the precopy phase this is updated as the CPU
+dirties pages; however, during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'sent map' is used for the transition to postcopy. It is a bitmap that
+has a bit set whenever a page is sent to the destination; however, during
+the transition to postcopy mode it is masked against the migration bitmap
+(sentmap &= migrationbitmap) to generate a bitmap recording pages that
+have previously been sent but are now dirty again.  This masked
+sentmap is sent to the destination, which discards those now-dirty pages
+before starting the CPUs.
+
+Note that once in postcopy mode, the sent map is still updated; however,
+its contents are not necessarily consistent with the pages already sent
+due to the masking with the migration bitmap.
+
+=== Destination side page maps ===
+
+(Needs to be changed so we can update both easily - at the moment updates are done
+ with a lock)
+The destination keeps a state for each page, which is 'missing', 'received'
+or 'requested'; these three states are encoded in a 2-bit state array.
+Incoming requests from the kernel cause the state to transition from 'missing'
+to 'requested'.   Received pages cause a transition from either 'missing' or
+'requested' to 'received'; the kernel is notified on reception to wake up
+any threads that were waiting for the page.
+If the kernel requests a page that has already been 'received', the kernel is
+notified without re-requesting the page from the source.
+
+This leads to three valid page states and four state transitions:
+page states:
+    missing        - page not yet received or requested
+    received       - page received
+    requested      - page requested but not yet received
+
+state transitions:
+      received -> missing   (only during setup/discard)
+      missing -> received   (normal incoming page)
+      requested -> received (incoming page previously requested)
+      missing -> requested  (userfault request)
-- 
2.1.0
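
Two of the structures described in the documentation above lend themselves to
short standalone illustrations.  First, the "Source side page maps" section
gives the discard calculation as a formula (sentmap &= migrationbitmap); a
minimal sketch of that masking step over plain unsigned-long bitmaps (not the
series' actual code, which uses QEMU's bitmap helpers and different names)
could look like:

#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Keep in 'sentmap' only the pages that were already sent AND are dirty
 * again; these are the pages the destination must discard before running. */
static void mask_sentmap(unsigned long *sentmap,
                         const unsigned long *migbitmap, size_t npages)
{
    size_t i, nlongs = (npages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    for (i = 0; i < nlongs; i++) {
        sentmap[i] &= migbitmap[i];
    }
}

Second, the "Destination side page maps" section describes a 2-bit-per-page
state array holding 'missing', 'requested' and 'received'.  A hedged sketch of
such an encoding, together with the four legal transitions listed above, might
look like the following; the names and layout are illustrative and are not the
page-map-incoming (PMI) structure added later in this series:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    PAGE_MISSING   = 0,  /* not yet received or requested */
    PAGE_REQUESTED = 1,  /* userfault raised, waiting on the source */
    PAGE_RECEIVED  = 2,  /* page data has arrived */
} PostcopyPageState;

/* Two bits per page, packed four pages to a byte. */
static PostcopyPageState get_page_state(const uint8_t *map, size_t page)
{
    return (PostcopyPageState)((map[page / 4] >> ((page % 4) * 2)) & 0x3);
}

static void set_page_state(uint8_t *map, size_t page, PostcopyPageState s)
{
    unsigned shift = (page % 4) * 2;

    map[page / 4] = (uint8_t)((map[page / 4] & ~(0x3u << shift)) | (s << shift));
}

/* Only the transitions listed in the documentation are legal. */
static bool page_state_transition_ok(PostcopyPageState from, PostcopyPageState to)
{
    return (from == PAGE_RECEIVED  && to == PAGE_MISSING)    /* setup/discard  */
        || (from == PAGE_MISSING   && to == PAGE_RECEIVED)   /* incoming page  */
        || (from == PAGE_REQUESTED && to == PAGE_RECEIVED)   /* requested page */
        || (from == PAGE_MISSING   && to == PAGE_REQUESTED); /* userfault      */
}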


* [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  1:05   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name Dr. David Alan Gilbert (git)
                   ` (42 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Split qemu_savevm_state_begin into:
  qemu_savevm_state_header  which writes the initial file header.
  qemu_savevm_state_begin   which sets up devices and does the first
                            device pass.

Used later in postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  1 +
 savevm.c                | 11 ++++++++---
 trace-events            |  1 +
 4 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 748d059..de1c885 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
+void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
 void qemu_savevm_state_cancel(void);
diff --git a/migration/migration.c b/migration/migration.c
index b3adbc6..4a06d79 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -597,6 +597,7 @@ static void *migration_thread(void *opaque)
     int64_t start_time = initial_time;
     bool old_vm_running = false;
 
+    qemu_savevm_state_header(s->file);
     qemu_savevm_state_begin(s->file, &s->params);
 
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
diff --git a/savevm.c b/savevm.c
index 8040766..192110a 100644
--- a/savevm.c
+++ b/savevm.c
@@ -616,6 +616,13 @@ bool qemu_savevm_state_blocked(Error **errp)
     return false;
 }
 
+void qemu_savevm_state_header(QEMUFile *f)
+{
+    trace_savevm_state_header();
+    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+}
+
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params)
 {
@@ -630,9 +637,6 @@ void qemu_savevm_state_begin(QEMUFile *f,
         se->ops->set_params(params, se->opaque);
     }
 
-    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
-    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
-
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;
 
@@ -834,6 +838,7 @@ static int qemu_savevm_state(QEMUFile *f)
     }
 
     qemu_mutex_unlock_iothread();
+    qemu_savevm_state_header(f);
     qemu_savevm_state_begin(f, &params);
     qemu_mutex_lock_iothread();
 
diff --git a/trace-events b/trace-events
index f87b077..83231d7 100644
--- a/trace-events
+++ b/trace-events
@@ -1170,6 +1170,7 @@ qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint
 savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
 savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
 savevm_state_begin(void) ""
+savevm_state_header(void) ""
 savevm_state_iterate(void) ""
 savevm_state_complete(void) ""
 savevm_state_cancel(void) ""
-- 
2.1.0


* [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10 15:30   ` Eric Blake
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte Dr. David Alan Gilbert (git)
                   ` (41 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Check the return value of the function it calls and error if it's non-zero.
Fix up qemu_rdma_init_one_block, which is the only current caller,
  and __qemu_rdma_add_block, the only function it calls using it.

Pass the name of the ramblock to the function; it helps in debugging.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 exec.c                    | 10 ++++++++--
 include/exec/cpu-common.h |  4 ++--
 migration/rdma.c          |  4 ++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 6dff7bc..018b07a 100644
--- a/exec.c
+++ b/exec.c
@@ -2944,12 +2944,18 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
              memory_region_is_romd(mr));
 }
 
-void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
 {
     RAMBlock *block;
+    int ret;
 
     QTAILQ_FOREACH(block, &ram_list.blocks, next) {
-        func(block->host, block->offset, block->used_length, opaque);
+        ret = func(block->idstr, block->host, block->offset,
+                   block->used_length, opaque);
+        if (ret) {
+            return ret;
+        }
     }
+    return 0;
 }
 #endif
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 427b851..a31300c 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -124,10 +124,10 @@ void cpu_flush_icache_range(hwaddr start, int len);
 extern struct MemoryRegion io_mem_rom;
 extern struct MemoryRegion io_mem_notdirty;
 
-typedef void (RAMBlockIterFunc)(void *host_addr,
+typedef int (RAMBlockIterFunc)(const char *block_name, void *host_addr,
     ram_addr_t offset, ram_addr_t length, void *opaque);
 
-void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
 
 #endif
 
diff --git a/migration/rdma.c b/migration/rdma.c
index 6bee30c..d5cb6b7 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -569,10 +569,10 @@ static int __qemu_rdma_add_block(RDMAContext *rdma, void *host_addr,
  * in advanced before the migration starts. This tells us where the RAM blocks
  * are so that we can register them individually.
  */
-static void qemu_rdma_init_one_block(void *host_addr,
+static int qemu_rdma_init_one_block(const char *block_name, void *host_addr,
     ram_addr_t block_offset, ram_addr_t length, void *opaque)
 {
-    __qemu_rdma_add_block(opaque, host_addr, block_offset, length);
+    return __qemu_rdma_add_block(opaque, host_addr, block_offset, length);
 }
 
 /*
-- 
2.1.0


* [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  1:12   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState Dr. David Alan Gilbert (git)
                   ` (40 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

and use it in loadvm_state and ram_load.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                   |  5 +----
 include/migration/qemu-file.h |  3 +++
 migration/qemu-file.c         | 16 ++++++++++++++++
 savevm.c                      | 11 ++++++-----
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 89c8fa4..91645cc 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -1077,13 +1077,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             total_ram_bytes = addr;
             while (!ret && total_ram_bytes) {
                 RAMBlock *block;
-                uint8_t len;
                 char id[256];
                 ram_addr_t length;
 
-                len = qemu_get_byte(f);
-                qemu_get_buffer(f, (uint8_t *)id, len);
-                id[len] = 0;
+                qemu_get_counted_string(f, id);
                 length = qemu_get_be64(f);
 
                 QTAILQ_FOREACH(block, &ram_list.blocks, next) {
diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index a923cec..6ae0b03 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -310,4 +310,7 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 {
     qemu_get_be64s(f, (uint64_t *)pv);
 }
+
+int qemu_get_counted_string(QEMUFile *f, char buf[256]);
+
 #endif
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index e66e557..57eb868 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -545,3 +545,19 @@ uint64_t qemu_get_be64(QEMUFile *f)
     v |= qemu_get_be32(f);
     return v;
 }
+
+/*
+ * Get a string whose length is determined by a single preceding byte
+ * A preallocated 256 byte buffer must be passed in.
+ * Returns: 0 on success and a 0 terminated string in the buffer
+ */
+int qemu_get_counted_string(QEMUFile *f, char buf[256])
+{
+    unsigned int len = qemu_get_byte(f);
+    int res = qemu_get_buffer(f, (uint8_t *)buf, len);
+
+    buf[len] = 0;
+
+    return res != len;
+}
+
diff --git a/savevm.c b/savevm.c
index 192110a..2f8ef45 100644
--- a/savevm.c
+++ b/savevm.c
@@ -960,8 +960,7 @@ int qemu_loadvm_state(QEMUFile *f)
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
         SaveStateEntry *se;
-        char idstr[257];
-        int len;
+        char idstr[256];
 
         trace_qemu_loadvm_state_section(section_type);
         switch (section_type) {
@@ -969,9 +968,11 @@ int qemu_loadvm_state(QEMUFile *f)
         case QEMU_VM_SECTION_FULL:
             /* Read section start */
             section_id = qemu_get_be32(f);
-            len = qemu_get_byte(f);
-            qemu_get_buffer(f, (uint8_t *)idstr, len);
-            idstr[len] = 0;
+            if (qemu_get_counted_string(f, idstr)) {
+                error_report("Unable to read ID string for section %u",
+                            section_id);
+                return -EINVAL;
+            }
             instance_id = qemu_get_be32(f);
             version_id = qemu_get_be32(f);
 
-- 
2.1.0


* [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  2:37   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information Dr. David Alan Gilbert (git)
                   ` (39 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

There are currently lots of pieces of incoming migration state scattered
around, postcopy is adding more, and it seems better to try to keep
it together.

Allocate MIS in process_incoming_migration_co.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  9 +++++++++
 include/qemu/typedefs.h       |  1 +
 migration/migration.c         | 28 ++++++++++++++++++++++++++++
 savevm.c                      |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index f37348b..8505543 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -42,6 +42,15 @@ struct MigrationParams {
 
 typedef struct MigrationState MigrationState;
 
+/* State for the incoming migration */
+struct MigrationIncomingState {
+    QEMUFile *file;
+};
+
+MigrationIncomingState *migration_incoming_get_current(void);
+MigrationIncomingState *migration_incoming_state_new(QEMUFile *f);
+void migration_incoming_state_destroy(void);
+
 struct MigrationState
 {
     int64_t bandwidth_limit;
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index cde3314..74dfad3 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -38,6 +38,7 @@ typedef struct MemoryListener MemoryListener;
 typedef struct MemoryMappingList MemoryMappingList;
 typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionSection MemoryRegionSection;
+typedef struct MigrationIncomingState MigrationIncomingState;
 typedef struct MigrationParams MigrationParams;
 typedef struct Monitor Monitor;
 typedef struct MouseTransformInfo MouseTransformInfo;
diff --git a/migration/migration.c b/migration/migration.c
index 4a06d79..a36ea65 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -53,6 +53,7 @@ static NotifierList migration_state_notifiers =
    migrations at once.  For now we don't need to add
    dynamic creation of migration */
 
+/* For outgoing */
 MigrationState *migrate_get_current(void)
 {
     static MigrationState current_migration = {
@@ -65,6 +66,28 @@ MigrationState *migrate_get_current(void)
     return &current_migration;
 }
 
+/* For incoming */
+static MigrationIncomingState *mis_current;
+
+MigrationIncomingState *migration_incoming_get_current(void)
+{
+    return mis_current;
+}
+
+MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
+{
+    mis_current = g_malloc0(sizeof(MigrationIncomingState));
+    mis_current->file = f;
+
+    return mis_current;
+}
+
+void migration_incoming_state_destroy(void)
+{
+    g_free(mis_current);
+    mis_current = NULL;
+}
+
 void qemu_start_incoming_migration(const char *uri, Error **errp)
 {
     const char *p;
@@ -94,9 +117,14 @@ static void process_incoming_migration_co(void *opaque)
     Error *local_err = NULL;
     int ret;
 
+    migration_incoming_state_new(f);
+
     ret = qemu_loadvm_state(f);
+
     qemu_fclose(f);
     free_xbzrle_decoded_buf();
+    migration_incoming_state_destroy();
+
     if (ret < 0) {
         error_report("load of migration failed: %s", strerror(-ret));
         exit(EXIT_FAILURE);
diff --git a/savevm.c b/savevm.c
index 2f8ef45..cce7ff0 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1290,9 +1290,11 @@ int load_vmstate(const char *name)
     }
 
     qemu_system_reset(VMRESET_SILENT);
+    migration_incoming_state_new(f);
     ret = qemu_loadvm_state(f);
 
     qemu_fclose(f);
+    migration_incoming_state_destroy();
     if (ret < 0) {
         error_report("Error %d while loading VM state", ret);
         return ret;
-- 
2.1.0


* [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (4 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  2:38   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets Dr. David Alan Gilbert (git)
                   ` (38 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The migration code is generally built target-independent; however,
there are a few places where knowing the target page size would
avoid artificially moving stuff into arch_init.

Provide 'qemu_target_page_bits()' that returns TARGET_PAGE_BITS
to other bits of code so that they can stay target-independent.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                  | 10 ++++++++++
 include/sysemu/sysemu.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/exec.c b/exec.c
index 018b07a..eafd964 100644
--- a/exec.c
+++ b/exec.c
@@ -2915,6 +2915,16 @@ int cpu_memory_rw_debug(CPUState *cpu, target_ulong addr,
     }
     return 0;
 }
+
+/*
+ * Allows code that needs to deal with migration bitmaps etc to still be built
+ * target independent.
+ */
+size_t qemu_target_page_bits(void)
+{
+    return TARGET_PAGE_BITS;
+}
+
 #endif
 
 /*
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index de1c885..ebab098 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -68,6 +68,7 @@ int qemu_reset_requested_get(void);
 void qemu_system_killed(int signal, pid_t pid);
 void qemu_devices_reset(void);
 void qemu_system_reset(bool report);
+size_t qemu_target_page_bits(void);
 
 void qemu_add_exit_notifier(Notifier *notify);
 void qemu_remove_exit_notifier(Notifier *notify);
-- 
2.1.0


* [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (5 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  2:49   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's Dr. David Alan Gilbert (git)
                   ` (37 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Postcopy needs a method to send messages from the destination back to
the source; this is the 'return path'.

Wire it up for 'socket' QEMUFiles using a dup'd fd.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/qemu-file.h  |  7 +++++
 migration/qemu-file-internal.h |  2 ++
 migration/qemu-file-unix.c     | 58 +++++++++++++++++++++++++++++++++++-------
 migration/qemu-file.c          | 12 +++++++++
 4 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 6ae0b03..3c38963 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -85,6 +85,11 @@ typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
                                int *bytes_sent);
 
 /*
+ * Return a QEMUFile for comms in the opposite direction
+ */
+typedef QEMUFile *(QEMURetPathFunc)(void *opaque);
+
+/*
  * Stop any read or write (depending on flags) on the underlying
  * transport on the QEMUFile.
  * Existing blocking reads/writes must be woken
@@ -102,6 +107,7 @@ typedef struct QEMUFileOps {
     QEMURamHookFunc *after_ram_iterate;
     QEMURamHookFunc *hook_ram_load;
     QEMURamSaveFunc *save_page;
+    QEMURetPathFunc *get_return_path;
     QEMUFileShutdownFunc *shut_down;
 } QEMUFileOps;
 
@@ -188,6 +194,7 @@ int64_t qemu_file_get_rate_limit(QEMUFile *f);
 int qemu_file_get_error(QEMUFile *f);
 void qemu_file_set_error(QEMUFile *f, int ret);
 int qemu_file_shutdown(QEMUFile *f);
+QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
diff --git a/migration/qemu-file-internal.h b/migration/qemu-file-internal.h
index d95e853..a39b8e3 100644
--- a/migration/qemu-file-internal.h
+++ b/migration/qemu-file-internal.h
@@ -48,6 +48,8 @@ struct QEMUFile {
     unsigned int iovcnt;
 
     int last_error;
+
+    struct QEMUFile *return_path;
 };
 
 #endif
diff --git a/migration/qemu-file-unix.c b/migration/qemu-file-unix.c
index bfbc086..50291cf 100644
--- a/migration/qemu-file-unix.c
+++ b/migration/qemu-file-unix.c
@@ -96,6 +96,45 @@ static int socket_shutdown(void *opaque, bool rd, bool wr)
     }
 }
 
+/*
+ * Give a QEMUFile* off the same socket but data in the opposite
+ * direction.
+ */
+static QEMUFile *socket_dup_return_path(void *opaque)
+{
+    QEMUFileSocket *qfs = opaque;
+    int revfd;
+    bool this_is_read;
+    QEMUFile *result;
+
+    /* We should only be called once to get a RP on a file */
+    assert(!qfs->file->return_path);
+
+    if (qemu_file_get_error(qfs->file)) {
+        /* If the forward file is in error, don't try and open a return */
+        return NULL;
+    }
+
+    /* I don't think there's a better way to tell which direction 'this' is */
+    this_is_read = qfs->file->ops->get_buffer != NULL;
+
+    revfd = dup(qfs->fd);
+    if (revfd == -1) {
+        error_report("Error duplicating fd for return path: %s",
+                      strerror(errno));
+        return NULL;
+    }
+
+    result = qemu_fopen_socket(revfd, this_is_read ? "wb" : "rb");
+    qfs->file->return_path = result;
+
+    if (!result) {
+        close(revfd);
+    }
+
+    return result;
+}
+
 static ssize_t unix_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
                                   int64_t pos)
 {
@@ -204,18 +243,19 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
 }
 
 static const QEMUFileOps socket_read_ops = {
-    .get_fd     = socket_get_fd,
-    .get_buffer = socket_get_buffer,
-    .close      = socket_close,
-    .shut_down  = socket_shutdown
-
+    .get_fd          = socket_get_fd,
+    .get_buffer      = socket_get_buffer,
+    .close           = socket_close,
+    .shut_down       = socket_shutdown,
+    .get_return_path = socket_dup_return_path
 };
 
 static const QEMUFileOps socket_write_ops = {
-    .get_fd        = socket_get_fd,
-    .writev_buffer = socket_writev_buffer,
-    .close         = socket_close,
-    .shut_down     = socket_shutdown
+    .get_fd          = socket_get_fd,
+    .writev_buffer   = socket_writev_buffer,
+    .close           = socket_close,
+    .shut_down       = socket_shutdown,
+    .get_return_path = socket_dup_return_path
 };
 
 QEMUFile *qemu_fopen_socket(int fd, const char *mode)
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 57eb868..02122a5 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -42,6 +42,18 @@ int qemu_file_shutdown(QEMUFile *f)
     return f->ops->shut_down(f->opaque, true, true);
 }
 
+/*
+ * Result: QEMUFile* for a 'return path' for comms in the opposite direction
+ *         NULL if not available
+ */
+QEMUFile *qemu_file_get_return_path(QEMUFile *f)
+{
+    if (!f->ops->get_return_path) {
+        return NULL;
+    }
+    return f->ops->get_return_path(f->opaque);
+}
+
 bool qemu_file_mode_is_not_valid(const char *mode)
 {
     if (mode == NULL ||
-- 
2.1.0
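
The core idea above is that a connected socket is already bidirectional, so the
return path can simply be a dup() of the same fd wrapped for the opposite
direction.  Reduced to plain POSIX (not the QEMUFile/QEMUFileOps machinery the
patch actually uses), and with an illustrative helper name, that looks like:

#include <stdio.h>
#include <unistd.h>

/* Given the migration socket fd, return a stream for traffic in the opposite
 * direction: a read-side forward path gets a write-side return path and
 * vice versa.  Sketch only. */
static FILE *open_return_path(int migration_fd, int forward_is_read_side)
{
    int rev = dup(migration_fd);
    FILE *rp;

    if (rev == -1) {
        return NULL;
    }
    rp = fdopen(rev, forward_is_read_side ? "wb" : "rb");
    if (!rp) {
        close(rev);
    }
    return rp;
}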


* [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (6 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  2:56   ` David Gibson
  2015-03-28 15:30   ` Paolo Bonzini
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 09/45] Migration commands Dr. David Alan Gilbert (git)
                   ` (36 subsequent siblings)
  44 siblings, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The return path uses a non-blocking fd so as not to block waiting
for the (possibly broken) destination to finish returning a message;
however, we still want writes of outbound data to block just as they
would on a blocking fd.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/qemu-file-unix.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/migration/qemu-file-unix.c b/migration/qemu-file-unix.c
index 50291cf..218dbd0 100644
--- a/migration/qemu-file-unix.c
+++ b/migration/qemu-file-unix.c
@@ -39,12 +39,43 @@ static ssize_t socket_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
     QEMUFileSocket *s = opaque;
     ssize_t len;
     ssize_t size = iov_size(iov, iovcnt);
+    ssize_t offset = 0;
+    int     err;
 
-    len = iov_send(s->fd, iov, iovcnt, 0, size);
-    if (len < size) {
-        len = -socket_error();
-    }
-    return len;
+    while (size > 0) {
+        len = iov_send(s->fd, iov, iovcnt, offset, size);
+
+        if (len > 0) {
+            size -= len;
+            offset += len;
+        }
+
+        if (size > 0) {
+            err = socket_error();
+
+            if (err != EAGAIN) {
+                error_report("socket_writev_buffer: Got err=%d for (%zd/%zd)",
+                             err, size, len);
+                /*
+                 * If I've already sent some but only just got the error, I
+                 * could return the amount validly sent so far and wait for the
+                 * next call to report the error, but I'd rather flag the error
+                 * immediately.
+                 */
+                return -err;
+            }
+
+            /* Emulate blocking */
+            GPollFD pfd;
+
+            pfd.fd = s->fd;
+            pfd.events = G_IO_OUT | G_IO_ERR;
+            pfd.revents = 0;
+            g_poll(&pfd, 1 /* 1 fd */, -1 /* no timeout */);
+        }
+     }
+
+    return offset;
 }
 
 static int socket_get_fd(void *opaque)
-- 
2.1.0
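
The technique above - retrying on EAGAIN and waiting with g_poll() for the
socket to become writable, so that a non-blocking fd still behaves like a
blocking one for writes - reduces to the following plain poll()/write() sketch.
It is illustrative only: write_all_blocking() is a made-up helper, and the
patch itself works on iovecs via iov_send().

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

static ssize_t write_all_blocking(int fd, const void *buf, size_t size)
{
    size_t done = 0;

    while (done < size) {
        ssize_t len = write(fd, (const char *)buf + done, size - done);

        if (len > 0) {
            done += len;
            continue;
        }
        if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK &&
            errno != EINTR) {
            return -errno;              /* hard error: report it immediately */
        }
        /* Emulate a blocking write: sleep until the fd is writable again. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        poll(&pfd, 1, -1);
    }
    return done;
}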


* [Qemu-devel] [PATCH v5 09/45] Migration commands
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (7 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  4:58   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 10/45] Return path: Control commands Dr. David Alan Gilbert (git)
                   ` (35 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Create QEMU_VM_COMMAND section type for sending commands from
source to destination.  These commands are not intended to convey
guest state but to control the migration process.

For use in postcopy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  1 +
 include/sysemu/sysemu.h       |  7 +++++++
 savevm.c                      | 48 +++++++++++++++++++++++++++++++++++++++++++
 trace-events                  |  1 +
 4 files changed, 57 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 8505543..1b1dc34 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -34,6 +34,7 @@
 #define QEMU_VM_SECTION_FULL         0x04
 #define QEMU_VM_SUBSECTION           0x05
 #define QEMU_VM_VMDESCRIPTION        0x06
+#define QEMU_VM_COMMAND              0x07
 
 struct MigrationParams {
     bool blk;
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index ebab098..88e5e76 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -82,6 +82,11 @@ void do_info_snapshots(Monitor *mon, const QDict *qdict);
 
 void qemu_announce_self(void);
 
+/* Subcommands for QEMU_VM_COMMAND */
+enum qemu_vm_cmd {
+    MIG_CMD_INVALID = 0,   /* Must be 0 */
+};
+
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
@@ -90,6 +95,8 @@ int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
 void qemu_savevm_state_cancel(void);
 uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
+void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
+                              uint16_t len, uint8_t *data);
 int qemu_loadvm_state(QEMUFile *f);
 
 /* SLIRP */
diff --git a/savevm.c b/savevm.c
index cce7ff0..3d04ba1 100644
--- a/savevm.c
+++ b/savevm.c
@@ -602,6 +602,25 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se, QJSON *vmdesc)
     vmstate_save_state(f, se->vmsd, se->opaque, vmdesc);
 }
 
+
+/* Send a 'QEMU_VM_COMMAND' type element with the command
+ * and associated data.
+ */
+void qemu_savevm_command_send(QEMUFile *f,
+                              enum qemu_vm_cmd command,
+                              uint16_t len,
+                              uint8_t *data)
+{
+    uint32_t tmp = (uint16_t)command;
+    qemu_put_byte(f, QEMU_VM_COMMAND);
+    qemu_put_be16(f, tmp);
+    qemu_put_be16(f, len);
+    if (len) {
+        qemu_put_buffer(f, data, len);
+    }
+    qemu_fflush(f);
+}
+
 bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
@@ -918,6 +937,29 @@ static SaveStateEntry *find_se(const char *idstr, int instance_id)
     return NULL;
 }
 
+/*
+ * Process an incoming 'QEMU_VM_COMMAND'
+ * negative return on error (will issue error message)
+ */
+static int loadvm_process_command(QEMUFile *f)
+{
+    uint16_t com;
+    uint16_t len;
+
+    com = qemu_get_be16(f);
+    len = qemu_get_be16(f);
+
+    trace_loadvm_process_command(com, len);
+    switch (com) {
+
+    default:
+        error_report("VM_COMMAND 0x%x unknown (len 0x%x)", com, len);
+        return -1;
+    }
+
+    return 0;
+}
+
 typedef struct LoadStateEntry {
     QLIST_ENTRY(LoadStateEntry) entry;
     SaveStateEntry *se;
@@ -1033,6 +1075,12 @@ int qemu_loadvm_state(QEMUFile *f)
                 goto out;
             }
             break;
+        case QEMU_VM_COMMAND:
+            ret = loadvm_process_command(f);
+            if (ret < 0) {
+                goto out;
+            }
+            break;
         default:
             error_report("Unknown savevm section type %d", section_type);
             ret = -EINVAL;
diff --git a/trace-events b/trace-events
index 83231d7..4e2fbc8 100644
--- a/trace-events
+++ b/trace-events
@@ -1167,6 +1167,7 @@ vmware_setmode(uint32_t w, uint32_t h, uint32_t bpp) "%dx%d @ %d bpp"
 qemu_loadvm_state_section(unsigned int section_type) "%d"
 qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
+loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
 savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
 savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
 savevm_state_begin(void) ""
-- 
2.1.0
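
For reference, the framing that qemu_savevm_command_send() above puts on the
wire is: one QEMU_VM_COMMAND section-type byte (0x07), a big-endian 16-bit
command id, a big-endian 16-bit payload length, then the payload itself.  A
standalone encoder for that layout might look like the sketch below;
encode_vm_command() is an illustrative name, not a function from this series.

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t encode_vm_command(uint8_t *out, uint16_t cmd,
                                const uint8_t *data, uint16_t len)
{
    uint16_t be;

    out[0] = 0x07;                      /* QEMU_VM_COMMAND section type */
    be = htons(cmd);                    /* command id, network byte order */
    memcpy(out + 1, &be, sizeof(be));
    be = htons(len);                    /* payload length */
    memcpy(out + 3, &be, sizeof(be));
    if (len) {
        memcpy(out + 5, data, len);
    }
    return 5 + (size_t)len;
}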


* [Qemu-devel] [PATCH v5 10/45] Return path: Control commands
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (8 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 09/45] Migration commands Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  5:40   ` David Gibson
  2015-03-28 15:32   ` Paolo Bonzini
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source Dr. David Alan Gilbert (git)
                   ` (34 subsequent siblings)
  44 siblings, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add two src->dest commands:
   * OPEN_RETURN_PATH - Request that the destination open the return path
   * PING - Request an acknowledgement from the destination

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  2 ++
 include/sysemu/sysemu.h       |  6 ++++-
 savevm.c                      | 59 +++++++++++++++++++++++++++++++++++++++++++
 trace-events                  |  2 ++
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 1b1dc34..c514dd4 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -46,6 +46,8 @@ typedef struct MigrationState MigrationState;
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *file;
+
+    QEMUFile *return_path;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 88e5e76..8da879f 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,7 +84,9 @@ void qemu_announce_self(void);
 
 /* Subcommands for QEMU_VM_COMMAND */
 enum qemu_vm_cmd {
-    MIG_CMD_INVALID = 0,   /* Must be 0 */
+    MIG_CMD_INVALID = 0,       /* Must be 0 */
+    MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
+    MIG_CMD_PING,              /* Request a PONG on the RP */
 };
 
 bool qemu_savevm_state_blocked(Error **errp);
@@ -97,6 +99,8 @@ void qemu_savevm_state_cancel(void);
 uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
 void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
                               uint16_t len, uint8_t *data);
+void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
+void qemu_savevm_send_open_return_path(QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f);
 
 /* SLIRP */
diff --git a/savevm.c b/savevm.c
index 3d04ba1..d082738 100644
--- a/savevm.c
+++ b/savevm.c
@@ -621,6 +621,20 @@ void qemu_savevm_command_send(QEMUFile *f,
     qemu_fflush(f);
 }
 
+void qemu_savevm_send_ping(QEMUFile *f, uint32_t value)
+{
+    uint32_t buf;
+
+    trace_savevm_send_ping(value);
+    buf = cpu_to_be32(value);
+    qemu_savevm_command_send(f, MIG_CMD_PING, 4, (uint8_t *)&buf);
+}
+
+void qemu_savevm_send_open_return_path(QEMUFile *f)
+{
+    qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
+}
+
 bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
@@ -937,20 +951,65 @@ static SaveStateEntry *find_se(const char *idstr, int instance_id)
     return NULL;
 }
 
+static int loadvm_process_command_simple_lencheck(const char *name,
+                                                  unsigned int actual,
+                                                  unsigned int expected)
+{
+    if (actual != expected) {
+        error_report("%s received with bad length - expecting %d, got %d",
+                     name, expected, actual);
+        return -1;
+    }
+
+    return 0;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * negative return on error (will issue error message)
  */
 static int loadvm_process_command(QEMUFile *f)
 {
+    MigrationIncomingState *mis = migration_incoming_get_current();
     uint16_t com;
     uint16_t len;
+    uint32_t tmp32;
 
     com = qemu_get_be16(f);
     len = qemu_get_be16(f);
 
     trace_loadvm_process_command(com, len);
     switch (com) {
+    case MIG_CMD_OPEN_RETURN_PATH:
+        if (loadvm_process_command_simple_lencheck("CMD_OPEN_RETURN_PATH",
+                                                   len, 0)) {
+            return -1;
+        }
+        if (mis->return_path) {
+            error_report("CMD_OPEN_RETURN_PATH called when RP already open");
+            /* Not really a problem, so don't give up */
+            return 0;
+        }
+        mis->return_path = qemu_file_get_return_path(f);
+        if (!mis->return_path) {
+            error_report("CMD_OPEN_RETURN_PATH failed");
+            return -1;
+        }
+        break;
+
+    case MIG_CMD_PING:
+        if (loadvm_process_command_simple_lencheck("CMD_PING", len, 4)) {
+            return -1;
+        }
+        tmp32 = qemu_get_be32(f);
+        trace_loadvm_process_command_ping(tmp32);
+        if (!mis->return_path) {
+            error_report("CMD_PING (0x%x) received with no return path",
+                         tmp32);
+            return -1;
+        }
+        /* migrate_send_rp_pong(mis, tmp32); TODO: gets added later */
+        break;
 
     default:
         error_report("VM_COMMAND 0x%x unknown (len 0x%x)", com, len);
diff --git a/trace-events b/trace-events
index 4e2fbc8..99e00b5 100644
--- a/trace-events
+++ b/trace-events
@@ -1168,8 +1168,10 @@ qemu_loadvm_state_section(unsigned int section_type) "%d"
 qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
+loadvm_process_command_ping(uint32_t val) "%x"
 savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
 savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
+savevm_send_ping(uint32_t val) "%x"
 savevm_state_begin(void) ""
 savevm_state_header(void) ""
 savevm_state_iterate(void) ""
-- 
2.1.0


* [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (9 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 10/45] Return path: Control commands Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  5:47   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path Dr. David Alan Gilbert (git)
                   ` (33 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add migrate_send_rp_message to send a message from destination to source
  along the return path (it uses a mutex so that it can be called from
  multiple threads).
Add migrate_send_rp_shut to send a 'shut' message to indicate
  that the destination has finished with the RP.
Add migrate_send_rp_pong to send a 'PONG' message in response to a PING,
  and use it in the CMD_PING handler.
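
For orientation (a sketch, not part of the patch): the return-path wire
format used by these helpers is simply a be16 command, a be16 length and
'length' bytes of payload.  A minimal, illustrative use of the new helpers
on the destination side might look like this (the function name and the
'failed' flag are invented for the example):

  /* Sketch only: reply to a PING and then finish with the return path */
  static void example_rp_usage(MigrationIncomingState *mis, uint32_t seq,
                               bool failed)
  {
      /* MIG_RP_CMD_PONG carries the 4-byte big-endian sequence value */
      migrate_send_rp_pong(mis, seq);

      /* When we're done with the RP: 0 = clean, non-zero = error */
      migrate_send_rp_shut(mis, failed ? 1 : 0);
  }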

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h | 17 ++++++++++++++++
 migration/migration.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
 savevm.c                      |  2 +-
 trace-events                  |  1 +
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index c514dd4..6775747 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -41,6 +41,13 @@ struct MigrationParams {
     bool shared;
 };
 
+/* Commands sent on the return path from destination to source */
+enum mig_rpcomm_cmd {
+    MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
+    MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
+    MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
+};
+
 typedef struct MigrationState MigrationState;
 
 /* State for the incoming migration */
@@ -48,6 +55,7 @@ struct MigrationIncomingState {
     QEMUFile *file;
 
     QEMUFile *return_path;
+    QemuMutex      rp_mutex;    /* We send replies from multiple threads */
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
@@ -169,6 +177,15 @@ int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
 
+/* Sending on the return path - generic and then for each message type */
+void migrate_send_rp_message(MigrationIncomingState *mis,
+                             enum mig_rpcomm_cmd cmd,
+                             uint16_t len, uint8_t *data);
+void migrate_send_rp_shut(MigrationIncomingState *mis,
+                          uint32_t value);
+void migrate_send_rp_pong(MigrationIncomingState *mis,
+                          uint32_t value);
+
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_load_hook(QEMUFile *f, uint64_t flags);
diff --git a/migration/migration.c b/migration/migration.c
index a36ea65..80d234c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
 {
     mis_current = g_malloc0(sizeof(MigrationIncomingState));
     mis_current->file = f;
+    qemu_mutex_init(&mis_current->rp_mutex);
 
     return mis_current;
 }
@@ -88,6 +89,50 @@ void migration_incoming_state_destroy(void)
     mis_current = NULL;
 }
 
+/*
+ * Send a message on the return channel back to the source
+ * of the migration.
+ */
+void migrate_send_rp_message(MigrationIncomingState *mis,
+                             enum mig_rpcomm_cmd cmd,
+                             uint16_t len, uint8_t *data)
+{
+    trace_migrate_send_rp_message((int)cmd, len);
+    qemu_mutex_lock(&mis->rp_mutex);
+    qemu_put_be16(mis->return_path, (unsigned int)cmd);
+    qemu_put_be16(mis->return_path, len);
+    qemu_put_buffer(mis->return_path, data, len);
+    qemu_fflush(mis->return_path);
+    qemu_mutex_unlock(&mis->rp_mutex);
+}
+
+/*
+ * Send a 'SHUT' message on the return channel with the given value
+ * to indicate that we've finished with the RP.  A non-zero value
+ * indicates an error.
+ */
+void migrate_send_rp_shut(MigrationIncomingState *mis,
+                          uint32_t value)
+{
+    uint32_t buf;
+
+    buf = cpu_to_be32(value);
+    migrate_send_rp_message(mis, MIG_RP_CMD_SHUT, 4, (uint8_t *)&buf);
+}
+
+/*
+ * Send a 'PONG' message on the return channel with the given value
+ * (normally in response to a 'PING')
+ */
+void migrate_send_rp_pong(MigrationIncomingState *mis,
+                          uint32_t value)
+{
+    uint32_t buf;
+
+    buf = cpu_to_be32(value);
+    migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
+}
+
 void qemu_start_incoming_migration(const char *uri, Error **errp)
 {
     const char *p;
diff --git a/savevm.c b/savevm.c
index d082738..7084d07 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1008,7 +1008,7 @@ static int loadvm_process_command(QEMUFile *f)
                          tmp32);
             return -1;
         }
-        /* migrate_send_rp_pong(mis, tmp32); TODO: gets added later */
+        migrate_send_rp_pong(mis, tmp32);
         break;
 
     default:
diff --git a/trace-events b/trace-events
index 99e00b5..4f3eff8 100644
--- a/trace-events
+++ b/trace-events
@@ -1379,6 +1379,7 @@ migrate_fd_cleanup(void) ""
 migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
+migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
 
 # migration/rdma.c
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (10 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  6:08   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text Dr. David Alan Gilbert (git)
                   ` (32 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Open a return path, and handle messages that are received on it.
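
The two helpers added below are still marked unused; later patches in the
series wire them into the migration code.  As a rough sketch (not part of
this patch, with the surrounding control flow assumed rather than taken
from the series), the intended source-side shape is:

  /* Sketch only: intended source-side lifecycle of the return path */
  static void example_source_side(MigrationState *ms)
  {
      if (open_outgoing_return_path(ms)) {
          /* no return path available; decide whether to continue without */
      }

      /* ... the migration thread pumps the outgoing stream as usual ... */

      /* Make sure the RP thread has exited before cleaning up; this also
       * shuts the RP down if the main stream already hit an error. */
      await_outgoing_return_path_close(ms);
  }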

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |   8 ++
 migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
 trace-events                  |  13 +++
 3 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 6775747..5242ead 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -73,6 +73,14 @@ struct MigrationState
 
     int state;
     MigrationParams params;
+
+    /* State related to return path */
+    struct {
+        QEMUFile     *file;
+        QemuThread    rp_thread;
+        bool          error;
+    } rp_state;
+
     double mbps;
     int64_t total_time;
     int64_t downtime;
diff --git a/migration/migration.c b/migration/migration.c
index 80d234c..34cd4fe 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
     return head;
 }
 
+/*
+ * Return true if we're already in the middle of a migration
+ * (i.e. any of the active or setup states)
+ */
+static bool migration_already_active(MigrationState *ms)
+{
+    switch (ms->state) {
+    case MIG_STATE_ACTIVE:
+    case MIG_STATE_SETUP:
+        return true;
+
+    default:
+        return false;
+
+    }
+}
+
 static void get_xbzrle_cache_stats(MigrationInfo *info)
 {
     if (migrate_use_xbzrle()) {
@@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
     }
 }
 
+static void migrate_fd_cleanup_src_rp(MigrationState *ms)
+{
+    QEMUFile *rp = ms->rp_state.file;
+
+    /*
+     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
+     * cleaned up from a few threads; make sure not to do it twice in parallel
+     */
+    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);
+    if (rp) {
+        trace_migrate_fd_cleanup_src_rp();
+        qemu_fclose(rp);
+    }
+}
+
 static void migrate_fd_cleanup(void *opaque)
 {
     MigrationState *s = opaque;
@@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
     qemu_bh_delete(s->cleanup_bh);
     s->cleanup_bh = NULL;
 
+    migrate_fd_cleanup_src_rp(s);
+
     if (s->file) {
         trace_migrate_fd_cleanup();
         qemu_mutex_unlock_iothread();
@@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
     QEMUFile *f = migrate_get_current()->file;
     trace_migrate_fd_cancel();
 
+    if (s->rp_state.file) {
+        /* shut down the rp socket, which causes the rp thread to exit */
+        qemu_file_shutdown(s->rp_state.file);
+    }
+
     do {
         old_state = s->state;
         if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
@@ -658,8 +697,145 @@ int64_t migrate_xbzrle_cache_size(void)
     return s->xbzrle_cache_size;
 }
 
-/* migration thread support */
+/*
+ * Something bad happened to the RP stream, mark an error
+ * The caller shall print something to indicate why
+ */
+static void source_return_path_bad(MigrationState *s)
+{
+    s->rp_state.error = true;
+    migrate_fd_cleanup_src_rp(s);
+}
+
+/*
+ * Handles messages sent on the return path towards the source VM
+ *
+ */
+static void *source_return_path_thread(void *opaque)
+{
+    MigrationState *ms = opaque;
+    QEMUFile *rp = ms->rp_state.file;
+    uint16_t expected_len, header_len, header_com;
+    const int max_len = 512;
+    uint8_t buf[max_len];
+    uint32_t tmp32;
+    int res;
+
+    trace_source_return_path_thread_entry();
+    while (rp && !qemu_file_get_error(rp) &&
+        migration_already_active(ms)) {
+        trace_source_return_path_thread_loop_top();
+        header_com = qemu_get_be16(rp);
+        header_len = qemu_get_be16(rp);
+
+        switch (header_com) {
+        case MIG_RP_CMD_SHUT:
+        case MIG_RP_CMD_PONG:
+            expected_len = 4;
+            break;
+
+        default:
+            error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
+                    header_com, header_len);
+            source_return_path_bad(ms);
+            goto out;
+        }
 
+        if (header_len > expected_len) {
+            error_report("RP: Received command 0x%04x with"
+                    "incorrect length %d expecting %d",
+                    header_com, header_len,
+                    expected_len);
+            source_return_path_bad(ms);
+            goto out;
+        }
+
+        /* We know we've got a valid header by this point */
+        res = qemu_get_buffer(rp, buf, header_len);
+        if (res != header_len) {
+            trace_source_return_path_thread_failed_read_cmd_data();
+            source_return_path_bad(ms);
+            goto out;
+        }
+
+        /* OK, we have the command and the data */
+        switch (header_com) {
+        case MIG_RP_CMD_SHUT:
+            tmp32 = be32_to_cpup((uint32_t *)buf);
+            trace_source_return_path_thread_shut(tmp32);
+            if (tmp32) {
+                error_report("RP: Sibling indicated error %d", tmp32);
+                source_return_path_bad(ms);
+            }
+            /*
+             * We'll let the main thread deal with closing the RP;
+             * we could do a shutdown(2) on it, but we're the only user
+             * anyway, so there's nothing gained.
+             */
+            goto out;
+
+        case MIG_RP_CMD_PONG:
+            tmp32 = be32_to_cpup((uint32_t *)buf);
+            trace_source_return_path_thread_pong(tmp32);
+            break;
+
+        default:
+            /* This shouldn't happen because we should catch this above */
+            trace_source_return_path_bad_header_com();
+        }
+        /* Latest command processed, now leave a gap for the next one */
+        header_com = MIG_RP_CMD_INVALID;
+    }
+    if (rp && qemu_file_get_error(rp)) {
+        trace_source_return_path_thread_bad_end();
+        source_return_path_bad(ms);
+    }
+
+    trace_source_return_path_thread_end();
+out:
+    return NULL;
+}
+
+__attribute__ (( unused )) /* Until later in patch series */
+static int open_outgoing_return_path(MigrationState *ms)
+{
+
+    ms->rp_state.file = qemu_file_get_return_path(ms->file);
+    if (!ms->rp_state.file) {
+        return -1;
+    }
+
+    trace_open_outgoing_return_path();
+    qemu_thread_create(&ms->rp_state.rp_thread, "return path",
+                       source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
+
+    trace_open_outgoing_return_path_continue();
+
+    return 0;
+}
+
+__attribute__ (( unused )) /* Until later in patch series */
+static void await_outgoing_return_path_close(MigrationState *ms)
+{
+    /*
+     * If this is a normal exit then the destination will send a SHUT and the
+     * rp_thread will exit; however, if there's an error we need to cause
+     * it to exit, which we can do with a shutdown.
+     * (canceling must also shutdown to stop us getting stuck here if
+     * the destination died at just the wrong place)
+     */
+    if (qemu_file_get_error(ms->file) && ms->rp_state.file) {
+        qemu_file_shutdown(ms->rp_state.file);
+    }
+    trace_await_outgoing_return_path_joining();
+    qemu_thread_join(&ms->rp_state.rp_thread);
+    trace_await_outgoing_return_path_close();
+}
+
+/*
+ * Master migration thread on the source VM.
+ * It drives the migration and pumps the data down the outgoing channel.
+ */
 static void *migration_thread(void *opaque)
 {
     MigrationState *s = opaque;
diff --git a/trace-events b/trace-events
index 4f3eff8..1951b25 100644
--- a/trace-events
+++ b/trace-events
@@ -1374,12 +1374,25 @@ flic_no_device_api(int err) "flic: no Device Contral API support %d"
 flic_reset_failed(int err) "flic: reset failed %d"
 
 # migration.c
+await_outgoing_return_path_close(void) ""
+await_outgoing_return_path_joining(void) ""
 migrate_set_state(int new_state) "new state %d"
 migrate_fd_cleanup(void) ""
+migrate_fd_cleanup_src_rp(void) ""
 migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
+open_outgoing_return_path(void) ""
+open_outgoing_return_path_continue(void) ""
+source_return_path_thread_bad_end(void) ""
+source_return_path_bad_header_com(void) ""
+source_return_path_thread_end(void) ""
+source_return_path_thread_entry(void) ""
+source_return_path_thread_failed_read_cmd_data(void) ""
+source_return_path_thread_loop_top(void) ""
+source_return_path_thread_pong(uint32_t val) "%x"
+source_return_path_thread_shut(uint32_t val) "%x"
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
 
 # migration/rdma.c
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (11 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  6:11   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState Dr. David Alan Gilbert (git)
                   ` (31 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Lines that consist entirely of the expected value are omitted, so the
output can be quite compact depending on the circumstances.
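
As a usage note (not from the patch itself): passing NULL dumps the
migration bitmap, and 'expected' selects which value is treated as
uninteresting.  For example, while debugging one might call:

  /* Sketch: dump the migration bitmap, printing only those lines that
   * contain at least one clear bit (i.e. we expect it to be mostly 1s) */
  ram_debug_dump_bitmap(NULL, true);

Each printed line is a page offset followed by up to 128 '1'/'.'
characters, one per target page.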

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                   | 39 +++++++++++++++++++++++++++++++++++++++
 include/migration/migration.h |  1 +
 2 files changed, 40 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index 91645cc..fe0df0d 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -776,6 +776,45 @@ static void reset_ram_globals(void)
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
 
+/*
+ * 'expected' is the value you expect the bitmap mostly to be full of;
+ * lines that are entirely this value are not printed.
+ * If 'todump' is NULL the migration bitmap is dumped.
+ */
+void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
+{
+    int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
+
+    int64_t cur;
+    int64_t linelen = 128;
+    char linebuf[129];
+
+    if (!todump) {
+        todump = migration_bitmap;
+    }
+
+    for (cur = 0; cur < ram_pages; cur += linelen) {
+        int64_t curb;
+        bool found = false;
+        /*
+         * Last line; catch the case where the line length
+         * is longer than remaining ram
+         */
+        if (cur+linelen > ram_pages) {
+            linelen = ram_pages - cur;
+        }
+        for (curb = 0; curb < linelen; curb++) {
+            bool thisbit = test_bit(cur+curb, todump);
+            linebuf[curb] = thisbit ? '1' : '.';
+            found = found || (thisbit != expected);
+        }
+        if (found) {
+            linebuf[curb] = '\0';
+            fprintf(stderr,  "0x%08" PRIx64 " : %s\n", cur, linebuf);
+        }
+    }
+}
+
 static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMBlock *block;
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 5242ead..3776e86 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -156,6 +156,7 @@ uint64_t xbzrle_mig_pages_cache_miss(void);
 double xbzrle_mig_cache_miss_rate(void);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
+void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
 
 /**
  * @migrate_add_blocker - prevent migration from proceeding
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (12 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-10  6:19   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops Dr. David Alan Gilbert (git)
                   ` (30 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

In postcopy the loadvm_handlers list needs to be used by a couple of
different instances of the loadvm loop/routine, so it can no longer be
local to qemu_loadvm_state.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  5 +++++
 include/migration/vmstate.h   |  2 ++
 include/qemu/typedefs.h       |  1 +
 migration/migration.c         |  2 ++
 savevm.c                      | 28 ++++++++++++++++------------
 5 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 3776e86..751caa0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -50,10 +50,15 @@ enum mig_rpcomm_cmd {
 
 typedef struct MigrationState MigrationState;
 
+typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *file;
 
+    /* See savevm.c */
+    LoadStateEntry_Head loadvm_handlers;
+
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
 };
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index c20f2d1..18da207 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -797,6 +797,8 @@ extern const VMStateInfo vmstate_info_bitmap;
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
+void loadvm_free_handlers(MigrationIncomingState *mis);
+
 int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
                        void *opaque, int version_id);
 void vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 74dfad3..6fdcbcd 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -31,6 +31,7 @@ typedef struct I2CBus I2CBus;
 typedef struct I2SCodec I2SCodec;
 typedef struct ISABus ISABus;
 typedef struct ISADevice ISADevice;
+typedef struct LoadStateEntry LoadStateEntry;
 typedef struct MACAddr MACAddr;
 typedef struct MachineClass MachineClass;
 typedef struct MachineState MachineState;
diff --git a/migration/migration.c b/migration/migration.c
index 34cd4fe..4592060 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
 {
     mis_current = g_malloc0(sizeof(MigrationIncomingState));
     mis_current->file = f;
+    QLIST_INIT(&mis_current->loadvm_handlers);
     qemu_mutex_init(&mis_current->rp_mutex);
 
     return mis_current;
@@ -85,6 +86,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
 
 void migration_incoming_state_destroy(void)
 {
+    loadvm_free_handlers(mis_current);
     g_free(mis_current);
     mis_current = NULL;
 }
diff --git a/savevm.c b/savevm.c
index 7084d07..f42713d 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1019,18 +1019,26 @@ static int loadvm_process_command(QEMUFile *f)
     return 0;
 }
 
-typedef struct LoadStateEntry {
+struct LoadStateEntry {
     QLIST_ENTRY(LoadStateEntry) entry;
     SaveStateEntry *se;
     int section_id;
     int version_id;
-} LoadStateEntry;
+};
 
-int qemu_loadvm_state(QEMUFile *f)
+void loadvm_free_handlers(MigrationIncomingState *mis)
 {
-    QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
-        QLIST_HEAD_INITIALIZER(loadvm_handlers);
     LoadStateEntry *le, *new_le;
+
+    QLIST_FOREACH_SAFE(le, &mis->loadvm_handlers, entry, new_le) {
+        QLIST_REMOVE(le, entry);
+        g_free(le);
+    }
+}
+
+int qemu_loadvm_state(QEMUFile *f)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     uint8_t section_type;
     unsigned int v;
@@ -1061,6 +1069,7 @@ int qemu_loadvm_state(QEMUFile *f)
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
         SaveStateEntry *se;
+        LoadStateEntry *le;
         char idstr[256];
 
         trace_qemu_loadvm_state_section(section_type);
@@ -1102,7 +1111,7 @@ int qemu_loadvm_state(QEMUFile *f)
             le->se = se;
             le->section_id = section_id;
             le->version_id = version_id;
-            QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
+            QLIST_INSERT_HEAD(&mis->loadvm_handlers, le, entry);
 
             ret = vmstate_load(f, le->se, le->version_id);
             if (ret < 0) {
@@ -1116,7 +1125,7 @@ int qemu_loadvm_state(QEMUFile *f)
             section_id = qemu_get_be32(f);
 
             trace_qemu_loadvm_state_section_partend(section_id);
-            QLIST_FOREACH(le, &loadvm_handlers, entry) {
+            QLIST_FOREACH(le, &mis->loadvm_handlers, entry) {
                 if (le->section_id == section_id) {
                     break;
                 }
@@ -1152,11 +1161,6 @@ int qemu_loadvm_state(QEMUFile *f)
     ret = 0;
 
 out:
-    QLIST_FOREACH_SAFE(le, &loadvm_handlers, entry, new_le) {
-        QLIST_REMOVE(le, entry);
-        g_free(le);
-    }
-
     if (ret == 0) {
         ret = qemu_file_get_error(f);
     }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (13 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-12  6:11   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram Dr. David Alan Gilbert (git)
                   ` (29 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Postcopy needs to have two migration streams loading concurrently;
one from memory (with the device state) and the other from the fd
with the memory transactions.

Split the core of qemu_loadvm_state out so we can use it for both.

Allow the inner loadvm loop to quit, and to signal whether its parent
should also quit.
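
A rough sketch (not part of the patch) of how the ORable exit codes are
meant to compose once a nested loop exists; the handler shown here is
purely illustrative:

  /* Sketch: a command handler running in an inner loadvm loop can ask
   * just its own loop to stop, or both loops, via LOADVM_QUIT_* flags */
  static int example_inner_handler(void)
  {
      return LOADVM_QUIT_LOOP | LOADVM_QUIT_PARENT; /* stop inner + outer */
  }

  /* qemu_loadvm_state_main() strips LOADVM_QUIT_PARENT before returning
   * and turns it into LOADVM_QUIT_LOOP, so its own caller just sees a
   * plain "quit this loop" request. */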

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 savevm.c     | 106 ++++++++++++++++++++++++++++++++++++-----------------------
 trace-events |   4 +++
 2 files changed, 69 insertions(+), 41 deletions(-)

diff --git a/savevm.c b/savevm.c
index f42713d..4b619da 100644
--- a/savevm.c
+++ b/savevm.c
@@ -951,6 +951,16 @@ static SaveStateEntry *find_se(const char *idstr, int instance_id)
     return NULL;
 }
 
+/* ORable flags that control the (potentially nested) loadvm_state loops */
+enum LoadVMExitCodes {
+    /* Quit the loop level that received this command */
+    LOADVM_QUIT_LOOP     =  1,
+    /* Quit this loop and our parent */
+    LOADVM_QUIT_PARENT   =  2,
+};
+
+static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
+
 static int loadvm_process_command_simple_lencheck(const char *name,
                                                   unsigned int actual,
                                                   unsigned int expected)
@@ -967,6 +977,8 @@ static int loadvm_process_command_simple_lencheck(const char *name,
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * negative return on error (will issue error message)
+ * 0   just a normal return
+ * 1   All good, but exit the loop
  */
 static int loadvm_process_command(QEMUFile *f)
 {
@@ -1036,36 +1048,13 @@ void loadvm_free_handlers(MigrationIncomingState *mis)
     }
 }
 
-int qemu_loadvm_state(QEMUFile *f)
+static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
 {
-    MigrationIncomingState *mis = migration_incoming_get_current();
-    Error *local_err = NULL;
     uint8_t section_type;
-    unsigned int v;
     int ret;
+    int exitcode = 0;
 
-    if (qemu_savevm_state_blocked(&local_err)) {
-        error_report("%s", error_get_pretty(local_err));
-        error_free(local_err);
-        return -EINVAL;
-    }
-
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC) {
-        error_report("Not a migration stream");
-        return -EINVAL;
-    }
-
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        error_report("SaveVM v2 format is obsolete and don't work anymore");
-        return -ENOTSUP;
-    }
-    if (v != QEMU_VM_FILE_VERSION) {
-        error_report("Unsupported migration stream version");
-        return -ENOTSUP;
-    }
-
+    trace_qemu_loadvm_state_main();
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
         SaveStateEntry *se;
@@ -1093,16 +1082,14 @@ int qemu_loadvm_state(QEMUFile *f)
             if (se == NULL) {
                 error_report("Unknown savevm section or instance '%s' %d",
                              idstr, instance_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             /* Validate version */
             if (version_id > se->version_id) {
                 error_report("savevm: unsupported version %d for '%s' v%d",
                              version_id, idstr, se->version_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             /* Add entry */
@@ -1117,7 +1104,7 @@ int qemu_loadvm_state(QEMUFile *f)
             if (ret < 0) {
                 error_report("error while loading state for instance 0x%x of"
                              " device '%s'", instance_id, idstr);
-                goto out;
+                return ret;
             }
             break;
         case QEMU_VM_SECTION_PART:
@@ -1132,36 +1119,73 @@ int qemu_loadvm_state(QEMUFile *f)
             }
             if (le == NULL) {
                 error_report("Unknown savevm section %d", section_id);
-                ret = -EINVAL;
-                goto out;
+                return -EINVAL;
             }
 
             ret = vmstate_load(f, le->se, le->version_id);
             if (ret < 0) {
                 error_report("error while loading state section id %d(%s)",
                              section_id, le->se->idstr);
-                goto out;
+                return ret;
             }
             break;
         case QEMU_VM_COMMAND:
             ret = loadvm_process_command(f);
-            if (ret < 0) {
-                goto out;
+            trace_qemu_loadvm_state_section_command(ret);
+            if ((ret < 0) || (ret & LOADVM_QUIT_LOOP)) {
+                return ret;
             }
+            exitcode |= ret; /* Lets us pass flags up to the parent */
             break;
         default:
             error_report("Unknown savevm section type %d", section_type);
-            ret = -EINVAL;
-            goto out;
+            return -EINVAL;
         }
     }
 
-    cpu_synchronize_all_post_init();
+    if (exitcode & LOADVM_QUIT_PARENT) {
+        trace_qemu_loadvm_state_main_quit_parent();
+        exitcode &= ~LOADVM_QUIT_PARENT;
+        exitcode |= LOADVM_QUIT_LOOP;
+    }
+
+    return exitcode;
+}
+
+int qemu_loadvm_state(QEMUFile *f)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    Error *local_err = NULL;
+    unsigned int v;
+    int ret;
+
+    if (qemu_savevm_state_blocked(&local_err)) {
+        error_report("%s", error_get_pretty(local_err));
+        error_free(local_err);
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_MAGIC) {
+        error_report("Not a migration stream");
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+        error_report("SaveVM v2 format is obsolete and don't work anymore");
+        return -ENOTSUP;
+    }
+    if (v != QEMU_VM_FILE_VERSION) {
+        error_report("Unsupported migration stream version");
+        return -ENOTSUP;
+    }
 
-    ret = 0;
+    ret = qemu_loadvm_state_main(f, mis);
 
-out:
+    trace_qemu_loadvm_state_post_main(ret);
     if (ret == 0) {
+        cpu_synchronize_all_post_init();
         ret = qemu_file_get_error(f);
     }
 
diff --git a/trace-events b/trace-events
index 1951b25..4ff55fe 100644
--- a/trace-events
+++ b/trace-events
@@ -1165,7 +1165,11 @@ vmware_setmode(uint32_t w, uint32_t h, uint32_t bpp) "%dx%d @ %d bpp"
 
 # savevm.c
 qemu_loadvm_state_section(unsigned int section_type) "%d"
+qemu_loadvm_state_section_command(int ret) "%d"
 qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
+qemu_loadvm_state_main(void) ""
+qemu_loadvm_state_main_quit_parent(void) ""
+qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
 loadvm_process_command_ping(uint32_t val) "%x"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram.
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (14 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-12  6:14   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages Dr. David Alan Gilbert (git)
                   ` (28 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 include/migration/migration.h | 1 +
 migration/migration.c         | 9 +++++++++
 qapi-schema.json              | 7 ++++++-
 3 files changed, 16 insertions(+), 1 deletion(-)
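
For reference (not part of the patch): migration capabilities default to
off, so this one is enabled through the existing migrate-set-capabilities
QMP command (or its HMP equivalent) using the name "x-postcopy-ram".
Within QEMU, postcopy-only code paths can then be gated on the new helper;
the surrounding code here is only an assumption about later patches:

  /* Sketch: only negotiate postcopy if the capability was enabled */
  if (migrate_postcopy_ram()) {
      /* ... advise the destination that postcopy may follow ... */
  }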

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 751caa0..f94af5b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -177,6 +177,7 @@ void migrate_add_blocker(Error *reason);
  */
 void migrate_del_blocker(Error *reason);
 
+bool migrate_postcopy_ram(void);
 bool migrate_rdma_pin_all(void);
 bool migrate_zero_blocks(void);
 
diff --git a/migration/migration.c b/migration/migration.c
index 4592060..434864a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -663,6 +663,15 @@ bool migrate_rdma_pin_all(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
 }
 
+bool migrate_postcopy_ram(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_X_POSTCOPY_RAM];
+}
+
 bool migrate_auto_converge(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index e16f8eb..a8af1cb 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -494,10 +494,15 @@
 # @auto-converge: If enabled, QEMU will automatically throttle down the guest
 #          to speed up convergence of RAM migration. (since 1.6)
 #
+# @x-postcopy-ram: Start executing on the migration target before all of RAM has
+#          been migrated, pulling the remaining pages along as needed. NOTE: If
+#          the migration fails during postcopy the VM will fail.  (since 2.3)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks'] }
+  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
+           'x-postcopy-ram'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (15 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-12  9:30   ` David Gibson
  2015-03-28 15:43   ` Paolo Bonzini
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream Dr. David Alan Gilbert (git)
                   ` (27 subsequent siblings)
  44 siblings, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The state of the postcopy process is managed via a series of messages:
   * Add wrappers and handlers for sending/receiving these messages
   * Add a state variable that tracks the current state of postcopy
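
A rough sketch (not part of the patch) of the order in which the source is
expected to emit these commands, using the senders added here; the
surrounding control flow is an assumption about later patches in the
series:

  /* Sketch: expected command ordering on the outgoing stream.  The
   * destination's state machine (NONE -> ADVISE -> LISTENING -> RUNNING)
   * enforces roughly this order. */
  static void example_postcopy_command_sequence(QEMUFile *f)
  {
      qemu_savevm_send_postcopy_advise(f);  /* before any page transfers */

      /* ... precopy RAM passes ... */

      /* 0..many discard messages for pages re-dirtied during precopy:
       * qemu_savevm_send_postcopy_ram_discard(f, name, len, offset,
       *                                       addrlist, masklist); */

      qemu_savevm_send_postcopy_listen(f);  /* dest must now accept pages */
      qemu_savevm_send_postcopy_run(f);     /* dest starts (or stays paused) */

      /* ... demand-fetched page transfers ... */

      qemu_savevm_send_postcopy_end(f, 0);  /* status 0 = success */
  }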

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  15 ++
 include/sysemu/sysemu.h       |  23 +++
 migration/migration.c         |  13 ++
 savevm.c                      | 325 ++++++++++++++++++++++++++++++++++++++++++
 trace-events                  |  11 ++
 5 files changed, 387 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index f94af5b..81cd1f2 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -52,6 +52,14 @@ typedef struct MigrationState MigrationState;
 
 typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
 
+typedef enum {
+    POSTCOPY_INCOMING_NONE = 0,  /* Initial state - no postcopy */
+    POSTCOPY_INCOMING_ADVISE,
+    POSTCOPY_INCOMING_LISTENING,
+    POSTCOPY_INCOMING_RUNNING,
+    POSTCOPY_INCOMING_END
+} PostcopyState;
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *file;
@@ -59,6 +67,8 @@ struct MigrationIncomingState {
     /* See savevm.c */
     LoadStateEntry_Head loadvm_handlers;
 
+    PostcopyState postcopy_state;
+
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
 };
@@ -219,4 +229,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                              ram_addr_t offset, size_t size,
                              int *bytes_sent);
 
+PostcopyState postcopy_state_get(MigrationIncomingState *mis);
+
+/* Set the state and return the old state */
+PostcopyState postcopy_state_set(MigrationIncomingState *mis,
+                                 PostcopyState new_state);
 #endif
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8da879f..d6a6d51 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -87,6 +87,18 @@ enum qemu_vm_cmd {
     MIG_CMD_INVALID = 0,       /* Must be 0 */
     MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
     MIG_CMD_PING,              /* Request a PONG on the RP */
+
+    MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
+                                      warn we might want to do PC */
+    MIG_CMD_POSTCOPY_LISTEN,       /* Start listening for incoming
+                                      pages as it's running. */
+    MIG_CMD_POSTCOPY_RUN,          /* Start execution */
+    MIG_CMD_POSTCOPY_END,          /* Postcopy is finished. */
+
+    MIG_CMD_POSTCOPY_RAM_DISCARD,  /* A list of pages to discard that
+                                      were previously sent during
+                                      precopy but are dirty. */
+
 };
 
 bool qemu_savevm_state_blocked(Error **errp);
@@ -101,6 +113,17 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
                               uint16_t len, uint8_t *data);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
 void qemu_savevm_send_open_return_path(QEMUFile *f);
+void qemu_savevm_send_postcopy_advise(QEMUFile *f);
+void qemu_savevm_send_postcopy_listen(QEMUFile *f);
+void qemu_savevm_send_postcopy_run(QEMUFile *f);
+void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status);
+
+void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
+                                           uint16_t len, uint8_t offset,
+                                           uint64_t *addrlist,
+                                           uint32_t *masklist);
+
+
 int qemu_loadvm_state(QEMUFile *f);
 
 /* SLIRP */
diff --git a/migration/migration.c b/migration/migration.c
index 434864a..957115a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -971,3 +971,16 @@ void migrate_fd_connect(MigrationState *s)
     qemu_thread_create(&s->thread, "migration", migration_thread, s,
                        QEMU_THREAD_JOINABLE);
 }
+
+PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
+{
+    return atomic_fetch_add(&mis->postcopy_state, 0);
+}
+
+/* Set the state and return the old state */
+PostcopyState postcopy_state_set(MigrationIncomingState *mis,
+                                 PostcopyState new_state)
+{
+    return atomic_xchg(&mis->postcopy_state, new_state);
+}
+
diff --git a/savevm.c b/savevm.c
index 4b619da..e31ccb0 100644
--- a/savevm.c
+++ b/savevm.c
@@ -39,6 +39,7 @@
 #include "exec/memory.h"
 #include "qmp-commands.h"
 #include "trace.h"
+#include "qemu/bitops.h"
 #include "qemu/iov.h"
 #include "block/snapshot.h"
 #include "block/qapi.h"
@@ -635,6 +636,90 @@ void qemu_savevm_send_open_return_path(QEMUFile *f)
     qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
 }
 
+/* Send prior to any postcopy transfer */
+void qemu_savevm_send_postcopy_advise(QEMUFile *f)
+{
+    uint64_t tmp[2];
+    tmp[0] = cpu_to_be64(getpagesize());
+    tmp[1] = cpu_to_be64(1ul << qemu_target_page_bits());
+
+    trace_qemu_savevm_send_postcopy_advise();
+    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_ADVISE, 16, (uint8_t *)tmp);
+}
+
+/* Prior to running, to cause pages that have been dirtied after precopy
+ * started to be discarded on the destination.
+ * CMD_POSTCOPY_RAM_DISCARD consist of:
+ *  3 byte header (filled in by qemu_savevm_send_postcopy_ram_discard)
+ *      byte   version (0)
+ *      byte   offset to be subtracted from each page address to deal with
+ *             RAMBlocks that don't start on a mask word boundary.
+ *      byte   Length of name field
+ *  n x byte   RAM block name (NOT 0 terminated)
+ *  n x
+ *      be64   Page addresses for start of an invalidation range
+ *      be32   mask of 32 pages, '1' to discard
+ *
+ *  Hopefully this is pretty sparse so we don't get too many entries,
+ *  and using the mask should deal with most pagesize differences by
+ *  just ending up as a single full mask
+ *
+ * The mask is always 32bits irrespective of the long size
+ *
+ *  name:  RAMBlock name that these entries are part of
+ *  len: Number of page entries
+ *  addrlist: 'len' addresses
+ *  masklist: 'len' masks (corresponding to the addresses)
+ */
+void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
+                                           uint16_t len, uint8_t offset,
+                                           uint64_t *addrlist,
+                                           uint32_t *masklist)
+{
+    uint8_t *buf;
+    uint16_t tmplen;
+    uint16_t t;
+
+    trace_qemu_savevm_send_postcopy_ram_discard();
+    buf = g_malloc0(len*12 + strlen(name) + 3);
+    buf[0] = 0; /* Version */
+    buf[1] = offset;
+    assert(strlen(name) < 256);
+    buf[2] = strlen(name);
+    memcpy(buf+3, name, strlen(name));
+    tmplen = 3+strlen(name);
+
+    for (t = 0; t < len; t++) {
+        cpu_to_be64w((uint64_t *)(buf + tmplen), addrlist[t]);
+        tmplen += 8;
+        cpu_to_be32w((uint32_t *)(buf + tmplen), masklist[t]);
+        tmplen += 4;
+    }
+    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RAM_DISCARD, tmplen, buf);
+    g_free(buf);
+}
+
+/* Get the destination into a state where it can receive postcopy data. */
+void qemu_savevm_send_postcopy_listen(QEMUFile *f)
+{
+    trace_savevm_send_postcopy_listen();
+    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_LISTEN, 0, NULL);
+}
+
+/* Kick the destination into running */
+void qemu_savevm_send_postcopy_run(QEMUFile *f)
+{
+    trace_savevm_send_postcopy_run();
+    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RUN, 0, NULL);
+}
+
+/* End of postcopy - with a status byte; 0 is good, anything else is a fail */
+void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status)
+{
+    trace_savevm_send_postcopy_end();
+    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_END, 1, &status);
+}
+
 bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
@@ -961,6 +1046,212 @@ enum LoadVMExitCodes {
 
 static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 
+/* ------ incoming postcopy messages ------ */
+/* 'advise' arrives before any transfers just to tell us that a postcopy
+ * *might* happen - it might be skipped if precopy transferred everything
+ * quickly.
+ */
+static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
+                                         uint64_t remote_hps,
+                                         uint64_t remote_tps)
+{
+    PostcopyState ps = postcopy_state_get(mis);
+    trace_loadvm_postcopy_handle_advise();
+    if (ps != POSTCOPY_INCOMING_NONE) {
+        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
+        return -1;
+    }
+
+    if (remote_hps != getpagesize())  {
+        /*
+         * Some combinations of mismatch are probably possible but it gets
+         * a bit more complicated.  In particular we need to place whole
+         * host pages on the dest at once, and we need to ensure that we
+         * handle dirtying to make sure we never end up sending part of
+         * a hostpage on its own.
+         */
+        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
+                     (int)remote_hps, getpagesize());
+        return -1;
+    }
+
+    if (remote_tps != (1ul << qemu_target_page_bits())) {
+        /*
+         * Again, some differences could be dealt with, but for now keep it
+         * simple.
+         */
+        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
+                     (int)remote_tps, 1 << qemu_target_page_bits());
+        return -1;
+    }
+
+    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
+
+    return 0;
+}
+
+/* After postcopy we will be told to throw some pages away since they're
+ * dirty and will have to be demand fetched.  Must happen before CPU is
+ * started.
+ * There can be 0..many of these messages, each encoding multiple pages.
+ * Bits set in the message represent a page in the source VM's bitmap, but
+ * since the guest/target page sizes can differ between source and
+ * destination we have to convert.
+ */
+static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
+                                              uint16_t len)
+{
+    int tmp;
+    unsigned int first_bit_offset;
+    char ramid[256];
+    PostcopyState ps = postcopy_state_get(mis);
+
+    trace_loadvm_postcopy_ram_handle_discard();
+
+    if (ps != POSTCOPY_INCOMING_ADVISE) {
+        error_report("CMD_POSTCOPY_RAM_DISCARD in wrong postcopy state (%d)",
+                     ps);
+        return -1;
+    }
+    /* We're expecting a
+     *    3 byte header,
+     *    a RAM ID string
+     *    then at least one 12-byte chunk
+     */
+    if (len < 16) {
+        error_report("CMD_POSTCOPY_RAM_DISCARD invalid length (%d)", len);
+        return -1;
+    }
+
+    tmp = qemu_get_byte(mis->file);
+    if (tmp != 0) {
+        error_report("CMD_POSTCOPY_RAM_DISCARD invalid version (%d)", tmp);
+        return -1;
+    }
+    first_bit_offset = qemu_get_byte(mis->file);
+
+    if (qemu_get_counted_string(mis->file, ramid)) {
+        error_report("CMD_POSTCOPY_RAM_DISCARD Failed to read RAMBlock ID");
+        return -1;
+    }
+
+    len -= 3+strlen(ramid);
+    if (len % 12) {
+        error_report("CMD_POSTCOPY_RAM_DISCARD invalid length (%d)", len);
+        return -1;
+    }
+    while (len) {
+        uint64_t startaddr;
+        uint32_t mask;
+        /*
+         * We now have pairs of address, mask
+         *   Each word of mask is 32 bits, where each bit corresponds to one
+         *   target page.
+         *   RAMBlocks don't necessarily start on word boundaries,
+         *   and the offset in the header indicates the offset into the 1st
+         *   mask word that corresponds to the 1st page of the RAMBlock.
+         */
+        startaddr = qemu_get_be64(mis->file);
+        mask = qemu_get_be32(mis->file);
+
+        len -= 12;
+
+        while (mask) {
+            /* mask= .....?10...0 */
+            /*             ^fs    */
+            int firstset = ctz32(mask);
+
+            /* tmp32=.....?11...1 */
+            /*             ^fs    */
+            uint32_t tmp32 = mask | ((((uint32_t)1)<<firstset)-1);
+
+            /* mask= .?01..10...0 */
+            /*         ^fz ^fs    */
+            int firstzero = cto32(tmp32);
+
+            if ((startaddr == 0) && (firstset < first_bit_offset)) {
+                error_report("CMD_POSTCOPY_RAM_DISCARD bad data; bit set"
+                               " prior to block; block=%s offset=%d"
+                               " firstset=%d", ramid, first_bit_offset,
+                               firstset);
+                return -1;
+            }
+
+            /*
+             * we know there must be at least 1 bit set due to the loop entry
+             * If there is no 0 firstzero will be 32
+             */
+            /* TODO - ram_discard_range gets added in a later patch
+            int ret = ram_discard_range(mis, ramid,
+                                startaddr + firstset - first_bit_offset,
+                                startaddr + (firstzero - 1) - first_bit_offset);
+            ret = -1;
+            if (ret) {
+                return ret;
+            }
+            */
+
+            /* mask= .?0000000000 */
+            /*         ^fz ^fs    */
+            if (firstzero != 32) {
+                mask &= (((uint32_t)-1) << firstzero);
+            } else {
+                mask = 0;
+            }
+        }
+    }
+    trace_loadvm_postcopy_ram_handle_discard_end();
+
+    return 0;
+}
+
+/* After this message we must be able to immediately receive postcopy data */
+static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
+{
+    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_LISTENING);
+    trace_loadvm_postcopy_handle_listen();
+    if (ps != POSTCOPY_INCOMING_ADVISE) {
+        error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
+        return -1;
+    }
+
+    /* TODO start up the postcopy listening thread */
+    return 0;
+}
+
+/* After all discards we can start running and asking for pages */
+static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
+{
+    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_RUNNING);
+    trace_loadvm_postcopy_handle_run();
+    if (ps != POSTCOPY_INCOMING_LISTENING) {
+        error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
+        return -1;
+    }
+
+    if (autostart) {
+        /* Hold onto your hats, starting the CPU */
+        vm_start();
+    } else {
+        /* leave it paused and let management decide when to start the CPU */
+        runstate_set(RUN_STATE_PAUSED);
+    }
+
+    return 0;
+}
+
+/* The end - with a byte from the source which can tell us to fail. */
+static int loadvm_postcopy_handle_end(MigrationIncomingState *mis)
+{
+    PostcopyState ps = postcopy_state_get(mis);
+    trace_loadvm_postcopy_handle_end();
+    if (ps == POSTCOPY_INCOMING_NONE) {
+        error_report("CMD_POSTCOPY_END in wrong postcopy state (%d)", ps);
+        return -1;
+    }
+    return -1; /* TODO - expecting 1 byte good/fail */
+}
+
 static int loadvm_process_command_simple_lencheck(const char *name,
                                                   unsigned int actual,
                                                   unsigned int expected)
@@ -986,6 +1277,7 @@ static int loadvm_process_command(QEMUFile *f)
     uint16_t com;
     uint16_t len;
     uint32_t tmp32;
+    uint64_t tmp64a, tmp64b;
 
     com = qemu_get_be16(f);
     len = qemu_get_be16(f);
@@ -1023,6 +1315,39 @@ static int loadvm_process_command(QEMUFile *f)
         migrate_send_rp_pong(mis, tmp32);
         break;
 
+    case MIG_CMD_POSTCOPY_ADVISE:
+        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_ADVISE",
+                                                   len, 16)) {
+            return -1;
+        }
+        tmp64a = qemu_get_be64(f); /* hps */
+        tmp64b = qemu_get_be64(f); /* tps */
+        return loadvm_postcopy_handle_advise(mis, tmp64a, tmp64b);
+
+    case MIG_CMD_POSTCOPY_LISTEN:
+        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_LISTEN",
+                                                   len, 0)) {
+            return -1;
+        }
+        return loadvm_postcopy_handle_listen(mis);
+
+    case MIG_CMD_POSTCOPY_RUN:
+        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_RUN",
+                                                   len, 0)) {
+            return -1;
+        }
+        return loadvm_postcopy_handle_run(mis);
+
+    case MIG_CMD_POSTCOPY_END:
+        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_END",
+                                                   len, 1)) {
+            return -1;
+        }
+        return loadvm_postcopy_handle_end(mis);
+
+    case MIG_CMD_POSTCOPY_RAM_DISCARD:
+        return loadvm_postcopy_ram_handle_discard(mis, len);
+
     default:
         error_report("VM_COMMAND 0x%x unknown (len 0x%x)", com, len);
         return -1;
diff --git a/trace-events b/trace-events
index 4ff55fe..050f553 100644
--- a/trace-events
+++ b/trace-events
@@ -1171,11 +1171,22 @@ qemu_loadvm_state_main(void) ""
 qemu_loadvm_state_main_quit_parent(void) ""
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
+loadvm_postcopy_handle_advise(void) ""
+loadvm_postcopy_handle_end(void) ""
+loadvm_postcopy_handle_listen(void) ""
+loadvm_postcopy_handle_run(void) ""
+loadvm_postcopy_ram_handle_discard(void) ""
+loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
 loadvm_process_command_ping(uint32_t val) "%x"
+qemu_savevm_send_postcopy_advise(void) ""
+qemu_savevm_send_postcopy_ram_discard(void) ""
 savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
 savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
 savevm_send_ping(uint32_t val) "%x"
+savevm_send_postcopy_end(void) ""
+savevm_send_postcopy_listen(void) ""
+savevm_send_postcopy_run(void) ""
 savevm_state_begin(void) ""
 savevm_state_header(void) ""
 savevm_state_iterate(void) ""
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (16 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  0:55   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 19/45] migrate_init: Call from savevm Dr. David Alan Gilbert (git)
                   ` (26 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

MIG_CMD_PACKAGED is a migration command that allows a chunk of
migration stream to be sent in one go, and be received by a separate
instance of the loadvm loop without interacting with the rest of the
migration stream.

This is used by postcopy to load device state (from the package)
while loading memory pages from the main stream.
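
For orientation (a sketch, not part of the patch): on the wire the command
is a normal QEMU_VM_COMMAND whose own payload is just the blob length, with
the blob following immediately on the same stream:

  /* Sketch of the layout (all header fields big-endian):
   *
   *   byte    QEMU_VM_COMMAND
   *   be16    MIG_CMD_PACKAGED
   *   be16    4                 (length of the command's own payload)
   *   be32    length            (size of the packaged blob, <= 1 << 24)
   *   length bytes              (a self-contained chunk of migration
   *                              stream, loaded by its own instance of
   *                              the loadvm loop)
   */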

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/sysemu/sysemu.h |  4 +++
 savevm.c                | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
 trace-events            |  4 +++
 3 files changed, 90 insertions(+)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index d6a6d51..e83bf80 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -87,6 +87,7 @@ enum qemu_vm_cmd {
     MIG_CMD_INVALID = 0,       /* Must be 0 */
     MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
     MIG_CMD_PING,              /* Request a PONG on the RP */
+    MIG_CMD_PACKAGED,          /* Send a wrapped stream within this stream */
 
     MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
                                       warn we might want to do PC */
@@ -101,6 +102,8 @@ enum qemu_vm_cmd {
 
 };
 
+#define MAX_VM_CMD_PACKAGED_SIZE (1ul << 24)
+
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
@@ -113,6 +116,7 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
                               uint16_t len, uint8_t *data);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
 void qemu_savevm_send_open_return_path(QEMUFile *f);
+void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb);
 void qemu_savevm_send_postcopy_advise(QEMUFile *f);
 void qemu_savevm_send_postcopy_listen(QEMUFile *f);
 void qemu_savevm_send_postcopy_run(QEMUFile *f);
diff --git a/savevm.c b/savevm.c
index e31ccb0..f65bff3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -636,6 +636,38 @@ void qemu_savevm_send_open_return_path(QEMUFile *f)
     qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
 }
 
+/* We have a buffer of data to send; we don't want all of it to be loaded
+ * by the command itself, so the command contains just the length of the
+ * extra buffer, which we then send straight after it.
+ * TODO: There must be a better way to organise this.
+ */
+void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb)
+{
+    size_t cur_iov;
+    size_t len = qsb_get_length(qsb);
+    uint32_t tmp;
+
+    tmp = cpu_to_be32(len);
+
+    trace_qemu_savevm_send_packaged();
+    qemu_savevm_command_send(f, MIG_CMD_PACKAGED, 4, (uint8_t *)&tmp);
+
+    /* all the data follows (concatenating the iov's) */
+    for (cur_iov = 0; cur_iov < qsb->n_iov; cur_iov++) {
+        /* The iov entries are partially filled */
+        size_t towrite = (qsb->iov[cur_iov].iov_len > len) ?
+                              len :
+                              qsb->iov[cur_iov].iov_len;
+        len -= towrite;
+
+        if (!towrite) {
+            break;
+        }
+
+        qemu_put_buffer(f, qsb->iov[cur_iov].iov_base, towrite);
+    }
+}
+
 /* Send prior to any postcopy transfer */
 void qemu_savevm_send_postcopy_advise(QEMUFile *f)
 {
@@ -1265,6 +1297,48 @@ static int loadvm_process_command_simple_lencheck(const char *name,
     return 0;
 }
 
+/* Immediately following this command is a blob of data containing an embedded
+ * chunk of migration stream; read it and load it.
+ */
+static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
+                                      uint32_t length)
+{
+    int ret;
+    uint8_t *buffer;
+    QEMUSizedBuffer *qsb;
+
+    trace_loadvm_handle_cmd_packaged(length);
+
+    if (length > MAX_VM_CMD_PACKAGED_SIZE) {
+        error_report("Unreasonably large packaged state: %u", length);
+        return -1;
+    }
+    buffer = g_malloc0(length);
+    ret = qemu_get_buffer(mis->file, buffer, (int)length);
+    if (ret != length) {
+        g_free(buffer);
+        error_report("CMD_PACKAGED: Buffer receive fail ret=%d length=%d\n",
+                ret, length);
+        return (ret < 0) ? ret : -EAGAIN;
+    }
+    trace_loadvm_handle_cmd_packaged_received(ret);
+
+    /* Setup a dummy QEMUFile that actually reads from the buffer */
+    qsb = qsb_create(buffer, length);
+    g_free(buffer); /* Because qsb_create copies */
+    if (!qsb) {
+        error_report("Unable to create qsb");
+        return -1;
+    }
+    QEMUFile *packf = qemu_bufopen("r", qsb);
+
+    ret = qemu_loadvm_state_main(packf, mis);
+    trace_loadvm_handle_cmd_packaged_main(ret);
+    qemu_fclose(packf);
+    qsb_free(qsb);
+
+    return ret;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * negative return on error (will issue error message)
@@ -1315,6 +1389,14 @@ static int loadvm_process_command(QEMUFile *f)
         migrate_send_rp_pong(mis, tmp32);
         break;
 
+    case MIG_CMD_PACKAGED:
+        if (loadvm_process_command_simple_lencheck("CMD_PACKAGED",
+                                                   len, 4)) {
+            return -1;
+        }
+        tmp32 = qemu_get_be32(f);
+        return loadvm_handle_cmd_packaged(mis, tmp32);
+
     case MIG_CMD_POSTCOPY_ADVISE:
         if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_ADVISE",
                                                    len, 16)) {
diff --git a/trace-events b/trace-events
index 050f553..cbf995c 100644
--- a/trace-events
+++ b/trace-events
@@ -1171,6 +1171,10 @@ qemu_loadvm_state_main(void) ""
 qemu_loadvm_state_main_quit_parent(void) ""
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
+qemu_savevm_send_packaged(void) ""
+loadvm_handle_cmd_packaged(unsigned int length) "%u"
+loadvm_handle_cmd_packaged_main(int ret) "%d"
+loadvm_handle_cmd_packaged_received(int ret) "%d"
 loadvm_postcopy_handle_advise(void) ""
 loadvm_postcopy_handle_end(void) ""
 loadvm_postcopy_handle_listen(void) ""
-- 
2.1.0


* [Qemu-devel] [PATCH v5 19/45] migrate_init: Call from savevm
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (17 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy Dr. David Alan Gilbert (git)
                   ` (25 subsequent siblings)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Suspend to file is very much like a migration, and it makes life
easier if we have the MigrationState available, so initialise it
in the savevm.c code for suspending.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 include/migration/migration.h | 3 +--
 include/qemu/typedefs.h       | 1 +
 migration/migration.c         | 2 +-
 savevm.c                      | 2 ++
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 81cd1f2..e6a814a 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -48,8 +48,6 @@ enum mig_rpcomm_cmd {
     MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
 };
 
-typedef struct MigrationState MigrationState;
-
 typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
 
 typedef enum {
@@ -146,6 +144,7 @@ int migrate_fd_close(MigrationState *s);
 
 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
+MigrationState *migrate_init(const MigrationParams *params);
 bool migration_in_setup(MigrationState *);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 6fdcbcd..611db46 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -41,6 +41,7 @@ typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionSection MemoryRegionSection;
 typedef struct MigrationIncomingState MigrationIncomingState;
 typedef struct MigrationParams MigrationParams;
+typedef struct MigrationState MigrationState;
 typedef struct Monitor Monitor;
 typedef struct MouseTransformInfo MouseTransformInfo;
 typedef struct MSIMessage MSIMessage;
diff --git a/migration/migration.c b/migration/migration.c
index 957115a..2e6adca 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -493,7 +493,7 @@ bool migration_has_failed(MigrationState *s)
             s->state == MIG_STATE_ERROR);
 }
 
-static MigrationState *migrate_init(const MigrationParams *params)
+MigrationState *migrate_init(const MigrationParams *params)
 {
     MigrationState *s = migrate_get_current();
     int64_t bandwidth_limit = s->bandwidth_limit;
diff --git a/savevm.c b/savevm.c
index f65bff3..df48ba8 100644
--- a/savevm.c
+++ b/savevm.c
@@ -982,6 +982,8 @@ static int qemu_savevm_state(QEMUFile *f)
         .blk = 0,
         .shared = 0
     };
+    MigrationState *ms = migrate_init(&params);
+    ms->file = f;
 
     if (qemu_savevm_state_blocked(NULL)) {
         return -EINVAL;
-- 
2.1.0


* [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (18 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 19/45] migrate_init: Call from savevm Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  1:00   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 21/45] Add Linux userfaultfd header Dr. David Alan Gilbert (git)
                   ` (24 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Modify save_live_pending to return separate postcopiable and
non-postcopiable counts.

Add 'can_postcopy' to allow a device to state whether it supports postcopy.
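
As a rough illustration of how a caller can consume the split counts (a
simplified sketch with made-up types, not the code in this patch):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-in for a savevm handler */
    typedef struct {
        void (*pending)(void *opaque, uint64_t max_size,
                        uint64_t *non_postcopiable, uint64_t *postcopiable);
        bool (*can_postcopy)(void *opaque);
        void *opaque;
    } Handler;

    /*
     * Sum both kinds of pending work.  Once only postcopiable work remains
     * (and the user has asked for it), the caller can switch to postcopy
     * instead of driving the remainder all the way to zero.
     */
    static bool should_switch_to_postcopy(Handler *h, size_t n,
                                          uint64_t max_size, bool requested)
    {
        uint64_t nonpost = 0, post = 0;

        for (size_t i = 0; i < n; i++) {
            uint64_t tmp_non = 0, tmp_post = 0;

            h[i].pending(h[i].opaque, max_size, &tmp_non, &tmp_post);
            /* A device must report 0 postcopiable pending unless its
             * can_postcopy handler returns true. */
            if (!h[i].can_postcopy || !h[i].can_postcopy(h[i].opaque)) {
                tmp_post = 0;
            }
            nonpost += tmp_non;
            post += tmp_post;
        }
        return requested && nonpost <= max_size && post > 0;
    }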

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                 | 15 +++++++++++++--
 include/migration/vmstate.h | 10 ++++++++--
 include/sysemu/sysemu.h     |  4 +++-
 migration/block.c           |  7 +++++--
 migration/migration.c       |  9 +++++++--
 savevm.c                    | 21 +++++++++++++++++----
 trace-events                |  2 +-
 7 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index fe0df0d..7bc5fa6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -997,7 +997,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
+static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+                             uint64_t *non_postcopiable_pending,
+                             uint64_t *postcopiable_pending)
 {
     uint64_t remaining_size;
 
@@ -1009,7 +1011,9 @@ static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
         qemu_mutex_unlock_iothread();
         remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
     }
-    return remaining_size;
+
+    *non_postcopiable_pending = 0;
+    *postcopiable_pending = remaining_size;
 }
 
 static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
@@ -1204,6 +1208,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+/* RAM's always up for postcopying */
+static bool ram_can_postcopy(void *opaque)
+{
+    return true;
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_live_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
@@ -1211,6 +1221,7 @@ static SaveVMHandlers savevm_ram_handlers = {
     .save_live_pending = ram_save_pending,
     .load_state = ram_load,
     .cancel = ram_migration_cancel,
+    .can_postcopy = ram_can_postcopy,
 };
 
 void ram_mig_init(void)
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 18da207..c9ec74a 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -54,8 +54,14 @@ typedef struct SaveVMHandlers {
 
     /* This runs outside the iothread lock!  */
     int (*save_live_setup)(QEMUFile *f, void *opaque);
-    uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
-
+    /*
+     * postcopiable_pending must return 0 unless the can_postcopy
+     * handler returns true.
+     */
+    void (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size,
+                              uint64_t *non_postcopiable_pending,
+                              uint64_t *postcopiable_pending);
+    bool (*can_postcopy)(void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index e83bf80..5f518b3 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -111,7 +111,9 @@ void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
 void qemu_savevm_state_cancel(void);
-uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
+void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
+                               uint64_t *res_non_postcopiable,
+                               uint64_t *res_postcopiable);
 void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
                               uint16_t len, uint8_t *data);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
diff --git a/migration/block.c b/migration/block.c
index 0c76106..0f6f209 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -754,7 +754,9 @@ static int block_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
+static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+                               uint64_t *non_postcopiable_pending,
+                               uint64_t *postcopiable_pending)
 {
     /* Estimate pending number of bytes to send */
     uint64_t pending;
@@ -773,7 +775,8 @@ static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
     qemu_mutex_unlock_iothread();
 
     DPRINTF("Enter save live pending  %" PRIu64 "\n", pending);
-    return pending;
+    *non_postcopiable_pending = pending;
+    *postcopiable_pending = 0;
 }
 
 static int block_load(QEMUFile *f, void *opaque, int version_id)
diff --git a/migration/migration.c b/migration/migration.c
index 2e6adca..a4fc7d7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -868,8 +868,13 @@ static void *migration_thread(void *opaque)
         uint64_t pending_size;
 
         if (!qemu_file_rate_limit(s->file)) {
-            pending_size = qemu_savevm_state_pending(s->file, max_size);
-            trace_migrate_pending(pending_size, max_size);
+            uint64_t pend_post, pend_nonpost;
+
+            qemu_savevm_state_pending(s->file, max_size, &pend_nonpost,
+                                      &pend_post);
+            pending_size = pend_nonpost + pend_post;
+            trace_migrate_pending(pending_size, max_size,
+                                  pend_post, pend_nonpost);
             if (pending_size && pending_size >= max_size) {
                 qemu_savevm_state_iterate(s->file);
             } else {
diff --git a/savevm.c b/savevm.c
index df48ba8..e301a0a 100644
--- a/savevm.c
+++ b/savevm.c
@@ -944,10 +944,20 @@ void qemu_savevm_state_complete(QEMUFile *f)
     qemu_fflush(f);
 }
 
-uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
+/* Give an estimate of the amount left to be transferred; the result is
+ * split between the amount for units that can do postcopy and the amount
+ * for units that can't.
+ */
+void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
+                               uint64_t *res_non_postcopiable,
+                               uint64_t *res_postcopiable)
 {
     SaveStateEntry *se;
-    uint64_t ret = 0;
+    uint64_t tmp_non_postcopiable, tmp_postcopiable;
+
+    *res_non_postcopiable = 0;
+    *res_postcopiable = 0;
+
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         if (!se->ops || !se->ops->save_live_pending) {
@@ -958,9 +968,12 @@ uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
                 continue;
             }
         }
-        ret += se->ops->save_live_pending(f, se->opaque, max_size);
+        se->ops->save_live_pending(f, se->opaque, max_size,
+                                   &tmp_non_postcopiable, &tmp_postcopiable);
+
+        *res_postcopiable += tmp_postcopiable;
+        *res_non_postcopiable += tmp_non_postcopiable;
     }
-    return ret;
 }
 
 void qemu_savevm_state_cancel(void)
diff --git a/trace-events b/trace-events
index cbf995c..83312b6 100644
--- a/trace-events
+++ b/trace-events
@@ -1400,7 +1400,7 @@ migrate_fd_cleanup(void) ""
 migrate_fd_cleanup_src_rp(void) ""
 migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
-migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
+migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
 open_outgoing_return_path(void) ""
 open_outgoing_return_path_continue(void) ""
-- 
2.1.0


* [Qemu-devel] [PATCH v5 21/45] Add Linux userfaultfd header
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (19 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test Dr. David Alan Gilbert (git)
                   ` (23 subsequent siblings)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

TODO: Update when kernel interface settles
      Update update-linux-headers.sh

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 linux-headers/linux/userfaultfd.h | 150 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)
 create mode 100644 linux-headers/linux/userfaultfd.h

diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
new file mode 100644
index 0000000..db6e99a
--- /dev/null
+++ b/linux-headers/linux/userfaultfd.h
@@ -0,0 +1,150 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS				\
+	((__u64)1 << _UFFDIO_REGISTER |		\
+	 (__u64)1 << _UFFDIO_UNREGISTER |	\
+	 (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS			\
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_REMAP)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctl can be added and userland will be aware of
+ * which ioctl the running kernel implements through the ioctl command
+ * bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_COPY			(0x03)
+#define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_REMAP			(0x05)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER, \
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+#define UFFDIO_COPY		_IOWR(UFFDIO, _UFFDIO_COPY,	\
+				      struct uffdio_copy)
+#define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
+				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+	/* userland asks for an API number */
+	__u64 api;
+
+	/* kernel answers below with the available features for the API */
+	__u64 bits;
+	__u64 ioctls;
+};
+
+struct uffdio_range {
+	__u64 start;
+	__u64 len;
+};
+
+struct uffdio_register {
+	struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING	((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP		((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * kernel answers which ioctl commands are available for the
+	 * range, keep at the end as the last 8 bytes aren't read.
+	 */
+	__u64 ioctls;
+};
+
+struct uffdio_copy {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * There will be a wrprotection flag later that allows to map
+	 * pages wrprotected on the fly. And such a flag will be
+	 * available if the wrprotection ioctl are implemented for the
+	 * range according to the uffdio_register.ioctls.
+	 */
+#define UFFDIO_COPY_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "copy" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 copy;
+	__s64 wake;
+};
+
+struct uffdio_zeropage {
+	struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * "zeropage" and "wake" are written by the ioctl and must be
+	 * at the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 zeropage;
+	__s64 wake;
+};
+
+struct uffdio_remap {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * Especially if used to atomically remove memory from the
+	 * address space the wake on the dst range is not needed.
+	 */
+#define UFFDIO_REMAP_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES	((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * "remap" and "wake" are written by the ioctl and must be at
+	 * the end: the copy_from_user will not read the last 16
+	 * bytes.
+	 */
+	__s64 remap;
+	__s64 wake;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
-- 
2.1.0


* [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (20 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 21/45] Add Linux userfaultfd header Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  1:23   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy Dr. David Alan Gilbert (git)
                   ` (22 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Provide a check to see if the OS we're running on has all the bits
needed for postcopy.

Creates postcopy-ram.c which will get most of the other helpers we need.
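
The facility being probed here is used later in the series roughly as in
the sketch below, written against the userfault16 header from the previous
patch, where read() returns a raw fault address rather than a message
structure; the function name is mine and error handling is trimmed:

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <asm/types.h>            /* for __u64 */
    #include <linux/userfaultfd.h>

    /*
     * Service one missing-page fault: read() hands back the faulting
     * address with UFFD_BIT_* flags in the low bits, and UFFDIO_COPY
     * atomically fills the page and wakes the faulting thread.
     */
    static int service_one_fault(int ufd, void *src_page, long pagesize)
    {
        __u64 addr;
        struct uffdio_copy copy;

        if (read(ufd, &addr, sizeof(addr)) != sizeof(addr)) {
            return -1;
        }
        addr &= ~(__u64)(pagesize - 1);   /* strip flags, align to host page */

        copy.dst = addr;
        copy.src = (__u64)(uintptr_t)src_page;
        copy.len = pagesize;
        copy.mode = 0;                    /* 0 = wake the faulting thread */
        return ioctl(ufd, UFFDIO_COPY, &copy);
    }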

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/postcopy-ram.h |  19 +++++
 migration/Makefile.objs          |   2 +-
 migration/postcopy-ram.c         | 161 +++++++++++++++++++++++++++++++++++++++
 savevm.c                         |   5 ++
 4 files changed, 186 insertions(+), 1 deletion(-)
 create mode 100644 include/migration/postcopy-ram.h
 create mode 100644 migration/postcopy-ram.c

diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
new file mode 100644
index 0000000..d81934f
--- /dev/null
+++ b/include/migration/postcopy-ram.h
@@ -0,0 +1,19 @@
+/*
+ * Postcopy migration for RAM
+ *
+ * Copyright 2013 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Dave Gilbert  <dgilbert@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#ifndef QEMU_POSTCOPY_RAM_H
+#define QEMU_POSTCOPY_RAM_H
+
+/* Return true if the host supports everything we need to do postcopy-ram */
+bool postcopy_ram_supported_by_host(void);
+
+#endif
diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index d929e96..0cac6d7 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -1,7 +1,7 @@
 common-obj-y += migration.o tcp.o
 common-obj-y += vmstate.o
 common-obj-y += qemu-file.o qemu-file-buf.o qemu-file-unix.o qemu-file-stdio.o
-common-obj-y += xbzrle.o
+common-obj-y += xbzrle.o postcopy-ram.o
 
 common-obj-$(CONFIG_RDMA) += rdma.o
 common-obj-$(CONFIG_POSIX) += exec.o unix.o fd.o
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
new file mode 100644
index 0000000..a0e20b2
--- /dev/null
+++ b/migration/postcopy-ram.c
@@ -0,0 +1,161 @@
+/*
+ * Postcopy migration for RAM
+ *
+ * Copyright 2013-2014 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Dave Gilbert  <dgilbert@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/*
+ * Postcopy is a migration technique where the execution flips from the
+ * source to the destination before all the data has been copied.
+ */
+
+#include <glib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "migration/postcopy-ram.h"
+#include "sysemu/sysemu.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+/* Postcopy needs to detect accesses to pages that haven't yet been copied
+ * across, and to map new pages in efficiently; the techniques for doing
+ * this are target-OS specific.
+ */
+#if defined(__linux__)
+
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <asm/types.h> /* for __u64 */
+#include <linux/userfaultfd.h>
+
+#ifdef HOST_X86_64
+#ifndef __NR_userfaultfd
+#define __NR_userfaultfd 323
+#endif
+#endif
+
+#endif
+
+#if defined(__linux__) && defined(__NR_userfaultfd)
+
+static bool ufd_version_check(int ufd)
+{
+    struct uffdio_api api_struct;
+    uint64_t feature_mask;
+
+    api_struct.api = UFFD_API;
+    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
+        perror("postcopy_ram_supported_by_host: UFFDIO_API failed");
+        return false;
+    }
+
+    feature_mask = (__u64)1 << _UFFDIO_REGISTER |
+                   (__u64)1 << _UFFDIO_UNREGISTER;
+    if ((api_struct.ioctls & feature_mask) != feature_mask) {
+        error_report("Missing userfault features: %" PRIu64,
+                     (uint64_t)(~api_struct.ioctls & feature_mask));
+        return false;
+    }
+
+    return true;
+}
+
+bool postcopy_ram_supported_by_host(void)
+{
+    long pagesize = getpagesize();
+    int ufd = -1;
+    bool ret = false; /* Error unless we change it */
+    void *testarea = NULL;
+    struct uffdio_register reg_struct;
+    struct uffdio_range range_struct;
+    uint64_t feature_mask;
+
+    if ((1ul << qemu_target_page_bits()) > pagesize) {
+        /* The PMI code doesn't yet deal with TPS>HPS */
+        error_report("Target page size bigger than host page size");
+        goto out;
+    }
+
+    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    if (ufd == -1) {
+        perror("postcopy_ram_supported_by_host: userfaultfd not available");
+        goto out;
+    }
+
+    /* Version and features check */
+    if (!ufd_version_check(ufd)) {
+        goto out;
+    }
+
+    /*
+     *  We need to check that the ops we need are supported on anon memory
+     *  To do that we need to register a chunk and see the flags that
+     *  are returned.
+     */
+    testarea = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_PRIVATE |
+                                    MAP_ANONYMOUS, -1, 0);
+    if (testarea == MAP_FAILED) {
+        perror("postcopy_ram_supported_by_host: Failed to map test area");
+        testarea = NULL;
+        goto out;
+    }
+    g_assert(((size_t)testarea & (pagesize-1)) == 0);
+
+    reg_struct.range.start = (uint64_t)(uintptr_t)testarea;
+    reg_struct.range.len = (uint64_t)pagesize;
+    reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+
+    if (ioctl(ufd, UFFDIO_REGISTER, &reg_struct)) {
+        perror("postcopy_ram_supported_by_host userfault register");
+        goto out;
+    }
+
+    range_struct.start = (uint64_t)(uintptr_t)testarea;
+    range_struct.len = (uint64_t)pagesize;
+    if (ioctl(ufd, UFFDIO_UNREGISTER, &range_struct)) {
+        perror("postcopy_ram_supported_by_host userfault unregister");
+        goto out;
+    }
+
+    feature_mask = (__u64)1 << _UFFDIO_WAKE |
+                   (__u64)1 << _UFFDIO_COPY |
+                   (__u64)1 << _UFFDIO_ZEROPAGE;
+    if ((reg_struct.ioctls & feature_mask) != feature_mask) {
+        error_report("Missing userfault map features: %" PRIu64,
+                     (uint64_t)(~reg_struct.ioctls & feature_mask));
+        goto out;
+    }
+
+    /* Success! */
+    ret = true;
+out:
+    if (testarea) {
+        munmap(testarea, pagesize);
+    }
+    if (ufd != -1) {
+        close(ufd);
+    }
+    return ret;
+}
+
+#else
+/* No target OS support, stubs just fail */
+
+bool postcopy_ram_supported_by_host(void)
+{
+    error_report("%s: No OS support", __func__);
+    return false;
+}
+
+#endif
+
diff --git a/savevm.c b/savevm.c
index e301a0a..2ea4c76 100644
--- a/savevm.c
+++ b/savevm.c
@@ -33,6 +33,7 @@
 #include "qemu/timer.h"
 #include "audio/audio.h"
 #include "migration/migration.h"
+#include "migration/postcopy-ram.h"
 #include "qemu/sockets.h"
 #include "qemu/queue.h"
 #include "sysemu/cpus.h"
@@ -1109,6 +1110,10 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
         return -1;
     }
 
+    if (!postcopy_ram_supported_by_host()) {
+        return -1;
+    }
+
     if (remote_hps != getpagesize())  {
         /*
          * Some combinations of mismatch are probably possible but it gets
-- 
2.1.0


* [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (21 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  1:26   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state Dr. David Alan Gilbert (git)
                   ` (21 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Once postcopy is enabled (with migrate_set_capability), the migration
will still start in precopy mode.  To cause the transition into
postcopy, the:

  migrate_start_postcopy

command must be issued.  Postcopy will start some time after this
(when it is next checked in the migration loop).

Issuing the command before migration has started is an error, and
issuing it after migration has finished is ignored.
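
The command itself only sets a flag; the migration loop notices it on its
next pass.  A tiny sketch of that hand-off pattern in plain C11 atomics
(none of these names are QEMU API):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Set by the monitor thread, polled by the migration loop */
    static atomic_bool postcopy_flag;

    /* Monitor side: request the transition; it takes effect asynchronously */
    static void request_postcopy(void)
    {
        atomic_store(&postcopy_flag, true);
    }

    /* Migration loop side: checked once per iteration of the loop */
    static bool postcopy_requested(void)
    {
        return atomic_load(&postcopy_flag);
    }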

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 hmp-commands.hx               | 15 +++++++++++++++
 hmp.c                         |  7 +++++++
 hmp.h                         |  1 +
 include/migration/migration.h |  3 +++
 migration/migration.c         | 22 ++++++++++++++++++++++
 qapi-schema.json              |  8 ++++++++
 qmp-commands.hx               | 19 +++++++++++++++++++
 7 files changed, 75 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index e37bc8b..03b8b78 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
 ETEXI
 
     {
+        .name       = "migrate_start_postcopy",
+        .args_type  = "",
+        .params     = "",
+        .help       = "Switch migration to postcopy mode",
+        .mhandler.cmd = hmp_migrate_start_postcopy,
+    },
+
+STEXI
+@item migrate_start_postcopy
+@findex migrate_start_postcopy
+Switch in-progress migration to postcopy mode. Ignored after the end of
+migration (or once already in postcopy).
+ETEXI
+
+    {
         .name       = "client_migrate_info",
         .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
         .params     = "protocol hostname port tls-port cert-subject",
diff --git a/hmp.c b/hmp.c
index b47f331..df9736c 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
     }
 }
 
+void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    qmp_migrate_start_postcopy(&err);
+    hmp_handle_error(mon, &err);
+}
+
 void hmp_set_password(Monitor *mon, const QDict *qdict)
 {
     const char *protocol  = qdict_get_str(qdict, "protocol");
diff --git a/hmp.h b/hmp.h
index 4bb5dca..da1334f 100644
--- a/hmp.h
+++ b/hmp.h
@@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
+void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
 void hmp_set_password(Monitor *mon, const QDict *qdict);
 void hmp_expire_password(Monitor *mon, const QDict *qdict);
 void hmp_eject(Monitor *mon, const QDict *qdict);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index e6a814a..293c83e 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -104,6 +104,9 @@ struct MigrationState
     int64_t xbzrle_cache_size;
     int64_t setup_time;
     int64_t dirty_sync_count;
+
+    /* Flag set once the migration has been asked to enter postcopy */
+    bool start_postcopy;
 };
 
 void process_incoming_migration(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index a4fc7d7..43ca656 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
     }
 }
 
+void qmp_migrate_start_postcopy(Error **errp)
+{
+    MigrationState *s = migrate_get_current();
+
+    if (!migrate_postcopy_ram()) {
+        error_setg(errp, "Enable postcopy with migration_set_capability before"
+                         " the start of migration");
+        return;
+    }
+
+    if (s->state == MIG_STATE_NONE) {
+        error_setg(errp, "Postcopy must be started after migration has been"
+                         " started");
+        return;
+    }
+    /*
+     * We don't error if migration has finished, since that would be racy
+     * with issuing this command.
+     */
+    atomic_set(&s->start_postcopy, true);
+}
+
 /* shared migration helpers */
 
 static void migrate_set_state(MigrationState *s, int old_state, int new_state)
diff --git a/qapi-schema.json b/qapi-schema.json
index a8af1cb..7ff61e9 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -542,6 +542,14 @@
 { 'command': 'query-migrate-capabilities', 'returns':   ['MigrationCapabilityStatus']}
 
 ##
+# @migrate-start-postcopy
+#
+# Switch migration to postcopy mode
+#
+# Since: 2.3
+##
+{ 'command': 'migrate-start-postcopy' }
+
+##
 # @MouseInfo:
 #
 # Information about a mouse device.
diff --git a/qmp-commands.hx b/qmp-commands.hx
index a85d847..25d2208 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -685,6 +685,25 @@ Example:
 
 EQMP
     {
+        .name       = "migrate-start-postcopy",
+        .args_type  = "",
+        .mhandler.cmd_new = qmp_marshal_input_migrate_start_postcopy,
+    },
+
+SQMP
+migrate-start-postcopy
+----------------------
+
+Switch an in-progress migration to postcopy mode. Ignored after the end of
+migration (or once already in postcopy).
+
+Example:
+-> { "execute": "migrate-start-postcopy" }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "query-migrate-cache-size",
         .args_type  = "",
         .mhandler.cmd_new = qmp_marshal_input_query_migrate_cache_size,
-- 
2.1.0


* [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (22 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  4:45   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes Dr. David Alan Gilbert (git)
                   ` (20 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

'MIG_STATE_POSTCOPY_ACTIVE' is entered after migrate_start_postcopy.

'migration_postcopy_phase' is provided so that other code can tell
whether the outgoing migration is in the postcopy phase.
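
A rough sketch of the transition pattern this enables: move ACTIVE to
POSTCOPY_ACTIVE with a compare-and-swap so that a concurrent cancel or
error is not overwritten (names are illustrative, not the code in the
patch):

    #include <stdatomic.h>
    #include <stdbool.h>

    enum mig_phase {
        PHASE_NONE, PHASE_SETUP, PHASE_ACTIVE, PHASE_POSTCOPY_ACTIVE,
        PHASE_CANCELLING, PHASE_COMPLETED, PHASE_ERROR
    };

    static _Atomic int cur_phase = PHASE_NONE;

    /* Only move ACTIVE -> POSTCOPY_ACTIVE; if another thread has already
     * cancelled or failed the migration, leave its state alone. */
    static bool enter_postcopy_phase(void)
    {
        int expected = PHASE_ACTIVE;

        return atomic_compare_exchange_strong(&cur_phase, &expected,
                                               PHASE_POSTCOPY_ACTIVE);
    }

    /* Equivalent of migration_postcopy_phase() in this sketch */
    static bool in_postcopy_phase(void)
    {
        return atomic_load(&cur_phase) == PHASE_POSTCOPY_ACTIVE;
    }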

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  2 ++
 migration/migration.c         | 58 ++++++++++++++++++++++++++++++++++++++-----
 trace-events                  |  1 +
 3 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 293c83e..b44b9b2 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -151,6 +151,8 @@ MigrationState *migrate_init(const MigrationParams *params);
 bool migration_in_setup(MigrationState *);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
+/* True if outgoing migration has entered postcopy phase */
+bool migration_postcopy_phase(MigrationState *);
 MigrationState *migrate_get_current(void);
 
 uint64_t ram_bytes_remaining(void);
diff --git a/migration/migration.c b/migration/migration.c
index 43ca656..6b20b56 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -26,13 +26,14 @@
 #include "qmp-commands.h"
 #include "trace.h"
 
-enum {
+enum MigrationPhase {
     MIG_STATE_ERROR = -1,
     MIG_STATE_NONE,
     MIG_STATE_SETUP,
     MIG_STATE_CANCELLING,
     MIG_STATE_CANCELLED,
     MIG_STATE_ACTIVE,
+    MIG_STATE_POSTCOPY_ACTIVE,
     MIG_STATE_COMPLETED,
 };
 
@@ -247,6 +248,7 @@ static bool migration_already_active(MigrationState *ms)
 {
     switch (ms->state) {
     case MIG_STATE_ACTIVE:
+    case MIG_STATE_POSTCOPY_ACTIVE:
     case MIG_STATE_SETUP:
         return true;
 
@@ -319,6 +321,40 @@ MigrationInfo *qmp_query_migrate(Error **errp)
 
         get_xbzrle_cache_stats(info);
         break;
+    case MIG_STATE_POSTCOPY_ACTIVE:
+        /* Mostly the same as active; TODO add some postcopy stats */
+        info->has_status = true;
+        info->status = g_strdup("postcopy-active");
+        info->has_total_time = true;
+        info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
+            - s->total_time;
+        info->has_expected_downtime = true;
+        info->expected_downtime = s->expected_downtime;
+        info->has_setup_time = true;
+        info->setup_time = s->setup_time;
+
+        info->has_ram = true;
+        info->ram = g_malloc0(sizeof(*info->ram));
+        info->ram->transferred = ram_bytes_transferred();
+        info->ram->remaining = ram_bytes_remaining();
+        info->ram->total = ram_bytes_total();
+        info->ram->duplicate = dup_mig_pages_transferred();
+        info->ram->skipped = skipped_mig_pages_transferred();
+        info->ram->normal = norm_mig_pages_transferred();
+        info->ram->normal_bytes = norm_mig_bytes_transferred();
+        info->ram->dirty_pages_rate = s->dirty_pages_rate;
+        info->ram->mbps = s->mbps;
+
+        if (blk_mig_active()) {
+            info->has_disk = true;
+            info->disk = g_malloc0(sizeof(*info->disk));
+            info->disk->transferred = blk_mig_bytes_transferred();
+            info->disk->remaining = blk_mig_bytes_remaining();
+            info->disk->total = blk_mig_bytes_total();
+        }
+
+        get_xbzrle_cache_stats(info);
+        break;
     case MIG_STATE_COMPLETED:
         get_xbzrle_cache_stats(info);
 
@@ -362,7 +398,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
     MigrationState *s = migrate_get_current();
     MigrationCapabilityStatusList *cap;
 
-    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
+    if (migration_already_active(s)) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -437,7 +473,8 @@ static void migrate_fd_cleanup(void *opaque)
         s->file = NULL;
     }
 
-    assert(s->state != MIG_STATE_ACTIVE);
+    assert((s->state != MIG_STATE_ACTIVE) &&
+           (s->state != MIG_STATE_POSTCOPY_ACTIVE));
 
     if (s->state != MIG_STATE_COMPLETED) {
         qemu_savevm_state_cancel();
@@ -471,7 +508,8 @@ static void migrate_fd_cancel(MigrationState *s)
 
     do {
         old_state = s->state;
-        if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
+        if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE &&
+            old_state != MIG_STATE_POSTCOPY_ACTIVE) {
             break;
         }
         migrate_set_state(s, old_state, MIG_STATE_CANCELLING);
@@ -515,6 +553,11 @@ bool migration_has_failed(MigrationState *s)
             s->state == MIG_STATE_ERROR);
 }
 
+bool migration_postcopy_phase(MigrationState *s)
+{
+    return (s->state == MIG_STATE_POSTCOPY_ACTIVE);
+}
+
 MigrationState *migrate_init(const MigrationParams *params)
 {
     MigrationState *s = migrate_get_current();
@@ -563,7 +606,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
     params.blk = has_blk && blk;
     params.shared = has_inc && inc;
 
-    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP ||
+    if (migration_already_active(s) ||
         s->state == MIG_STATE_CANCELLING) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
@@ -885,7 +928,10 @@ static void *migration_thread(void *opaque)
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
     migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
 
-    while (s->state == MIG_STATE_ACTIVE) {
+    trace_migration_thread_setup_complete();
+
+    while (s->state == MIG_STATE_ACTIVE ||
+           s->state == MIG_STATE_POSTCOPY_ACTIVE) {
         int64_t current_time;
         uint64_t pending_size;
 
diff --git a/trace-events b/trace-events
index 83312b6..941976a 100644
--- a/trace-events
+++ b/trace-events
@@ -1402,6 +1402,7 @@ migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
+migration_thread_setup_complete(void) ""
 open_outgoing_return_path(void) ""
 open_outgoing_return_path_continue(void) ""
 source_return_path_thread_bad_end(void) ""
-- 
2.1.0


* [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (23 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  4:58   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure Dr. David Alan Gilbert (git)
                   ` (19 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When postcopy calls qemu_savevm_state_complete it's not really
the end of migration, so skip:
   a) Finishing postcopiable iterative devices - they'll carry on
   b) The termination byte on the end of the stream.

We then also add:
  qemu_savevm_state_postcopy_complete
which is called at the end of a postcopy migration to call the
complete methods on devices skipped in the _complete call.
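
A simplified sketch of the two-pass idea: complete the non-postcopiable
devices at the switch-over, and finish the postcopiable ones at the very
end (the types here are stand-ins, not QEMU's SaveVMHandlers):

    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in for a device's save handlers */
    typedef struct {
        bool (*can_postcopy)(void *opaque);
        int  (*complete)(void *opaque);
        void *opaque;
    } Dev;

    /* Pass 1: at the switch to postcopy, complete only the devices that
     * cannot carry on afterwards; no EOF yet, the stream keeps going. */
    static int complete_non_postcopiable(Dev *d, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (d[i].can_postcopy && d[i].can_postcopy(d[i].opaque)) {
                continue;           /* still has work to do during postcopy */
            }
            int ret = d[i].complete(d[i].opaque);
            if (ret < 0) {
                return ret;
            }
        }
        return 0;
    }

    /* Pass 2: at the end of postcopy, complete the devices skipped above,
     * then the caller terminates the stream. */
    static int complete_postcopiable(Dev *d, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!d[i].can_postcopy || !d[i].can_postcopy(d[i].opaque)) {
                continue;
            }
            int ret = d[i].complete(d[i].opaque);
            if (ret < 0) {
                return ret;
            }
        }
        return 0;
    }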

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/sysemu/sysemu.h |  1 +
 savevm.c                | 61 ++++++++++++++++++++++++++++++++++++++++++++-----
 trace-events            |  1 +
 3 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 5f518b3..0f2e4ed 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -114,6 +114,7 @@ void qemu_savevm_state_cancel(void);
 void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
                                uint64_t *res_non_postcopiable,
                                uint64_t *res_postcopiable);
+void qemu_savevm_state_postcopy_complete(QEMUFile *f);
 void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
                               uint16_t len, uint8_t *data);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
diff --git a/savevm.c b/savevm.c
index 2ea4c76..1e8d289 100644
--- a/savevm.c
+++ b/savevm.c
@@ -865,12 +865,54 @@ int qemu_savevm_state_iterate(QEMUFile *f)
     return ret;
 }
 
+/*
+ * Calls the complete routines just for those devices that are postcopiable,
+ * causing the last few pages to be sent immediately and doing any associated
+ * cleanup.
+ * Note postcopy also calls the plain qemu_savevm_state_complete to complete
+ * all the other devices, but that happens at the point we switch to postcopy.
+ */
+void qemu_savevm_state_postcopy_complete(QEMUFile *f)
+{
+    SaveStateEntry *se;
+    int ret;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        if (!se->ops || !se->ops->save_live_complete ||
+            !(se->ops->can_postcopy &&
+              se->ops->can_postcopy(se->opaque))) {
+            continue;
+        }
+        if (se->ops && se->ops->is_active) {
+            if (!se->ops->is_active(se->opaque)) {
+                continue;
+            }
+        }
+        trace_savevm_section_start(se->idstr, se->section_id);
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_END);
+        qemu_put_be32(f, se->section_id);
+
+        ret = se->ops->save_live_complete(f, se->opaque);
+        trace_savevm_section_end(se->idstr, se->section_id, ret);
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+            return;
+        }
+    }
+
+    qemu_savevm_send_postcopy_end(f, 0 /* Good */);
+    qemu_put_byte(f, QEMU_VM_EOF);
+    qemu_fflush(f);
+}
+
 void qemu_savevm_state_complete(QEMUFile *f)
 {
     QJSON *vmdesc;
     int vmdesc_len;
     SaveStateEntry *se;
     int ret;
+    bool in_postcopy = migration_postcopy_phase(migrate_get_current());
 
     trace_savevm_state_complete();
 
@@ -885,6 +927,11 @@ void qemu_savevm_state_complete(QEMUFile *f)
                 continue;
             }
         }
+        if (in_postcopy && se->ops && se->ops->can_postcopy &&
+            se->ops->can_postcopy(se->opaque)) {
+            trace_qemu_savevm_state_complete_skip_for_postcopy(se->idstr);
+            continue;
+        }
         trace_savevm_section_start(se->idstr, se->section_id);
         /* Section type */
         qemu_put_byte(f, QEMU_VM_SECTION_END);
@@ -931,15 +978,17 @@ void qemu_savevm_state_complete(QEMUFile *f)
         trace_savevm_section_end(se->idstr, se->section_id, 0);
     }
 
-    qemu_put_byte(f, QEMU_VM_EOF);
-
     json_end_array(vmdesc);
     qjson_finish(vmdesc);
-    vmdesc_len = strlen(qjson_get_str(vmdesc));
+    if (!in_postcopy) {
+        /* Postcopy stream will still be going */
+        qemu_put_byte(f, QEMU_VM_EOF);
+        vmdesc_len = strlen(qjson_get_str(vmdesc));
 
-    qemu_put_byte(f, QEMU_VM_VMDESCRIPTION);
-    qemu_put_be32(f, vmdesc_len);
-    qemu_put_buffer(f, (uint8_t *)qjson_get_str(vmdesc), vmdesc_len);
+        qemu_put_byte(f, QEMU_VM_VMDESCRIPTION);
+        qemu_put_be32(f, vmdesc_len);
+        qemu_put_buffer(f, (uint8_t *)qjson_get_str(vmdesc), vmdesc_len);
+    }
     object_unref(OBJECT(vmdesc));
 
     qemu_fflush(f);
diff --git a/trace-events b/trace-events
index 941976a..a555b56 100644
--- a/trace-events
+++ b/trace-events
@@ -1185,6 +1185,7 @@ loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
 loadvm_process_command_ping(uint32_t val) "%x"
 qemu_savevm_send_postcopy_advise(void) ""
 qemu_savevm_send_postcopy_ram_discard(void) ""
+qemu_savevm_state_complete_skip_for_postcopy(const char *section) "skipping: %s"
 savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
 savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
 savevm_send_ping(uint32_t val) "%x"
-- 
2.1.0


* [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (24 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-13  5:19   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard Dr. David Alan Gilbert (git)
                   ` (18 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The PMI holds the state of each page on the incoming side, so that
we can tell whether a page is missing, already received, or has an
outstanding request for it.
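
The three states pack into two parallel bitmaps.  A simplified
per-target-page sketch of that encoding (the code in this patch groups
bits per host page and takes a lock; these helpers are only an
illustration):

    #include <limits.h>
    #include <stddef.h>

    enum pmi_state { PMI_MISSING = 0, PMI_REQUESTED = 1, PMI_RECEIVED = 2 };

    #define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

    struct pmi {
        unsigned long *state0;    /* low bit of each page's state */
        unsigned long *state1;    /* high bit of each page's state */
    };

    static enum pmi_state pmi_get(const struct pmi *p, size_t page)
    {
        size_t w = page / BITS_PER_WORD, b = page % BITS_PER_WORD;
        int b0 = (p->state0[w] >> b) & 1;
        int b1 = (p->state1[w] >> b) & 1;

        return (enum pmi_state)(b0 | (b1 << 1));
    }

    static void pmi_set(struct pmi *p, size_t page, enum pmi_state s)
    {
        size_t w = page / BITS_PER_WORD, b = page % BITS_PER_WORD;
        unsigned long mask = 1ul << b;

        p->state0[w] = (s & 1) ? (p->state0[w] | mask) : (p->state0[w] & ~mask);
        p->state1[w] = (s & 2) ? (p->state1[w] | mask) : (p->state1[w] & ~mask);
    }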

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h    |  18 ++++
 include/migration/postcopy-ram.h |  12 +++
 include/qemu/typedefs.h          |   1 +
 migration/postcopy-ram.c         | 223 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 254 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index b44b9b2..86200b9 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -48,6 +48,23 @@ enum mig_rpcomm_cmd {
     MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
 };
 
+/* Postcopy page-map-incoming - data about each page on the inbound side */
+typedef enum {
+   POSTCOPY_PMI_MISSING    = 0, /* page hasn't yet been received */
+   POSTCOPY_PMI_REQUESTED  = 1, /* Kernel asked for a page, not yet got it */
+   POSTCOPY_PMI_RECEIVED   = 2, /* We've got the page */
+} PostcopyPMIState;
+
+struct PostcopyPMI {
+    QemuMutex      mutex;
+    unsigned long *state0;        /* Together with state1 form a */
+    unsigned long *state1;        /* PostcopyPMIState */
+    unsigned long  host_mask;     /* A mask with enough bits set to cover one
+                                     host page in the PMI */
+    unsigned long  host_bits;     /* The number of bits in the map representing
+                                     one host page */
+};
+
 typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
 
 typedef enum {
@@ -69,6 +86,7 @@ struct MigrationIncomingState {
 
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
+    PostcopyPMI    postcopy_pmi;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index d81934f..e93ee8a 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -13,7 +13,19 @@
 #ifndef QEMU_POSTCOPY_RAM_H
 #define QEMU_POSTCOPY_RAM_H
 
+#include "migration/migration.h"
+
 /* Return true if the host supports everything we need to do postcopy-ram */
 bool postcopy_ram_supported_by_host(void);
 
+/*
+ * In 'advise' mode record that a page has been received.
+ */
+void postcopy_hook_early_receive(MigrationIncomingState *mis,
+                                 size_t bitmap_index);
+
+void postcopy_pmi_destroy(MigrationIncomingState *mis);
+void postcopy_pmi_discard_range(MigrationIncomingState *mis,
+                                size_t start, size_t npages);
+void postcopy_pmi_dump(MigrationIncomingState *mis);
 #endif
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 611db46..924eeb6 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
 typedef struct PCIHostState PCIHostState;
 typedef struct PCMCIACardState PCMCIACardState;
 typedef struct PixelFormat PixelFormat;
+typedef struct PostcopyPMI PostcopyPMI;
 typedef struct PropertyInfo PropertyInfo;
 typedef struct Property Property;
 typedef struct QEMUBH QEMUBH;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a0e20b2..4f29055 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -24,6 +24,7 @@
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
 #include "sysemu/sysemu.h"
+#include "qemu/bitmap.h"
 #include "qemu/error-report.h"
 #include "trace.h"
 
@@ -49,6 +50,220 @@
 
 #if defined(__linux__) && defined(__NR_userfaultfd)
 
+/* ---------------------------------------------------------------------- */
+/*
+ * Postcopy pagemap-inbound (PMI) - data structures that record the state
+ * of each page used by the inbound postcopy.
+ * It's a pair of bitmaps (of the same structure as the migration bitmaps),
+ * holding one bit per target page.  Most operations work on host pages;
+ * the exception is a hook that receives incoming pages off the migration
+ * stream, which arrive one TP at a time.  The source _should_ guarantee
+ * that it sends a sequence of TPs representing whole HPs during the
+ * postcopy phase, but there is no such guarantee during precopy.  We could
+ * boil this down to one bit per host page, but then we would lose the
+ * sanity check that we really do get whole host pages from the source.
+ */
+__attribute__ (( unused )) /* Until later in patch series */
+static void postcopy_pmi_init(MigrationIncomingState *mis, size_t ram_pages)
+{
+    unsigned int tpb = qemu_target_page_bits();
+    unsigned long host_bits;
+
+    qemu_mutex_init(&mis->postcopy_pmi.mutex);
+    mis->postcopy_pmi.state0 = bitmap_new(ram_pages);
+    mis->postcopy_pmi.state1 = bitmap_new(ram_pages);
+    bitmap_clear(mis->postcopy_pmi.state0, 0, ram_pages);
+    bitmap_clear(mis->postcopy_pmi.state1, 0, ram_pages);
+    /*
+     * Each bit in the map represents one 'target page' which is no bigger
+     * than a host page but can be smaller.  It's useful to have some
+     * convenience masks for later
+     */
+
+    /*
+     * The number of bits one host page takes up in the bitmap
+     * e.g. on a 64k host page, 4k Target page, host_bits=64/4=16
+     */
+    host_bits = getpagesize() / (1ul << tpb);
+    assert(is_power_of_2(host_bits));
+
+    mis->postcopy_pmi.host_bits = host_bits;
+
+    if (host_bits < BITS_PER_LONG) {
+        /* A mask starting at bit 0 containing host_bits continuous set bits */
+        mis->postcopy_pmi.host_mask =  (1ul << host_bits) - 1;
+    } else {
+        /*
+         * This is a host where the ratio between host and target pages is
+         * bigger than the size of our longs, so we can't make a mask
+         * but we are only losing sanity checking if we just check one long's
+         * worth of bits.
+         */
+        mis->postcopy_pmi.host_mask = ~0l;
+    }
+
+
+    assert((ram_pages % host_bits) == 0);
+}
+
+void postcopy_pmi_destroy(MigrationIncomingState *mis)
+{
+    g_free(mis->postcopy_pmi.state0);
+    mis->postcopy_pmi.state0 = NULL;
+    g_free(mis->postcopy_pmi.state1);
+    mis->postcopy_pmi.state1 = NULL;
+    qemu_mutex_destroy(&mis->postcopy_pmi.mutex);
+}
+
+/*
+ * Mark a set of pages in the PMI as being clear; this is used by the discard
+ * at the start of postcopy, and before the postcopy stream starts.
+ */
+void postcopy_pmi_discard_range(MigrationIncomingState *mis,
+                                size_t start, size_t npages)
+{
+    /* Clear to state 0 = missing */
+    bitmap_clear(mis->postcopy_pmi.state0, start, npages);
+    bitmap_clear(mis->postcopy_pmi.state1, start, npages);
+}
+
+/*
+ * Test a host-page worth of bits in the map starting at bitmap_index
+ * The bits should all be consistent
+ */
+static bool test_hpbits(MigrationIncomingState *mis,
+                        size_t bitmap_index, unsigned long *map)
+{
+    long masked;
+
+    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);
+
+    masked = (map[BIT_WORD(bitmap_index)] >>
+               (bitmap_index % BITS_PER_LONG)) &
+             mis->postcopy_pmi.host_mask;
+
+    assert((masked == 0) || (masked == mis->postcopy_pmi.host_mask));
+    return !!masked;
+}
+
+/*
+ * Set host-page worth of bits in the map starting at bitmap_index
+ * to the given state
+ */
+static void set_hp(MigrationIncomingState *mis,
+                   size_t bitmap_index, PostcopyPMIState state)
+{
+    long shifted_mask = mis->postcopy_pmi.host_mask <<
+                        (bitmap_index % BITS_PER_LONG);
+
+    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);
+
+    if (state & 1) {
+        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
+    } else {
+        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
+    }
+    if (state & 2) {
+        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
+    } else {
+        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
+    }
+}
+
+/*
+ * Retrieve the state of the given page
+ * Note: This version for use by callers already holding the lock
+ */
+static PostcopyPMIState postcopy_pmi_get_state_nolock(
+                            MigrationIncomingState *mis,
+                            size_t bitmap_index)
+{
+    bool b0, b1;
+
+    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
+    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
+
+    return (b0 ? 1 : 0) + (b1 ? 2 : 0);
+}
+
+/* Retrieve the state of the given page */
+__attribute__ (( unused )) /* Until later in patch series */
+static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
+                                               size_t bitmap_index)
+{
+    PostcopyPMIState ret;
+    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
+    ret = postcopy_pmi_get_state_nolock(mis, bitmap_index);
+    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
+
+    return ret;
+}
+
+/*
+ * Set the page state to the given state if the previous state was as expected
+ * Return the actual previous state.
+ */
+__attribute__ (( unused )) /* Until later in patch series */
+static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
+                                           size_t bitmap_index,
+                                           PostcopyPMIState expected_state,
+                                           PostcopyPMIState new_state)
+{
+    PostcopyPMIState old_state;
+
+    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
+    old_state = postcopy_pmi_get_state_nolock(mis, bitmap_index);
+
+    if (old_state == expected_state) {
+        switch (new_state) {
+        case POSTCOPY_PMI_MISSING:
+            assert(0); /* This shouldn't happen - use discard_range */
+            break;
+
+        case POSTCOPY_PMI_REQUESTED:
+            assert(old_state == POSTCOPY_PMI_MISSING);
+            /* missing -> requested */
+            set_hp(mis, bitmap_index, POSTCOPY_PMI_REQUESTED);
+            break;
+
+        case POSTCOPY_PMI_RECEIVED:
+            assert(old_state == POSTCOPY_PMI_MISSING ||
+                   old_state == POSTCOPY_PMI_REQUESTED);
+            /* -> received */
+            set_hp(mis, bitmap_index, POSTCOPY_PMI_RECEIVED);
+            break;
+        }
+    }
+
+    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
+    return old_state;
+}
+
+/*
+ * Useful when debugging postcopy, although if it failed early the
+ * received map can be quite sparse and thus big when dumped.
+ */
+void postcopy_pmi_dump(MigrationIncomingState *mis)
+{
+    fprintf(stderr, "postcopy_pmi_dump: bit 0\n");
+    ram_debug_dump_bitmap(mis->postcopy_pmi.state0, false);
+    fprintf(stderr, "postcopy_pmi_dump: bit 1\n");
+    ram_debug_dump_bitmap(mis->postcopy_pmi.state1, true);
+    fprintf(stderr, "postcopy_pmi_dump: end\n");
+}
+
+/* Called by ram_load prior to mapping the page */
+void postcopy_hook_early_receive(MigrationIncomingState *mis,
+                                 size_t bitmap_index)
+{
+    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
+        /*
+         * If we're in precopy-advise mode we need to track received pages even
+         * though we don't need to place pages atomically yet.
+         * In advise mode there's only a single thread, so don't need locks
+         */
+        set_bit(bitmap_index, mis->postcopy_pmi.state1); /* 2=received */
+    }
+}
+
 static bool ufd_version_check(int ufd)
 {
     struct uffdio_api api_struct;
@@ -71,6 +286,7 @@ static bool ufd_version_check(int ufd)
     return true;
 }
 
+
 bool postcopy_ram_supported_by_host(void)
 {
     long pagesize = getpagesize();
@@ -157,5 +373,12 @@ bool postcopy_ram_supported_by_host(void)
     return false;
 }
 
+/* Called by ram_load prior to mapping the page */
+void postcopy_hook_early_receive(MigrationIncomingState *mis,
+                                 size_t bitmap_index)
+{
+    /* We don't support postcopy so don't care */
+}
+
 #endif
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (25 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-23  3:30   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation Dr. David Alan Gilbert (git)
                   ` (17 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Where postcopy is preceded by a period of precopy, the destination will
have received pages that may have been dirtied on the source after the
page was sent.  The destination must throw these pages away before
starting its CPUs.

Maintain a 'sentmap' of pages that have already been sent.
Calculate the list of sent & dirty pages.
Provide helpers on the destination side to discard these.
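
As a rough sketch of the idea (simplified here; the real code below works
on the global 'migration_bitmap' and 'ms->sentmap' inside
ram_postcopy_send_discard_bitmap()), the discard set is simply the
intersection of "sent at least once" and "dirty again":

    /* Illustrative only - assumes qemu/bitmap.h; the helper name is made up */
    static void calc_discard_set(unsigned long *sentmap,
                                 const unsigned long *dirtymap,
                                 long nbits)
    {
        /* A page needs discarding iff it was sent earlier AND is now dirty */
        bitmap_and(sentmap, sentmap, dirtymap, nbits);
    }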

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                      | 275 ++++++++++++++++++++++++++++++++++++++-
 include/migration/migration.h    |  12 ++
 include/migration/postcopy-ram.h |  34 +++++
 include/qemu/typedefs.h          |   1 +
 migration/migration.c            |   1 +
 migration/postcopy-ram.c         | 111 ++++++++++++++++
 savevm.c                         |   3 -
 trace-events                     |   4 +
 8 files changed, 435 insertions(+), 6 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 7bc5fa6..21e7ebe 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,7 @@
 #include "hw/audio/audio.h"
 #include "sysemu/kvm.h"
 #include "migration/migration.h"
+#include "migration/postcopy-ram.h"
 #include "hw/i386/smbios.h"
 #include "exec/address-spaces.h"
 #include "hw/audio/pcspk.h"
@@ -414,9 +415,17 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t **current_data,
     return bytes_sent;
 }
 
+/* mr: The region to search for dirty pages in
+ * start: Start address (typically so we can continue from previous page)
+ * ram_addr_abs: Pointer into which to store the address of the dirty page
+ *               within the global ram_addr space
+ *
+ * Returns: byte offset within memory region of the start of a dirty page
+ */
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
-                                                 ram_addr_t start)
+                                                 ram_addr_t start,
+                                                 ram_addr_t *ram_addr_abs)
 {
     unsigned long base = mr->ram_addr >> TARGET_PAGE_BITS;
     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
@@ -435,6 +444,7 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
         clear_bit(next, migration_bitmap);
         migration_dirty_pages--;
     }
+    *ram_addr_abs = next << TARGET_PAGE_BITS;
     return (next - base) << TARGET_PAGE_BITS;
 }
 
@@ -571,6 +581,19 @@ static void migration_bitmap_sync(void)
     }
 }
 
+static RAMBlock *ram_find_block(const char *id)
+{
+    RAMBlock *block;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        if (!strcmp(id, block->idstr)) {
+            return block;
+        }
+    }
+
+    return NULL;
+}
+
 /*
  * ram_save_page: Send the given page to the stream
  *
@@ -659,13 +682,16 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
     bool complete_round = false;
     int bytes_sent = 0;
     MemoryRegion *mr;
+    ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
+                                 ram_addr_t space */
 
     if (!block)
         block = QTAILQ_FIRST(&ram_list.blocks);
 
     while (true) {
         mr = block->mr;
-        offset = migration_bitmap_find_and_reset_dirty(mr, offset);
+        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
+                                                       &dirty_ram_abs);
         if (complete_round && block == last_seen_block &&
             offset >= last_offset) {
             break;
@@ -683,6 +709,11 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
 
             /* if page is unmodified, continue to the next */
             if (bytes_sent > 0) {
+                MigrationState *ms = migrate_get_current();
+                if (ms->sentmap) {
+                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
+                }
+
                 last_sent_block = block;
                 break;
             }
@@ -742,12 +773,19 @@ void free_xbzrle_decoded_buf(void)
 
 static void migration_end(void)
 {
+    MigrationState *s = migrate_get_current();
+
     if (migration_bitmap) {
         memory_global_dirty_log_stop();
         g_free(migration_bitmap);
         migration_bitmap = NULL;
     }
 
+    if (s->sentmap) {
+        g_free(s->sentmap);
+        s->sentmap = NULL;
+    }
+
     XBZRLE_cache_lock();
     if (XBZRLE.cache) {
         cache_fini(XBZRLE.cache);
@@ -815,6 +853,232 @@ void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
     }
 }
 
+/* **** functions for postcopy ***** */
+
+/*
+ * A helper to get 32 bits from a bit map; trivial for HOST_LONG_BITS=32
+ * messier for 64; the bitmaps are actually long's that are 32 or 64bit
+ */
+static uint32_t get_32bits_map(unsigned long *map, int64_t start)
+{
+#if HOST_LONG_BITS == 64
+    uint64_t tmp64;
+
+    tmp64 = map[start / 64];
+    return (start & 32) ? (tmp64 >> 32) : (tmp64 & 0xffffffffu);
+#elif HOST_LONG_BITS == 32
+    /*
+     * Irrespective of host endianness, sentmap[n] is for pages earlier
+     * than sentmap[n+1] so we can't just cast up
+     */
+    return map[start / 32];
+#else
+#error "Host long other than 64/32 not supported"
+#endif
+}
+
+/*
+ * A helper to put 32 bits into a bit map; trivial for HOST_LONG_BITS=32
+ * messier for 64; the bitmaps are actually long's that are 32 or 64bit
+ */
+__attribute__ (( unused )) /* Until later in patch series */
+static void put_32bits_map(unsigned long *map, int64_t start,
+                           uint32_t v)
+{
+#if HOST_LONG_BITS == 64
+    uint64_t tmp64 = v;
+    uint64_t mask = 0xffffffffu;
+
+    if (start & 32) {
+        tmp64 = tmp64 << 32;
+        mask =  mask << 32;
+    }
+
+    map[start / 64] = (map[start / 64] & ~mask) | tmp64;
+#elif HOST_LONG_BITS == 32
+    /*
+     * Irrespective of host endianness, sentmap[n] is for pages earlier
+     * than sentmap[n+1] so we can't just cast up
+     */
+    map[start / 32] = v;
+#else
+#error "Host long other than 64/32 not supported"
+#endif
+}
+
+/*
+ * When working on 32bit chunks of a bitmap where the only valid section
+ * is between start..end (inclusive), generate a mask with only those
+ * valid bits set for the current 32bit word within that bitmask.
+ */
+static int make_32bit_mask(unsigned long start, unsigned long end,
+                           unsigned long cur32)
+{
+    unsigned long first32, last32;
+    uint32_t mask = ~(uint32_t)0;
+    first32 = start / 32;
+    last32 = end / 32;
+
+    if ((cur32 == first32) && (start & 31)) {
+        /* e.g. (start & 31) = 3
+         *         1 << .    -> 2^3
+         *         . - 1     -> 2^3 - 1 i.e. mask 2..0
+         *         ~.        -> mask 31..3
+         */
+        mask &= ~((((uint32_t)1) << (start & 31)) - 1);
+    }
+
+    if ((cur32 == last32) && ((end & 31) != 31)) {
+        /* e.g. (end & 31) = 3
+         *            .   +1 -> 4
+         *         1 << .    -> 2^4
+         *         . -1      -> 2^4 - 1
+         *                   = mask set 3..0
+         */
+        mask &= (((uint32_t)1) << ((end & 31) + 1)) - 1;
+    }
+
+    return mask;
+}
+
+/*
+ * Callback from ram_postcopy_each_ram_discard for each RAMBlock
+ * start,end: Indexes into the bitmap for the first and last bit
+ *            representing the named block
+ */
+static int pc_send_discard_bm_ram(MigrationState *ms,
+                                  PostcopyDiscardState *pds,
+                                  unsigned long start, unsigned long end)
+{
+    /*
+     * There is no guarantee that start, end are on convenient 32bit multiples
+     * (We always send 32bit chunks over the wire, irrespective of long size)
+     */
+    unsigned long first32, last32, cur32;
+    first32 = start / 32;
+    last32 = end / 32;
+
+    for (cur32 = first32; cur32 <= last32; cur32++) {
+        /* Deal with start/end not on alignment */
+        uint32_t mask = make_32bit_mask(start, end, cur32);
+
+        uint32_t data = get_32bits_map(ms->sentmap, cur32 * 32);
+        data &= mask;
+
+        if (data) {
+            postcopy_discard_send_chunk(ms, pds, (cur32-first32) * 32, data);
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Utility for the outgoing postcopy code.
+ *   Calls pc_send_discard_bm_ram for each RAMBlock
+ *   passing it bitmap indexes and name.
+ * Returns: 0 on success
+ * (qemu_ram_foreach_block ends up passing unscaled lengths
+ *  which would mean postcopy code would have to deal with target page)
+ */
+static int pc_each_ram_discard(MigrationState *ms)
+{
+    struct RAMBlock *block;
+    int ret;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        unsigned long first = block->offset >> TARGET_PAGE_BITS;
+        unsigned long last = (block->offset + (block->max_length-1))
+                                >> TARGET_PAGE_BITS;
+        PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
+                                                               first & 31,
+                                                               block->idstr);
+
+        /*
+         * Postcopy sends chunks of bitmap over the wire, but it
+         * just needs indexes at this point, avoids it having
+         * target page specific code.
+         */
+        ret = pc_send_discard_bm_ram(ms, pds, first, last);
+        postcopy_discard_send_finish(ms, pds);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Transmit the set of pages to be discarded after precopy to the target;
+ * these are pages that have been sent previously but have been dirtied.
+ * Hopefully this set is pretty sparse.
+ */
+int ram_postcopy_send_discard_bitmap(MigrationState *ms)
+{
+    /* This should be our last sync, the src is now paused */
+    migration_bitmap_sync();
+
+    /*
+     * Update the sentmap to be  sentmap&=dirty
+     */
+    bitmap_and(ms->sentmap, ms->sentmap, migration_bitmap,
+               last_ram_offset() >> TARGET_PAGE_BITS);
+
+
+    trace_ram_postcopy_send_discard_bitmap();
+#ifdef DEBUG_POSTCOPY
+    ram_debug_dump_bitmap(ms->sentmap, false);
+#endif
+
+    return pc_each_ram_discard(ms);
+}
+
+/*
+ * At the start of the postcopy phase of migration, any now-dirty
+ * precopied pages are discarded.
+ *
+ * start..end is an inclusive range of page indexes into the source
+ *    VM's bitmap for this RAMBlock, i.e. offsets from the start of
+ *    the bitmap for the RAMBlock named 'block_name'
+ *
+ * Returns 0 on success.
+ */
+int ram_discard_range(MigrationIncomingState *mis,
+                      const char *block_name,
+                      uint64_t start, uint64_t end)
+{
+    assert(end >= start);
+
+    RAMBlock *rb = ram_find_block(block_name);
+
+    if (!rb) {
+        error_report("ram_discard_range: Failed to find block '%s'",
+                     block_name);
+        return -1;
+    }
+
+    uint64_t index_offset = rb->offset >> TARGET_PAGE_BITS;
+    postcopy_pmi_discard_range(mis, start + index_offset, (end - start) + 1);
+
+    /* +1 gives the byte after the end of the last page to be discarded */
+    ram_addr_t end_offset = (end+1) << TARGET_PAGE_BITS;
+    uint8_t *host_startaddr = rb->host + (start << TARGET_PAGE_BITS);
+    uint8_t *host_endaddr;
+
+    if (end_offset <= rb->used_length) {
+        host_endaddr   = rb->host + (end_offset-1);
+        return postcopy_ram_discard_range(mis, host_startaddr, host_endaddr);
+    } else {
+        error_report("ram_discard_range: Overrun block '%s' (%" PRIu64
+                     "/%" PRIu64 "/%zu)",
+                     block_name, start, end, rb->used_length);
+        return -1;
+    }
+}
+
 static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMBlock *block;
@@ -854,7 +1118,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 
         acct_clear();
     }
-
     qemu_mutex_lock_iothread();
     qemu_mutex_lock_ramlist();
     bytes_transferred = 0;
@@ -864,6 +1127,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     migration_bitmap = bitmap_new(ram_bitmap_pages);
     bitmap_set(migration_bitmap, 0, ram_bitmap_pages);
 
+    if (migrate_postcopy_ram()) {
+        MigrationState *s = migrate_get_current();
+        s->sentmap = bitmap_new(ram_bitmap_pages);
+        bitmap_clear(s->sentmap, 0, ram_bitmap_pages);
+    }
+
     /*
      * Count the total number of pages used by ram blocks not including any
      * gaps due to alignment or unplugs.
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 86200b9..e749f4c 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -125,6 +125,13 @@ struct MigrationState
 
     /* Flag set once the migration has been asked to enter postcopy */
     bool start_postcopy;
+
+    /* bitmap of pages that have been sent at least once
+     * only maintained and used in postcopy at the moment
+     * where it's used to send the dirtymap at the start
+     * of the postcopy phase
+     */
+    unsigned long *sentmap;
 };
 
 void process_incoming_migration(QEMUFile *f);
@@ -194,6 +201,11 @@ double xbzrle_mig_cache_miss_rate(void);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
+/* For outgoing discard bitmap */
+int ram_postcopy_send_discard_bitmap(MigrationState *ms);
+/* For incoming postcopy discard */
+int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
+                      uint64_t start, uint64_t end);
 
 /**
  * @migrate_add_blocker - prevent migration from proceeding
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index e93ee8a..1fec1c1 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -28,4 +28,38 @@ void postcopy_pmi_destroy(MigrationIncomingState *mis);
 void postcopy_pmi_discard_range(MigrationIncomingState *mis,
                                 size_t start, size_t npages);
 void postcopy_pmi_dump(MigrationIncomingState *mis);
+
+/*
+ * Discard the contents of memory start..end inclusive.
+ * We can assume that if we've been called, postcopy_ram_supported_by_host
+ * returned true.
+ */
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               uint8_t *end);
+
+
+/*
+ * Called at the start of each RAMBlock by the bitmap code
+ * offset is the bit within the first 32bit chunk of mask
+ * that represents the first page of the RAM Block
+ * Returns a new PDS
+ */
+PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
+                                                 uint8_t offset,
+                                                 const char *name);
+
+/*
+ * Called by the bitmap code for each chunk to discard
+ * May send a discard message, may just leave it queued to
+ * be sent later
+ */
+void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
+                                unsigned long pos, uint32_t bitmap);
+
+/*
+ * Called at the end of each RAMBlock by the bitmap code
+ * Sends any outstanding discard messages, frees the PDS
+ */
+void postcopy_discard_send_finish(MigrationState *ms,
+                                  PostcopyDiscardState *pds);
+
 #endif
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 924eeb6..0651275 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
 typedef struct PCIHostState PCIHostState;
 typedef struct PCMCIACardState PCMCIACardState;
 typedef struct PixelFormat PixelFormat;
+typedef struct PostcopyDiscardState PostcopyDiscardState;
 typedef struct PostcopyPMI PostcopyPMI;
 typedef struct PropertyInfo PropertyInfo;
 typedef struct Property Property;
diff --git a/migration/migration.c b/migration/migration.c
index 6b20b56..850fe1a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -22,6 +22,7 @@
 #include "block/block.h"
 #include "qemu/sockets.h"
 #include "migration/block.h"
+#include "migration/postcopy-ram.h"
 #include "qemu/thread.h"
 #include "qmp-commands.h"
 #include "trace.h"
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4f29055..391e9c6 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -28,6 +28,19 @@
 #include "qemu/error-report.h"
 #include "trace.h"
 
+#define MAX_DISCARDS_PER_COMMAND 12
+
+struct PostcopyDiscardState {
+    const char *name;
+    uint16_t cur_entry;
+    uint64_t addrlist[MAX_DISCARDS_PER_COMMAND];
+    uint32_t masklist[MAX_DISCARDS_PER_COMMAND];
+    uint8_t  offset;  /* Offset within 32bit mask at addr0 representing 1st
+                         page of block */
+    unsigned int nsentwords;
+    unsigned int nsentcmds;
+};
+
 /* Postcopy needs to detect accesses to pages that haven't yet been copied
  * across, and efficiently map new pages in, the techniques for doing this
  * are target OS specific.
@@ -364,6 +377,21 @@ out:
     return ret;
 }
 
+/*
+ * Discard the contents of memory start..end inclusive.
+ * We can assume that if we've been called, postcopy_ram_supported_by_host
+ * returned true.
+ */
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               uint8_t *end)
+{
+    if (madvise(start, (end-start)+1, MADV_DONTNEED)) {
+        perror("postcopy_ram_discard_range MADV_DONTNEED");
+        return -1;
+    }
+
+    return 0;
+}
+
 #else
 /* No target OS support, stubs just fail */
 
@@ -380,5 +408,88 @@ void postcopy_hook_early_receive(MigrationIncomingState *mis,
     /* We don't support postcopy so don't care */
 }
 
+void postcopy_pmi_destroy(MigrationIncomingState *mis)
+{
+    /* Called in normal cleanup path - so it's OK */
+}
+
+void postcopy_pmi_discard_range(MigrationIncomingState *mis,
+                                size_t start, size_t npages)
+{
+    assert(0);
+}
+
+int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
+                               uint8_t *end)
+{
+    assert(0);
+}
 #endif
 
+/* ------------------------------------------------------------------------- */
+
+/*
+ * Called at the start of each RAMBlock by the bitmap code
+ * offset is the bit within the first 32bit chunk of mask
+ * that represents the first page of the RAM Block
+ * Returns a new PDS
+ */
+PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
+                                                 uint8_t offset,
+                                                 const char *name)
+{
+    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));
+
+    if (res) {
+        res->name = name;
+        res->cur_entry = 0;
+        res->nsentwords = 0;
+        res->nsentcmds = 0;
+        res->offset = offset;
+    }
+
+    return res;
+}
+
+/*
+ * Called by the bitmap code for each chunk to discard
+ * May send a discard message, may just leave it queued to
+ * be sent later
+ */
+void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
+                                unsigned long pos, uint32_t bitmap)
+{
+    pds->addrlist[pds->cur_entry] = pos;
+    pds->masklist[pds->cur_entry] = bitmap;
+    pds->cur_entry++;
+    pds->nsentwords++;
+
+    if (pds->cur_entry == MAX_DISCARDS_PER_COMMAND) {
+        /* Full set, ship it! */
+        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
+                                              pds->cur_entry, pds->offset,
+                                              pds->addrlist, pds->masklist);
+        pds->nsentcmds++;
+        pds->cur_entry = 0;
+    }
+}
+
+/*
+ * Called at the end of each RAMBlock by the bitmap code
+ * Sends any outstanding discard messages, frees the PDS
+ */
+void postcopy_discard_send_finish(MigrationState *ms, PostcopyDiscardState *pds)
+{
+    /* Anything unsent? */
+    if (pds->cur_entry) {
+        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
+                                              pds->cur_entry, pds->offset,
+                                              pds->addrlist, pds->masklist);
+        pds->nsentcmds++;
+    }
+
+    trace_postcopy_discard_send_finish(pds->name, pds->nsentwords,
+                                       pds->nsentcmds);
+
+    g_free(pds);
+}
diff --git a/savevm.c b/savevm.c
index 1e8d289..2589b8c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1282,15 +1282,12 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
              * we know there must be at least 1 bit set due to the loop entry
              * If there is no 0 firstzero will be 32
              */
-            /* TODO - ram_discard_range gets added in a later patch
             int ret = ram_discard_range(mis, ramid,
                                 startaddr + firstset - first_bit_offset,
                                 startaddr + (firstzero - 1) - first_bit_offset);
-            ret = -1;
             if (ret) {
                 return ret;
             }
-            */
 
             /* mask= .?0000000000 */
             /*         ^fz ^fs    */
diff --git a/trace-events b/trace-events
index a555b56..f985117 100644
--- a/trace-events
+++ b/trace-events
@@ -1217,6 +1217,7 @@ qemu_file_fclose(void) ""
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
 migration_throttle(void) ""
+ram_postcopy_send_discard_bitmap(void) ""
 
 # hw/display/qxl.c
 disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
@@ -1478,6 +1479,9 @@ rdma_start_incoming_migration_after_rdma_listen(void) ""
 rdma_start_outgoing_migration_after_rdma_connect(void) ""
 rdma_start_outgoing_migration_after_rdma_source_init(void) ""
 
+# migration/postcopy-ram.c
+postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
+
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
 kvm_vm_ioctl(int type, void *arg) "type 0x%x, arg %p"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (26 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-23  3:41   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault Dr. David Alan Gilbert (git)
                   ` (16 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                      |  11 ++++
 include/migration/migration.h    |   3 +
 include/migration/postcopy-ram.h |  12 ++++
 migration/migration.c            |   1 +
 migration/postcopy-ram.c         | 119 ++++++++++++++++++++++++++++++++++++++-
 savevm.c                         |   4 ++
 trace-events                     |   2 +
 7 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index 21e7ebe..d2c4457 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -1363,6 +1363,17 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
     }
 }
 
+/*
+ * Allocate data structures etc needed by incoming migration with postcopy-ram;
+ * postcopy-ram's similarly named postcopy_ram_incoming_init does the work
+ */
+int ram_postcopy_incoming_init(MigrationIncomingState *mis)
+{
+    size_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
+
+    return postcopy_ram_incoming_init(mis, ram_pages);
+}
+
 static int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     int flags = 0, ret = 0;
diff --git a/include/migration/migration.h b/include/migration/migration.h
index e749f4c..d09561e 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -84,6 +84,8 @@ struct MigrationIncomingState {
 
     PostcopyState postcopy_state;
 
+    /* For the kernel to send us notifications */
+    int            userfault_fd;
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
     PostcopyPMI    postcopy_pmi;
@@ -206,6 +208,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms);
 /* For incoming postcopy discard */
 int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
                       uint64_t start, uint64_t end);
+int ram_postcopy_incoming_init(MigrationIncomingState *mis);
 
 /**
  * @migrate_add_blocker - prevent migration from proceeding
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index 1fec1c1..305c26b 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -19,6 +19,18 @@
 bool postcopy_ram_supported_by_host(void);
 
 /*
+ * Initialise postcopy-ram, setting the RAM to a state where we can go into
+ * postcopy later; must be called prior to any precopy.
+ * called from arch_init's similarly named ram_postcopy_incoming_init
+ */
+int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages);
+
+/*
+ * At the end of a migration where postcopy_ram_incoming_init was called.
+ */
+int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis);
+
+/*
  * In 'advise' mode record that a page has been received.
  */
 void postcopy_hook_early_receive(MigrationIncomingState *mis,
diff --git a/migration/migration.c b/migration/migration.c
index 850fe1a..b1ad7b1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -88,6 +88,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
 
 void migration_incoming_state_destroy(void)
 {
+    postcopy_pmi_destroy(mis_current);
     loadvm_free_handlers(mis_current);
     g_free(mis_current);
     mis_current = NULL;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 391e9c6..dbe1892 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -74,7 +74,6 @@ struct PostcopyDiscardState {
 /* the postcopy phase, there is no such guarantee during precopy.  We     */
 /* could boil this down to only holding one bit per-host page, but we lose*/
 /* sanity checking that we really do get whole host-pages from the source.*/
-__attribute__ (( unused )) /* Until later in patch series */
 static void postcopy_pmi_init(MigrationIncomingState *mis, size_t ram_pages)
 {
     unsigned int tpb = qemu_target_page_bits();
@@ -392,6 +391,113 @@ int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
     return 0;
 }
 
+/*
+ * Setup an area of RAM so that it *can* be used for postcopy later; this
+ * must be done right at the start prior to pre-copy.
+ * opaque should be the MIS.
+ */
+static int init_area(const char *block_name, void *host_addr,
+                     ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    MigrationIncomingState *mis = opaque;
+
+    trace_postcopy_init_area(block_name, host_addr, offset, length);
+
+    /*
+     * We need the whole of RAM to be truly empty for postcopy, so things
+     * like ROMs and any data tables built during init must be zero'd
+     * - we're going to get the copy from the source anyway.
+     * (Precopy will just overwrite this data, so doesn't need the discard)
+     */
+    if (postcopy_ram_discard_range(mis, host_addr, (host_addr + length - 1))) {
+        return -1;
+    }
+
+    /*
+     * We also need the area to be normal 4k pages, not huge pages
+     * (otherwise we can't be sure we can atomically place the
+     * 4k page in later).  THP might come along and map a 2MB page
+     * and when it's partially accessed in precopy it might not break
+     * it down, but leave a 2MB zero'd page.
+     */
+#ifdef MADV_NOHUGEPAGE
+    if (madvise(host_addr, length, MADV_NOHUGEPAGE)) {
+        perror("init_area: NOHUGEPAGE");
+        return -1;
+    }
+#endif
+
+    return 0;
+}
+
+/*
+ * At the end of migration, undo the effects of init_area
+ * opaque should be the MIS.
+ */
+static int cleanup_area(const char *block_name, void *host_addr,
+                        ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    MigrationIncomingState *mis = opaque;
+    struct uffdio_range range_struct;
+    trace_postcopy_cleanup_area(block_name, host_addr, offset, length);
+
+    /*
+     * We turned off hugepage for the precopy stage with postcopy enabled
+     * we can turn it back on now.
+     */
+#ifdef MADV_HUGEPAGE
+    if (madvise(host_addr, length, MADV_HUGEPAGE)) {
+        perror("cleanup_area: HUGEPAGE");
+        return -1;
+    }
+#endif
+
+    /*
+     * We can also turn off userfault now since we should have all the
+     * pages.   It can be useful to leave it on to debug postcopy
+     * if you're not sure it's always getting every page.
+     */
+    range_struct.start = (uint64_t)(uintptr_t)host_addr;
+    range_struct.len = (uint64_t)length;
+
+    if (ioctl(mis->userfault_fd, UFFDIO_UNREGISTER, &range_struct)) {
+        perror("cleanup_area: userfault unregister");
+
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Initialise postcopy-ram, setting the RAM to a state where we can go into
+ * postcopy later; must be called prior to any precopy.
+ * called from arch_init's similarly named ram_postcopy_incoming_init
+ */
+int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
+{
+    postcopy_pmi_init(mis, ram_pages);
+
+    if (qemu_ram_foreach_block(init_area, mis)) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * At the end of a migration where postcopy_ram_incoming_init was called.
+ */
+int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
+{
+    /* TODO: Join the fault thread once we're sure it will exit */
+    if (qemu_ram_foreach_block(cleanup_area, mis)) {
+        return -1;
+    }
+
+    return 0;
+}
+
 #else
 /* No target OS support, stubs just fail */
 
@@ -408,6 +514,17 @@ void postcopy_hook_early_receive(MigrationIncomingState *mis,
     /* We don't support postcopy so don't care */
 }
 
+int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
+{
+    error_report("postcopy_ram_incoming_init: No OS support");
+    return -1;
+}
+
+int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
+{
+    assert(0);
+}
+
 void postcopy_pmi_destroy(MigrationIncomingState *mis)
 {
     /* Called in normal cleanup path - so it's OK */
diff --git a/savevm.c b/savevm.c
index 2589b8c..6857660 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1186,6 +1186,10 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
         return -1;
     }
 
+    if (ram_postcopy_incoming_init(mis)) {
+        return -1;
+    }
+
     postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
 
     return 0;
diff --git a/trace-events b/trace-events
index f985117..59dea4c 100644
--- a/trace-events
+++ b/trace-events
@@ -1481,6 +1481,8 @@ rdma_start_outgoing_migration_after_rdma_source_init(void) ""
 
 # migration/postcopy-ram.c
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
+postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
+postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (27 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-23  3:45   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread Dr. David Alan Gilbert (git)
                   ` (15 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Mark the area of RAM as 'userfault'
Start up a fault-thread to handle any userfaults we might receive
from it (to be filled in later)
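
The fault thread added here is only a stub.  As a hedged sketch of what it
will eventually do (written against the userfaultfd read() API as later
merged upstream - the userfault16 kernel branch this series targets may
differ), the loop boils down to reading fault events and asking the source
for the missing page:

    /* Sketch only, not the code in this series */
    struct uffd_msg msg;
    if (read(mis->userfault_fd, &msg, sizeof(msg)) == sizeof(msg) &&
        msg.event == UFFD_EVENT_PAGEFAULT) {
        uint64_t fault_addr = msg.arg.pagefault.address;
        /* ...request the RAM page containing fault_addr from the source... */
    }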

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h    |  3 ++
 include/migration/postcopy-ram.h |  6 ++++
 migration/postcopy-ram.c         | 69 +++++++++++++++++++++++++++++++++++++++-
 savevm.c                         |  9 ++++++
 4 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index d09561e..821d561 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -84,6 +84,9 @@ struct MigrationIncomingState {
 
     PostcopyState postcopy_state;
 
+    QemuThread     fault_thread;
+    QemuSemaphore  fault_thread_sem;
+
     /* For the kernel to send us notifications */
     int            userfault_fd;
     QEMUFile *return_path;
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index 305c26b..fbb2a93 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -19,6 +19,12 @@
 bool postcopy_ram_supported_by_host(void);
 
 /*
+ * Make all of RAM sensitive to accesses to areas that haven't yet been written
+ * and wire up anything necessary to deal with it.
+ */
+int postcopy_ram_enable_notify(MigrationIncomingState *mis);
+
+/*
  * Initialise postcopy-ram, setting the RAM to a state where we can go into
  * postcopy later; must be called prior to any precopy.
  * called from arch_init's similarly named ram_postcopy_incoming_init
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index dbe1892..33dd332 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -498,9 +498,71 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     return 0;
 }
 
+/*
+ * Mark the given area of RAM as requiring notification of accesses to
+ * unwritten areas.
+ * Used as a callback on qemu_ram_foreach_block.
+ *   host_addr: Base of area to mark
+ *   offset: Offset in the whole ram arena
+ *   length: Length of the section
+ *   opaque: MigrationIncomingState pointer
+ * Returns 0 on success
+ */
+static int ram_block_enable_notify(const char *block_name, void *host_addr,
+                                   ram_addr_t offset, ram_addr_t length,
+                                   void *opaque)
+{
+    MigrationIncomingState *mis = opaque;
+    struct uffdio_register reg_struct;
+
+    reg_struct.range.start = (uint64_t)(uintptr_t)host_addr;
+    reg_struct.range.len = (uint64_t)length;
+    reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+
+    /* Now tell our userfault_fd that it's responsible for this area */
+    if (ioctl(mis->userfault_fd, UFFDIO_REGISTER, &reg_struct)) {
+        perror("ram_block_enable_notify userfault register");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Handle faults detected by the USERFAULT markings
+ */
+static void *postcopy_ram_fault_thread(void *opaque)
+{
+    MigrationIncomingState *mis = (MigrationIncomingState *)opaque;
+
+    fprintf(stderr, "postcopy_ram_fault_thread\n");
+    /* TODO: In later patch */
+    qemu_sem_post(&mis->fault_thread_sem);
+    while (1) {
+        /* TODO: In later patch */
+    }
+
+    return NULL;
+}
+
+int postcopy_ram_enable_notify(MigrationIncomingState *mis)
+{
+    /* Create the fault handler thread and wait for it to be ready */
+    qemu_sem_init(&mis->fault_thread_sem, 0);
+    qemu_thread_create(&mis->fault_thread, "postcopy/fault",
+                       postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
+    qemu_sem_wait(&mis->fault_thread_sem);
+    qemu_sem_destroy(&mis->fault_thread_sem);
+
+    /* Mark so that we get notified of accesses to unwritten areas */
+    if (qemu_ram_foreach_block(ram_block_enable_notify, mis)) {
+        return -1;
+    }
+
+    return 0;
+}
+
 #else
 /* No target OS support, stubs just fail */
-
 bool postcopy_ram_supported_by_host(void)
 {
     error_report("%s: No OS support", __func__);
@@ -541,6 +603,11 @@ int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
 {
     assert(0);
 }
+
+int postcopy_ram_enable_notify(MigrationIncomingState *mis)
+{
+    assert(0);
+}
 #endif
 
 /* ------------------------------------------------------------------------- */
diff --git a/savevm.c b/savevm.c
index 6857660..014ba08 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1317,6 +1317,15 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         return -1;
     }
 
+    /*
+     * Sensitise RAM - can now generate requests for blocks that don't exist.
+     * However, at this point the CPU shouldn't be running, and the IO
+     * shouldn't be doing anything yet, so we don't actually expect requests
+     */
+    if (postcopy_ram_enable_notify(mis)) {
+        return -1;
+    }
+
     /* TODO start up the postcopy listening thread */
     return 0;
 }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (28 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-23  4:20   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 31/45] Postcopy end in migration_thread Dr. David Alan Gilbert (git)
                   ` (14 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Rework the migration thread to set up and start postcopy.
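
In outline (a simplified sketch of the check added to the main loop in the
diff below), the switch-over happens once only postcopiable state remains
significant and the user has asked for it:

    /* Sketch of the trigger in migration_thread(); see the full patch */
    if (pending_size && pending_size >= max_size) {
        if (migrate_postcopy_ram() &&
            s->state != MIG_STATE_POSTCOPY_ACTIVE &&
            pend_nonpost <= max_size &&
            atomic_read(&s->start_postcopy)) {
            postcopy_start(s, &old_vm_running);    /* switch to postcopy */
        } else {
            qemu_savevm_state_iterate(s->file);    /* keep precopying */
        }
    }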

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |   3 +
 migration/migration.c         | 161 ++++++++++++++++++++++++++++++++++++++++--
 trace-events                  |   4 ++
 3 files changed, 164 insertions(+), 4 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 821d561..2c607e7 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -131,6 +131,9 @@ struct MigrationState
     /* Flag set once the migration has been asked to enter postcopy */
     bool start_postcopy;
 
+    /* Flag set once the migration thread is running (and needs joining) */
+    bool started_migration_thread;
+
     /* bitmap of pages that have been sent at least once
      * only maintained and used in postcopy at the moment
      * where it's used to send the dirtymap at the start
diff --git a/migration/migration.c b/migration/migration.c
index b1ad7b1..6bf9c8d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -468,7 +468,10 @@ static void migrate_fd_cleanup(void *opaque)
     if (s->file) {
         trace_migrate_fd_cleanup();
         qemu_mutex_unlock_iothread();
-        qemu_thread_join(&s->thread);
+        if (s->started_migration_thread) {
+            qemu_thread_join(&s->thread);
+            s->started_migration_thread = false;
+        }
         qemu_mutex_lock_iothread();
 
         qemu_fclose(s->file);
@@ -874,7 +877,6 @@ out:
     return NULL;
 }
 
-__attribute__ (( unused )) /* Until later in patch series */
 static int open_outgoing_return_path(MigrationState *ms)
 {
 
@@ -911,23 +913,141 @@ static void await_outgoing_return_path_close(MigrationState *ms)
 }
 
 /*
+ * Switch from normal iteration to postcopy
+ * Returns non-0 on error
+ */
+static int postcopy_start(MigrationState *ms, bool *old_vm_running)
+{
+    int ret;
+    const QEMUSizedBuffer *qsb;
+    int64_t time_at_stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    migrate_set_state(ms, MIG_STATE_ACTIVE, MIG_STATE_POSTCOPY_ACTIVE);
+
+    trace_postcopy_start();
+    qemu_mutex_lock_iothread();
+    trace_postcopy_start_set_run();
+
+    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
+    *old_vm_running = runstate_is_running();
+
+    ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
+
+    if (ret < 0) {
+        goto fail;
+    }
+
+    /*
+     * in Finish migrate and with the io-lock held everything should
+     * be quiet, but we've potentially still got dirty pages and we
+     * need to tell the destination to throw any pages it's already received
+     * that are dirty
+     */
+    if (ram_postcopy_send_discard_bitmap(ms)) {
+        error_report("postcopy send discard bitmap failed");
+        goto fail;
+    }
+
+    /*
+     * Send the rest of the state - note that things that are doing postcopy
+     * will notice we're in MIG_STATE_POSTCOPY_ACTIVE and won't actually
+     * wrap their state up here
+     */
+    qemu_file_set_rate_limit(ms->file, INT64_MAX);
+    /* Ping just for debugging, helps line traces up */
+    qemu_savevm_send_ping(ms->file, 2);
+
+    /*
+     * We need to leave the fd free for page transfers during the
+     * loading of the device state, so wrap all the remaining
+     * commands and state into a package that gets sent in one go
+     */
+    QEMUFile *fb = qemu_bufopen("w", NULL);
+    if (!fb) {
+        error_report("Failed to create buffered file");
+        goto fail;
+    }
+
+    qemu_savevm_state_complete(fb);
+    qemu_savevm_send_ping(fb, 3);
+
+    qemu_savevm_send_postcopy_run(fb);
+
+    /* <><> end of stuff going into the package */
+    qsb = qemu_buf_get(fb);
+
+    /* Now send that blob */
+    if (qsb_get_length(qsb) > MAX_VM_CMD_PACKAGED_SIZE) {
+        error_report("postcopy_start: Unreasonably large packaged state: %lu",
+                     (unsigned long)(qsb_get_length(qsb)));
+        goto fail_closefb;
+    }
+    qemu_savevm_send_packaged(ms->file, qsb);
+    qemu_fclose(fb);
+    ms->downtime =  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - time_at_stop;
+
+    qemu_mutex_unlock_iothread();
+
+    /*
+     * Although this ping is just for debug, it could potentially be
+     * used for getting a better measurement of downtime at the source.
+     */
+    qemu_savevm_send_ping(ms->file, 4);
+
+    ret = qemu_file_get_error(ms->file);
+    if (ret) {
+        error_report("postcopy_start: Migration stream errored");
+        migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
+    }
+
+    return ret;
+
+fail_closefb:
+    qemu_fclose(fb);
+fail:
+    migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
+    qemu_mutex_unlock_iothread();
+    return -1;
+}
+
+/*
  * Master migration thread on the source VM.
  * It drives the migration and pumps the data down the outgoing channel.
  */
 static void *migration_thread(void *opaque)
 {
     MigrationState *s = opaque;
+    /* Used by the bandwidth calcs, updated later */
     int64_t initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
     int64_t initial_bytes = 0;
     int64_t max_size = 0;
     int64_t start_time = initial_time;
     bool old_vm_running = false;
+    bool entered_postcopy = false;
+    /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */
+    enum MigrationPhase current_active_type = MIG_STATE_ACTIVE;
 
     qemu_savevm_state_header(s->file);
+
+    if (migrate_postcopy_ram()) {
+        /* Now tell the dest that it should open its end so it can reply */
+        qemu_savevm_send_open_return_path(s->file);
+
+        /* And do a ping that will make stuff easier to debug */
+        qemu_savevm_send_ping(s->file, 1);
+
+        /*
+         * Tell the destination that we *might* want to do postcopy later;
+         * if the other end can't do postcopy it should fail now, nice and
+         * early.
+         */
+        qemu_savevm_send_postcopy_advise(s->file);
+    }
+
     qemu_savevm_state_begin(s->file, &s->params);
 
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+    current_active_type = MIG_STATE_ACTIVE;
     migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
 
     trace_migration_thread_setup_complete();
@@ -946,6 +1066,22 @@ static void *migration_thread(void *opaque)
             trace_migrate_pending(pending_size, max_size,
                                   pend_post, pend_nonpost);
             if (pending_size && pending_size >= max_size) {
+                /* Still a significant amount to transfer */
+
+                current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+                if (migrate_postcopy_ram() &&
+                    s->state != MIG_STATE_POSTCOPY_ACTIVE &&
+                    pend_nonpost <= max_size &&
+                    atomic_read(&s->start_postcopy)) {
+
+                    if (!postcopy_start(s, &old_vm_running)) {
+                        current_active_type = MIG_STATE_POSTCOPY_ACTIVE;
+                        entered_postcopy = true;
+                    }
+
+                    continue;
+                }
+                /* Just another iteration step */
                 qemu_savevm_state_iterate(s->file);
             } else {
                 int ret;
@@ -975,7 +1111,8 @@ static void *migration_thread(void *opaque)
         }
 
         if (qemu_file_get_error(s->file)) {
-            migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR);
+            migrate_set_state(s, current_active_type, MIG_STATE_ERROR);
+            trace_migration_thread_file_err();
             break;
         }
         current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
@@ -1006,12 +1143,15 @@ static void *migration_thread(void *opaque)
         }
     }
 
+    trace_migration_thread_after_loop();
     qemu_mutex_lock_iothread();
     if (s->state == MIG_STATE_COMPLETED) {
         int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
         uint64_t transferred_bytes = qemu_ftell(s->file);
         s->total_time = end_time - s->total_time;
-        s->downtime = end_time - start_time;
+        if (!entered_postcopy) {
+            s->downtime = end_time - start_time;
+        }
         if (s->total_time) {
             s->mbps = (((double) transferred_bytes * 8.0) /
                        ((double) s->total_time)) / 1000;
@@ -1043,8 +1183,21 @@ void migrate_fd_connect(MigrationState *s)
     /* Notify before starting migration thread */
     notifier_list_notify(&migration_state_notifiers, s);
 
+    /* Open the return path; currently for postcopy but other things might
+     * also want it.
+     */
+    if (migrate_postcopy_ram()) {
+        if (open_outgoing_return_path(s)) {
+            error_report("Unable to open return-path for postcopy");
+            migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ERROR);
+            migrate_fd_cleanup(s);
+            return;
+        }
+    }
+
     qemu_thread_create(&s->thread, "migration", migration_thread, s,
                        QEMU_THREAD_JOINABLE);
+    s->started_migration_thread = true;
 }
 
 PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
diff --git a/trace-events b/trace-events
index 59dea4c..ed8bbe2 100644
--- a/trace-events
+++ b/trace-events
@@ -1404,9 +1404,13 @@ migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
+migration_thread_after_loop(void) ""
+migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""
 open_outgoing_return_path(void) ""
 open_outgoing_return_path_continue(void) ""
+postcopy_start(void) ""
+postcopy_start_set_run(void) ""
 source_return_path_thread_bad_end(void) ""
 source_return_path_bad_header_com(void) ""
 source_return_path_thread_end(void) ""
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 31/45] Postcopy end in migration_thread
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (29 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command Dr. David Alan Gilbert (git)
                   ` (13 subsequent siblings)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The end of migration in postcopy is a bit different since some of
the things normally done at the end of migration have already been
done on the transition to postcopy.

The end of migration code is getting a bit complicated now, so move
it out into its own function.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c | 85 +++++++++++++++++++++++++++++++++++++--------------
 trace-events          |  6 ++++
 2 files changed, 68 insertions(+), 23 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 6bf9c8d..bd066f6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -894,7 +894,6 @@ static int open_outgoing_return_path(MigrationState *ms)
     return 0;
 }
 
-__attribute__ (( unused )) /* Until later in patch series */
 static void await_outgoing_return_path_close(MigrationState *ms)
 {
     /*
@@ -1010,6 +1009,64 @@ fail:
 }
 
 /*
+ * Used by migration_thread when there's not much left pending.
+ * The caller 'breaks' the loop when this returns.
+ */
+static void migration_thread_end_of_iteration(MigrationState *s,
+                                              int current_active_state,
+                                              bool *old_vm_running,
+                                              int64_t *start_time)
+{
+    int ret;
+    if (s->state == MIG_STATE_ACTIVE) {
+        qemu_mutex_lock_iothread();
+        *start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
+        *old_vm_running = runstate_is_running();
+
+        ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
+        if (ret >= 0) {
+            qemu_file_set_rate_limit(s->file, INT64_MAX);
+            qemu_savevm_state_complete(s->file);
+        }
+        qemu_mutex_unlock_iothread();
+
+        if (ret < 0) {
+            goto fail;
+        }
+    } else if (s->state == MIG_STATE_POSTCOPY_ACTIVE) {
+        trace_migration_thread_end_of_iteration_postcopy_end();
+
+        qemu_savevm_state_postcopy_complete(s->file);
+        trace_migration_thread_end_of_iteration_postcopy_end_after_complete();
+    }
+
+    /*
+     * If rp was opened we must clean up the thread before
+     * cleaning everything else up (since if there are no failures
+     * it will wait for the destination to send its status in
+     * a SHUT command).
+     * Postcopy opens rp if enabled (even if it's not activated)
+     */
+    if (migrate_postcopy_ram()) {
+        trace_migration_thread_end_of_iteration_postcopy_end_before_rp();
+        await_outgoing_return_path_close(s);
+        trace_migration_thread_end_of_iteration_postcopy_end_after_rp();
+    }
+
+    if (qemu_file_get_error(s->file)) {
+        trace_migration_thread_end_of_iteration_file_err();
+        goto fail;
+    }
+
+    migrate_set_state(s, current_active_state, MIG_STATE_COMPLETED);
+    return;
+
+fail:
+    migrate_set_state(s, current_active_state, MIG_STATE_ERROR);
+}
+
+/*
  * Master migration thread on the source VM.
  * It drives the migration and pumps the data down the outgoing channel.
  */
@@ -1084,29 +1141,11 @@ static void *migration_thread(void *opaque)
                 /* Just another iteration step */
                 qemu_savevm_state_iterate(s->file);
             } else {
-                int ret;
-
-                qemu_mutex_lock_iothread();
-                start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-                qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
-                old_vm_running = runstate_is_running();
-
-                ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
-                if (ret >= 0) {
-                    qemu_file_set_rate_limit(s->file, INT64_MAX);
-                    qemu_savevm_state_complete(s->file);
-                }
-                qemu_mutex_unlock_iothread();
+                trace_migration_thread_low_pending(pending_size);
 
-                if (ret < 0) {
-                    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR);
-                    break;
-                }
-
-                if (!qemu_file_get_error(s->file)) {
-                    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
-                    break;
-                }
+                migration_thread_end_of_iteration(s, current_active_type,
+                    &old_vm_running, &start_time);
+                break;
             }
         }
 
diff --git a/trace-events b/trace-events
index ed8bbe2..bcbdef8 100644
--- a/trace-events
+++ b/trace-events
@@ -1407,6 +1407,12 @@ migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
 migration_thread_after_loop(void) ""
 migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""
+migration_thread_low_pending(uint64_t pending) "%" PRIu64
+migration_thread_end_of_iteration_file_err(void) ""
+migration_thread_end_of_iteration_postcopy_end(void) ""
+migration_thread_end_of_iteration_postcopy_end_after_complete(void) ""
+migration_thread_end_of_iteration_postcopy_end_before_rp(void) ""
+migration_thread_end_of_iteration_postcopy_end_after_rp(void) ""
 open_outgoing_return_path(void) ""
 open_outgoing_return_path_continue(void) ""
 postcopy_start(void) ""
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (30 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 31/45] Postcopy end in migration_thread Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-23  5:00   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request Dr. David Alan Gilbert (git)
                   ` (12 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add MIG_RP_CMD_REQ_PAGES command on Return path for the postcopy
destination to request a page from the source.
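
The command body is just start and len as big-endian 64-bit values, with the
low bit of len borrowed as a "RAMBlock name follows" flag (len must be a
multiple of the page size, so that bit is otherwise zero); an optional name is
sent as a one-byte length plus the text.  A minimal standalone sketch of that
encoding, independent of the QEMU helpers in the patch (the function name here
is illustrative only):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Sketch: build a REQ_PAGES body into 'buf' (needs 16 + 1 + 255 bytes).
   * Returns the number of bytes to send. */
  static size_t encode_req_pages(uint8_t *buf, const char *rbname,
                                 uint64_t start, uint64_t len)
  {
      size_t msglen = 16;                       /* start + len */

      if (rbname) {
          size_t name_len = strlen(rbname);     /* must be < 256 */
          len |= 1;                             /* flag: name follows */
          buf[msglen++] = (uint8_t)name_len;
          memcpy(buf + msglen, rbname, name_len);
          msglen += name_len;
      }
      for (int i = 0; i < 8; i++) {             /* big-endian, by hand */
          buf[i]     = (uint8_t)(start >> (56 - 8 * i));
          buf[8 + i] = (uint8_t)(len   >> (56 - 8 * i));
      }
      return msglen;
  }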

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  4 +++
 migration/migration.c         | 70 +++++++++++++++++++++++++++++++++++++++++++
 trace-events                  |  1 +
 3 files changed, 75 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 2c607e7..2c15d63 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -46,6 +46,8 @@ enum mig_rpcomm_cmd {
     MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
     MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
     MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
+
+    MIG_RP_CMD_REQ_PAGES,    /* data (start: be64, len: be64) */
 };
 
 /* Postcopy page-map-incoming - data about each page on the inbound side */
@@ -253,6 +255,8 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
                           uint32_t value);
 void migrate_send_rp_pong(MigrationIncomingState *mis,
                           uint32_t value);
+void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* rbname,
+                              ram_addr_t start, ram_addr_t len);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
diff --git a/migration/migration.c b/migration/migration.c
index bd066f6..2e9d0dd 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -138,6 +138,36 @@ void migrate_send_rp_pong(MigrationIncomingState *mis,
     migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
 }
 
+/* Request a range of pages from the source VM at the given
+ * start address.
+ *   rbname: Name of the RAMBlock to request the page in, if NULL it's the same
+ *           as the last request (a name must have been given previously)
+ *   Start: Address offset within the RB
+ *   Len: Length in bytes required - must be a multiple of pagesize
+ */
+void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char *rbname,
+                               ram_addr_t start, ram_addr_t len)
+{
+    uint8_t bufc[16+1+255]; /* start (8 byte), len (8 byte), rbname up to 255 */
+    uint64_t *buf64 = (uint64_t *)bufc;
+    size_t msglen = 16; /* start + len */
+
+    assert(!(len & 1));
+    if (rbname) {
+        int rbname_len = strlen(rbname);
+        assert(rbname_len < 256);
+
+        len |= 1; /* Flag to say we've got a name */
+        bufc[msglen++] = rbname_len;
+        memcpy(bufc + msglen, rbname, rbname_len);
+        msglen += rbname_len;
+    }
+
+    buf64[0] = cpu_to_be64((uint64_t)start);
+    buf64[1] = cpu_to_be64((uint64_t)len);
+    migrate_send_rp_message(mis, MIG_RP_CMD_REQ_PAGES, msglen, bufc);
+}
+
 void qemu_start_incoming_migration(const char *uri, Error **errp)
 {
     const char *p;
@@ -789,6 +819,17 @@ static void source_return_path_bad(MigrationState *s)
 }
 
 /*
+ * Process a request for pages received on the return path.
+ * We're allowed to send more than requested (e.g. to round to our page size)
+ * and we don't need to send pages that have already been sent.
+ */
+static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
+                                       ram_addr_t start, ram_addr_t len)
+{
+    trace_migrate_handle_rp_req_pages(start, len);
+}
+
+/*
  * Handles messages sent on the return path towards the source VM
  *
  */
@@ -800,6 +841,8 @@ static void *source_return_path_thread(void *opaque)
     const int max_len = 512;
     uint8_t buf[max_len];
     uint32_t tmp32;
+    ram_addr_t start, len;
+    char *tmpstr;
     int res;
 
     trace_source_return_path_thread_entry();
@@ -815,6 +858,11 @@ static void *source_return_path_thread(void *opaque)
             expected_len = 4;
             break;
 
+        case MIG_RP_CMD_REQ_PAGES:
+            /* 16 byte start/len _possibly_ plus an id str */
+            expected_len = 16 + 256;
+            break;
+
         default:
             error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
                     header_com, header_len);
@@ -860,6 +908,28 @@ static void *source_return_path_thread(void *opaque)
             trace_source_return_path_thread_pong(tmp32);
             break;
 
+        case MIG_RP_CMD_REQ_PAGES:
+            start = be64_to_cpup((uint64_t *)buf);
+            len = be64_to_cpup(((uint64_t *)buf)+1);
+            tmpstr = NULL;
+            if (len & 1) {
+                len -= 1; /* Remove the flag */
+                /* Now we expect an idstr */
+                tmp32 = buf[16]; /* Length of the following idstr */
+                tmpstr = (char *)&buf[17];
+                buf[17+tmp32] = '\0';
+                expected_len = 16+1+tmp32;
+            } else {
+                expected_len = 16;
+            }
+            if (header_len != expected_len) {
+                error_report("RP: Req_Page with length %d expecting %d",
+                        header_len, expected_len);
+                source_return_path_bad(ms);
+            }
+            migrate_handle_rp_req_pages(ms, tmpstr, start, len);
+            break;
+
         default:
             /* This shouldn't happen because we should catch this above */
             trace_source_return_path_bad_header_com();
diff --git a/trace-events b/trace-events
index bcbdef8..9bedee4 100644
--- a/trace-events
+++ b/trace-events
@@ -1404,6 +1404,7 @@ migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
+migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
 migration_thread_after_loop(void) ""
 migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (31 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-24  1:53   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue Dr. David Alan Gilbert (git)
                   ` (11 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

On receiving MIG_RP_CMD_REQ_PAGES, look up the address and
queue the page.
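
The handler rounds the requested range out to whole host pages before queueing
it.  A small standalone version of that rounding, mirroring the hunk below:

  #include <stdint.h>
  #include <unistd.h>

  /* Expand [start, start+len) to cover whole host pages. */
  static void round_req_to_host_pages(uint64_t *start, uint64_t *len)
  {
      uint64_t hps = (uint64_t)getpagesize();

      if (*start & (hps - 1)) {
          uint64_t adj = *start & (hps - 1);
          *start -= adj;
          *len += adj;
      }
      if (*len & (hps - 1)) {
          *len = (*len & ~(hps - 1)) + hps;
      }
  }
  /* e.g. with 4K host pages, start=0x1800 len=0x400 -> start=0x1000 len=0x1000 */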

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c                   | 55 +++++++++++++++++++++++++++++++++++++++++++
 include/exec/cpu-all.h        |  2 --
 include/migration/migration.h | 21 +++++++++++++++++
 include/qemu/typedefs.h       |  1 +
 migration/migration.c         | 33 +++++++++++++++++++++++++-
 trace-events                  |  3 ++-
 6 files changed, 111 insertions(+), 4 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index d2c4457..9d8fc6b 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -669,6 +669,61 @@ static int ram_save_page(QEMUFile *f, RAMBlock* block, ram_addr_t offset,
 }
 
 /*
+ * Queue the pages for transmission, e.g. a request from postcopy destination
+ *   ms: MigrationState in which the queue is held
+ *   rbname: The RAMBlock the request is for - may be NULL (to mean reuse last)
+ *   start: Offset from the start of the RAMBlock
+ *   len: Length (in bytes) to send
+ *   Return: 0 on success
+ */
+int ram_save_queue_pages(MigrationState *ms, const char *rbname,
+                         ram_addr_t start, ram_addr_t len)
+{
+    RAMBlock *ramblock;
+
+    if (!rbname) {
+        /* Reuse last RAMBlock */
+        ramblock = ms->last_req_rb;
+
+        if (!ramblock) {
+            /*
+             * Shouldn't happen, we can't reuse the last RAMBlock if
+             * it's the 1st request.
+             */
+            error_report("ram_save_queue_pages no previous block");
+            return -1;
+        }
+    } else {
+        ramblock = ram_find_block(rbname);
+
+        if (!ramblock) {
+            /* We shouldn't be asked for a non-existent RAMBlock */
+            error_report("ram_save_queue_pages no block '%s'", rbname);
+            return -1;
+        }
+    }
+    trace_ram_save_queue_pages(ramblock->idstr, start, len);
+    if (start+len > ramblock->used_length) {
+        error_report("%s request overrun start=%zx len=%zx blocklen=%zx",
+                     __func__, start, len, ramblock->used_length);
+        return -1;
+    }
+
+    struct MigrationSrcPageRequest *new_entry =
+        g_malloc0(sizeof(struct MigrationSrcPageRequest));
+    new_entry->rb = ramblock;
+    new_entry->offset = start;
+    new_entry->len = len;
+    ms->last_req_rb = ramblock;
+
+    qemu_mutex_lock(&ms->src_page_req_mutex);
+    QSIMPLEQ_INSERT_TAIL(&ms->src_page_requests, new_entry, next_req);
+    qemu_mutex_unlock(&ms->src_page_req_mutex);
+
+    return 0;
+}
+
+/*
  * ram_find_and_save_block: Finds a page to send and sends it to f
  *
  * Returns:  The number of bytes written.
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 2c48286..3088000 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -265,8 +265,6 @@ CPUArchState *cpu_copy(CPUArchState *env);
 
 /* memory API */
 
-typedef struct RAMBlock RAMBlock;
-
 struct RAMBlock {
     struct MemoryRegion *mr;
     uint8_t *host;
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 2c15d63..b1c7cad 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -100,6 +100,18 @@ MigrationIncomingState *migration_incoming_get_current(void);
 MigrationIncomingState *migration_incoming_state_new(QEMUFile *f);
 void migration_incoming_state_destroy(void);
 
+/*
+ * An outstanding page request, on the source, having been received
+ * and queued
+ */
+struct MigrationSrcPageRequest {
+    RAMBlock *rb;
+    hwaddr    offset;
+    hwaddr    len;
+
+    QSIMPLEQ_ENTRY(MigrationSrcPageRequest) next_req;
+};
+
 struct MigrationState
 {
     int64_t bandwidth_limit;
@@ -142,6 +154,12 @@ struct MigrationState
      * of the postcopy phase
      */
     unsigned long *sentmap;
+
+    /* Queue of outstanding page requests from the destination */
+    QemuMutex src_page_req_mutex;
+    QSIMPLEQ_HEAD(src_page_requests, MigrationSrcPageRequest) src_page_requests;
+    /* The RAMBlock used in the last src_page_request */
+    RAMBlock *last_req_rb;
 };
 
 void process_incoming_migration(QEMUFile *f);
@@ -276,6 +294,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
                              ram_addr_t offset, size_t size,
                              int *bytes_sent);
 
+int ram_save_queue_pages(MigrationState *ms, const char *rbname,
+                         ram_addr_t start, ram_addr_t len);
+
 PostcopyState postcopy_state_get(MigrationIncomingState *mis);
 
 /* Set the state and return the old state */
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index 0651275..396044d 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -73,6 +73,7 @@ typedef struct QEMUSGList QEMUSGList;
 typedef struct QEMUSizedBuffer QEMUSizedBuffer;
 typedef struct QEMUTimerListGroup QEMUTimerListGroup;
 typedef struct QEMUTimer QEMUTimer;
+typedef struct RAMBlock RAMBlock;
 typedef struct Range Range;
 typedef struct SerialState SerialState;
 typedef struct SHPCDevice SHPCDevice;
diff --git a/migration/migration.c b/migration/migration.c
index 2e9d0dd..939f426 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -26,6 +26,8 @@
 #include "qemu/thread.h"
 #include "qmp-commands.h"
 #include "trace.h"
+#include "exec/memory.h"
+#include "exec/address-spaces.h"
 
 enum MigrationPhase {
     MIG_STATE_ERROR = -1,
@@ -495,6 +497,15 @@ static void migrate_fd_cleanup(void *opaque)
 
     migrate_fd_cleanup_src_rp(s);
 
+    /* This queue should generally be empty - but a failed migration
+     * might leave some entries behind.
+     */
+    struct MigrationSrcPageRequest *mspr, *next_mspr;
+    QSIMPLEQ_FOREACH_SAFE(mspr, &s->src_page_requests, next_req, next_mspr) {
+        QSIMPLEQ_REMOVE_HEAD(&s->src_page_requests, next_req);
+        g_free(mspr);
+    }
+
     if (s->file) {
         trace_migrate_fd_cleanup();
         qemu_mutex_unlock_iothread();
@@ -613,6 +624,9 @@ MigrationState *migrate_init(const MigrationParams *params)
     s->state = MIG_STATE_SETUP;
     trace_migrate_set_state(MIG_STATE_SETUP);
 
+    qemu_mutex_init(&s->src_page_req_mutex);
+    QSIMPLEQ_INIT(&s->src_page_requests);
+
     s->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     return s;
 }
@@ -826,7 +840,24 @@ static void source_return_path_bad(MigrationState *s)
 static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
                                        ram_addr_t start, ram_addr_t len)
 {
-    trace_migrate_handle_rp_req_pages(start, len);
+    trace_migrate_handle_rp_req_pages(rbname, start, len);
+
+    /* Round everything up to our host page size */
+    long our_host_ps = getpagesize();
+    if (start & (our_host_ps-1)) {
+        long roundings = start & (our_host_ps-1);
+        start -= roundings;
+        len += roundings;
+    }
+    if (len & (our_host_ps-1)) {
+        long roundings = len & (our_host_ps-1);
+        len -= roundings;
+        len += our_host_ps;
+    }
+
+    if (ram_save_queue_pages(ms, rbname, start, len)) {
+        source_return_path_bad(ms);
+    }
 }
 
 /*
diff --git a/trace-events b/trace-events
index 9bedee4..8a0d70d 100644
--- a/trace-events
+++ b/trace-events
@@ -1218,6 +1218,7 @@ migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
 migration_throttle(void) ""
 ram_postcopy_send_discard_bitmap(void) ""
+ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
 
 # hw/display/qxl.c
 disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
@@ -1404,7 +1405,7 @@ migrate_fd_error(void) ""
 migrate_fd_cancel(void) ""
 migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
 migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
-migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
+migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
 migration_thread_after_loop(void) ""
 migration_thread_file_err(void) ""
 migration_thread_setup_complete(void) ""
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (32 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-24  2:15   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers Dr. David Alan Gilbert (git)
                   ` (10 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When transmitting RAM pages, consume pages that have been queued by
MIG_RP_CMD_REQ_PAGES commands and send them ahead of normal page scanning.

Note:
  a) After a queued page the linear walk carries on from after the
unqueued page; there is a reasonable chance that the destination
was about to ask for other nearby pages anyway.

  b) We have to be careful of any assumptions that the page walking
code makes; in particular it takes some shortcuts on its first linear
walk that break as soon as we service a queued page.
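
One detail worth calling out from the hunk below: a queued page is only taken
when doing so can't split a host-page sized chunk, i.e. when the previously
sent target page was the last one in its host page.  A toy version of that
check, with illustrative page sizes:

  #include <stdbool.h>
  #include <stdint.h>

  #define TARGET_PAGE_SIZE 4096ULL     /* illustrative values only */
  #define HOST_PAGE_SIZE   65536ULL

  /* True when 'last_offset' was the final target page of its host page,
   * i.e. the point at which it is safe to jump to a queued (postcopy)
   * page without splitting a host-page chunk. */
  static bool at_host_page_boundary(uint64_t last_offset)
  {
      return (last_offset & (HOST_PAGE_SIZE - 1)) ==
             (HOST_PAGE_SIZE - TARGET_PAGE_SIZE);
  }
  /* e.g. last_offset = 0x2F000 -> true (16th 4K page of its 64K host page),
   *      last_offset = 0x2E000 -> false                                    */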

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c  | 154 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 trace-events |   2 +
 2 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 9d8fc6b..acf65e1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -328,6 +328,7 @@ static RAMBlock *last_seen_block;
 /* This is the last block from where we have sent data */
 static RAMBlock *last_sent_block;
 static ram_addr_t last_offset;
+static bool last_was_from_queue;
 static unsigned long *migration_bitmap;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
@@ -461,6 +462,19 @@ static inline bool migration_bitmap_set_dirty(ram_addr_t addr)
     return ret;
 }
 
+static inline bool migration_bitmap_clear_dirty(ram_addr_t addr)
+{
+    bool ret;
+    int nr = addr >> TARGET_PAGE_BITS;
+
+    ret = test_and_clear_bit(nr, migration_bitmap);
+
+    if (ret) {
+        migration_dirty_pages--;
+    }
+    return ret;
+}
+
 static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
 {
     ram_addr_t addr;
@@ -669,6 +683,39 @@ static int ram_save_page(QEMUFile *f, RAMBlock* block, ram_addr_t offset,
 }
 
 /*
+ * Unqueue a page from the queue fed by postcopy page requests
+ *
+ * Returns:      The RAMBlock* to transmit from (or NULL if the queue is empty)
+ *      ms:      MigrationState in
+ *  offset:      the byte offset within the RAMBlock for the start of the page
+ * ram_addr_abs: global offset in the dirty/sent bitmaps
+ */
+static RAMBlock *ram_save_unqueue_page(MigrationState *ms, ram_addr_t *offset,
+                                       ram_addr_t *ram_addr_abs)
+{
+    RAMBlock *result = NULL;
+    qemu_mutex_lock(&ms->src_page_req_mutex);
+    if (!QSIMPLEQ_EMPTY(&ms->src_page_requests)) {
+        struct MigrationSrcPageRequest *entry =
+                                    QSIMPLEQ_FIRST(&ms->src_page_requests);
+        result = entry->rb;
+        *offset = entry->offset;
+        *ram_addr_abs = (entry->offset + entry->rb->offset) & TARGET_PAGE_MASK;
+
+        if (entry->len > TARGET_PAGE_SIZE) {
+            entry->len -= TARGET_PAGE_SIZE;
+            entry->offset += TARGET_PAGE_SIZE;
+        } else {
+            QSIMPLEQ_REMOVE_HEAD(&ms->src_page_requests, next_req);
+            g_free(entry);
+        }
+    }
+    qemu_mutex_unlock(&ms->src_page_req_mutex);
+
+    return result;
+}
+
+/*
  * Queue the pages for transmission, e.g. a request from postcopy destination
 *   ms: MigrationState in which the queue is held
  *   rbname: The RAMBlock the request is for - may be NULL (to mean reuse last)
@@ -732,46 +779,102 @@ int ram_save_queue_pages(MigrationState *ms, const char *rbname,
 
 static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
 {
+    MigrationState *ms = migrate_get_current();
     RAMBlock *block = last_seen_block;
+    RAMBlock *tmpblock;
     ram_addr_t offset = last_offset;
+    ram_addr_t tmpoffset;
     bool complete_round = false;
     int bytes_sent = 0;
-    MemoryRegion *mr;
     ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
                                  ram_addr_t space */
+    unsigned long hps = sysconf(_SC_PAGESIZE);
 
-    if (!block)
+    if (!block) {
         block = QTAILQ_FIRST(&ram_list.blocks);
+        last_was_from_queue = false;
+    }
 
-    while (true) {
-        mr = block->mr;
-        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
-                                                       &dirty_ram_abs);
-        if (complete_round && block == last_seen_block &&
-            offset >= last_offset) {
-            break;
+    while (true) { /* Until we send a block or run out of stuff to send */
+        tmpblock = NULL;
+
+        /*
+         * Don't break host-page chunks up with queue items
+         * so only unqueue if,
+         *   a) The last item came from the queue anyway
+         *   b) The last sent item was the last target-page in a host page
+         */
+        if (last_was_from_queue || !last_sent_block ||
+            ((last_offset & (hps - 1)) == (hps - TARGET_PAGE_SIZE))) {
+            tmpblock = ram_save_unqueue_page(ms, &tmpoffset, &dirty_ram_abs);
         }
-        if (offset >= block->used_length) {
-            offset = 0;
-            block = QTAILQ_NEXT(block, next);
-            if (!block) {
-                block = QTAILQ_FIRST(&ram_list.blocks);
-                complete_round = true;
-                ram_bulk_stage = false;
+
+        if (tmpblock) {
+            /* We've got a block from the postcopy queue */
+            trace_ram_find_and_save_block_postcopy(tmpblock->idstr,
+                                                   (uint64_t)tmpoffset,
+                                                   (uint64_t)dirty_ram_abs);
+            /*
+             * We're sending this page, and since it's postcopy nothing else
+             * will dirty it, and we must make sure it doesn't get sent again.
+             */
+            if (!migration_bitmap_clear_dirty(dirty_ram_abs)) {
+                trace_ram_find_and_save_block_postcopy_not_dirty(
+                    tmpblock->idstr, (uint64_t)tmpoffset,
+                    (uint64_t)dirty_ram_abs,
+                    test_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap));
+
+                continue;
             }
+            /*
+             * As soon as we start servicing pages out of order, then we have
+             * to kill the bulk stage, since the bulk stage assumes
+             * in (migration_bitmap_find_and_reset_dirty) that every page is
+             * dirty, that's no longer true.
+             */
+            ram_bulk_stage = false;
+            /*
+             * We mustn't change block/offset unless it's to a valid one
+             * otherwise we can go down some of the exit cases in the normal
+             * path.
+             */
+            block = tmpblock;
+            offset = tmpoffset;
+            last_was_from_queue = true;
         } else {
-            bytes_sent = ram_save_page(f, block, offset, last_stage);
-
-            /* if page is unmodified, continue to the next */
-            if (bytes_sent > 0) {
-                MigrationState *ms = migrate_get_current();
-                if (ms->sentmap) {
-                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
+            MemoryRegion *mr;
+            /* priority queue empty, so just search for something dirty */
+            mr = block->mr;
+            offset = migration_bitmap_find_and_reset_dirty(mr, offset,
+                                                           &dirty_ram_abs);
+            if (complete_round && block == last_seen_block &&
+                offset >= last_offset) {
+                break;
+            }
+            if (offset >= block->used_length) {
+                offset = 0;
+                block = QTAILQ_NEXT(block, next);
+                if (!block) {
+                    block = QTAILQ_FIRST(&ram_list.blocks);
+                    complete_round = true;
+                    ram_bulk_stage = false;
                 }
+                continue; /* pick an offset in the new block */
+            }
+            last_was_from_queue = false;
+        }
 
-                last_sent_block = block;
-                break;
+        /* We have a page to send, so send it */
+        bytes_sent = ram_save_page(f, block, offset, last_stage);
+
+        /* if page is unmodified, continue to the next */
+        if (bytes_sent > 0) {
+            if (ms->sentmap) {
+                set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
             }
+
+            last_sent_block = block;
+            break;
         }
     }
     last_seen_block = block;
@@ -865,6 +968,7 @@ static void reset_ram_globals(void)
     last_offset = 0;
     last_version = ram_list.version;
     ram_bulk_stage = true;
+    last_was_from_queue = false;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
diff --git a/trace-events b/trace-events
index 8a0d70d..781cf5c 100644
--- a/trace-events
+++ b/trace-events
@@ -1217,6 +1217,8 @@ qemu_file_fclose(void) ""
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
 migration_throttle(void) ""
+ram_find_and_save_block_postcopy(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr) "%s/%" PRIx64 " ram_addr=%" PRIx64
+ram_find_and_save_block_postcopy_not_dirty(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr, int sent) "%s/%" PRIx64 " ram_addr=%" PRIx64 " (sent=%d)"
 ram_postcopy_send_discard_bitmap(void) ""
 ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (33 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-24  2:33   ` David Gibson
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration Dr. David Alan Gilbert (git)
                   ` (9 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

postcopy_place_page (etc) provide a way for postcopy to place a page
into the guest's memory atomically (using the copy ioctl on the ufd).
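
The heart of the placement is the copy ioctl on the userfault fd; stripped of
the page-state bookkeeping, it looks roughly like the sketch below (the
uffdio_copy layout and ioctl name come from the userfault16 kernel branch this
series targets, so treat the details as illustrative):

  #include <errno.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>   /* from the userfault kernel branch */

  /* Copy one host page from 'from' into the guest mapping at 'host';
   * a running destination vCPU either sees the old missing page or the
   * fully written new one, never a partial copy. */
  static int place_one_page(int userfault_fd, void *host, void *from)
  {
      struct uffdio_copy copy_struct = {
          .dst  = (uint64_t)(uintptr_t)host,
          .src  = (uint64_t)(uintptr_t)from,
          .len  = getpagesize(),
          .mode = 0,
      };

      if (ioctl(userfault_fd, UFFDIO_COPY, &copy_struct)) {
          return -errno;   /* caller reports and aborts the migration */
      }
      return 0;            /* the copy also wakes any thread faulting here */
  }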

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h    |   2 +
 include/migration/postcopy-ram.h |  16 ++++++
 migration/postcopy-ram.c         | 113 ++++++++++++++++++++++++++++++++++++++-
 trace-events                     |   1 +
 4 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index b1c7cad..139bb1b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -94,6 +94,8 @@ struct MigrationIncomingState {
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
     PostcopyPMI    postcopy_pmi;
+    void          *postcopy_tmp_page;
+    long           postcopy_place_skipped; /* Check for incorrect place ops */
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
index fbb2a93..3d30280 100644
--- a/include/migration/postcopy-ram.h
+++ b/include/migration/postcopy-ram.h
@@ -80,4 +80,20 @@ void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
 void postcopy_discard_send_finish(MigrationState *ms,
                                   PostcopyDiscardState *pds);
 
+/*
+ * Place a page (from) at (host) efficiently
+ *    There are restrictions on how 'from' must be mapped, in general best
+ *    to use other postcopy_ routines to allocate.
+ * returns 0 on success
+ */
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        long bitmap_offset, bool all_zero);
+
+/*
+ * Allocate a page of memory that can be mapped at a later point in time
+ * using postcopy_place_page
+ * Returns: Pointer to allocated page
+ */
+void *postcopy_get_tmp_page(MigrationIncomingState *mis);
+
 #endif
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 33dd332..86fa5a0 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -197,7 +197,6 @@ static PostcopyPMIState postcopy_pmi_get_state_nolock(
 }
 
 /* Retrieve the state of the given page */
-__attribute__ (( unused )) /* Until later in patch series */
 static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
                                                size_t bitmap_index)
 {
@@ -213,7 +212,6 @@ static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
  * Set the page state to the given state if the previous state was as expected
  * Return the actual previous state.
  */
-__attribute__ (( unused )) /* Until later in patch series */
 static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
                                            size_t bitmap_index,
                                            PostcopyPMIState expected_state,
@@ -477,6 +475,7 @@ static int cleanup_area(const char *block_name, void *host_addr,
 int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
 {
     postcopy_pmi_init(mis, ram_pages);
+    mis->postcopy_place_skipped = -1;
 
     if (qemu_ram_foreach_block(init_area, mis)) {
         return -1;
@@ -495,6 +494,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (mis->postcopy_tmp_page) {
+        munmap(mis->postcopy_tmp_page, getpagesize());
+        mis->postcopy_tmp_page = NULL;
+    }
     return 0;
 }
 
@@ -561,6 +564,100 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
     return 0;
 }
 
+/*
+ * Place a host page (from) at (host) atomically
+ *    There are restrictions on how 'from' must be mapped, in general best
+ *    to use other postcopy_ routines to allocate.
+ * all_zero: Hint that the page being placed is 0 throughout
+ * returns 0 on success
+ * bitmap_offset: Index into the migration bitmaps
+ *
+ * State changes:
+ *   none -> received
+ *   requested -> received (ack)
+ *
+ * Note the UF thread is also updating the state, and maybe none->requested
+ * at the same time.
+ */
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        long bitmap_offset, bool all_zero)
+{
+    PostcopyPMIState old_state, tmp_state, new_state;
+
+    if (!all_zero) {
+        struct uffdio_copy copy_struct;
+
+        copy_struct.dst = (uint64_t)(uintptr_t)host;
+        copy_struct.src = (uint64_t)(uintptr_t)from;
+        copy_struct.len = getpagesize();
+        copy_struct.mode = 0;
+
+        /* copy also acks to the kernel waking the stalled thread up
+         * TODO: We can inhibit that ack and only do it if it was requested
+         * which would be slightly cheaper, but we'd have to be careful
+         * of the order of updating our page state.
+         */
+        if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
+            int e = errno;
+            error_report("%s: %s copy host: %p from: %p pmi=%d",
+                         __func__, strerror(e), host, from,
+                         postcopy_pmi_get_state(mis, bitmap_offset));
+
+            return -e;
+        }
+    } else {
+        struct uffdio_zeropage zero_struct;
+
+        zero_struct.range.start = (uint64_t)(uintptr_t)host;
+        zero_struct.range.len = getpagesize();
+        zero_struct.mode = 0;
+
+        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
+            int e = errno;
+            error_report("%s: %s zero host: %p from: %p pmi=%d",
+                         __func__, strerror(e), host, from,
+                         postcopy_pmi_get_state(mis, bitmap_offset));
+
+            return -e;
+        }
+    }
+
+    bitmap_offset &= ~(mis->postcopy_pmi.host_bits-1);
+    new_state = POSTCOPY_PMI_RECEIVED;
+    tmp_state = postcopy_pmi_get_state(mis, bitmap_offset);
+    do {
+        old_state = tmp_state;
+        tmp_state = postcopy_pmi_change_state(mis, bitmap_offset, old_state,
+                                              new_state);
+    } while (old_state != tmp_state);
+    trace_postcopy_place_page(bitmap_offset, host, all_zero, old_state);
+
+    return 0;
+}
+
+/*
+ * Returns a target page of memory that can be mapped at a later point in time
+ * using postcopy_place_page
+ * The same address is used repeatedly, postcopy_place_page just takes the
+ * backing page away.
+ * Returns: Pointer to allocated page
+ *
+ */
+void *postcopy_get_tmp_page(MigrationIncomingState *mis)
+{
+    if (!mis->postcopy_tmp_page) {
+        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
+                             PROT_READ | PROT_WRITE, MAP_PRIVATE |
+                             MAP_ANONYMOUS, -1, 0);
+        if (mis->postcopy_tmp_page == MAP_FAILED) {
+            perror("mapping postcopy tmp page");
+            return NULL;
+        }
+    }
+
+    return mis->postcopy_tmp_page;
+}
+
 #else
 /* No target OS support, stubs just fail */
 bool postcopy_ram_supported_by_host(void)
@@ -608,6 +705,18 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
 {
     assert(0);
 }
+
+int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
+                        long bitmap_offset, bool all_zero)
+{
+    assert(0);
+}
+
+void *postcopy_get_tmp_page(MigrationIncomingState *mis)
+{
+    assert(0);
+}
+
 #endif
 
 /* ------------------------------------------------------------------------- */
diff --git a/trace-events b/trace-events
index 781cf5c..16a91d9 100644
--- a/trace-events
+++ b/trace-events
@@ -1497,6 +1497,7 @@ rdma_start_outgoing_migration_after_rdma_source_init(void) ""
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
 postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
+postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
 
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (34 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers Dr. David Alan Gilbert (git)
@ 2015-02-25 16:51 ` Dr. David Alan Gilbert (git)
  2015-03-24  4:51   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host Dr. David Alan Gilbert (git)
                   ` (8 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

In postcopy, the destination guest is running at the same time
as it's receiving pages; as we receive new pages we must put
them into the guest's address space atomically to avoid a running
CPU accessing a partially written page.

Use the helpers in postcopy-ram.c to map these pages.
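
The fiddly part is deciding where in the temporary host-page buffer an
incoming target page lands and when the buffer is complete; a toy version of
that arithmetic with illustrative page sizes (the real code uses
qemu_host_page_mask, as in the hunk below):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TARGET_PAGE_SIZE 4096UL             /* illustrative sizes only */
  #define HOST_PAGE_MASK   (~(65536UL - 1))   /* 64K host pages */

  /* Offset of this target page within the temporary host-page buffer. */
  static size_t buffer_offset(uintptr_t host)
  {
      return host & ~HOST_PAGE_MASK;
  }

  /* True for the final target page of a host page, i.e. the point at
   * which the whole buffer can be placed atomically into the guest. */
  static bool last_target_page_of_host_page(uintptr_t host)
  {
      return ((host + TARGET_PAGE_SIZE) & ~HOST_PAGE_MASK) == 0;
  }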

Note, gcc 4.9.2 is giving me false uninitialized warnings in ram_load's
switch, so anything conditionally set at the start of the switch needs
initializing; filed as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64614

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 111 insertions(+), 26 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index acf65e1..7a1c9ea 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -1479,9 +1479,39 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
     return 0;
 }
 
+/*
+ * Helper for host_from_stream_offset in the successful case.
+ * Returns the host pointer for the given block and offset
+ *   calling the postcopy hook and filling in *rb.
+ */
+static void *host_with_postcopy_hook(MigrationIncomingState *mis,
+                                     ram_addr_t offset,
+                                     RAMBlock *block,
+                                     RAMBlock **rb)
+{
+    if (rb) {
+        *rb = block;
+    }
+
+    postcopy_hook_early_receive(mis,
+        (offset + block->offset) >> TARGET_PAGE_BITS);
+    return memory_region_get_ram_ptr(block->mr) + offset;
+}
+
+/*
+ * Read a RAMBlock ID from the stream f, find the host address of the
+ * start of that block and add on 'offset'
+ *
+ * f: Stream to read from
+ * mis: MigrationIncomingState
+ * offset: Offset within the block
+ * flags: Page flags (mostly to see if it's a continuation of previous block)
+ * rb: Pointer to RAMBlock* that gets filled in with the RB we find
+ */
 static inline void *host_from_stream_offset(QEMUFile *f,
+                                            MigrationIncomingState *mis,
                                             ram_addr_t offset,
-                                            int flags)
+                                            int flags, RAMBlock **rb)
 {
     static RAMBlock *block = NULL;
     char id[256];
@@ -1492,8 +1522,7 @@ static inline void *host_from_stream_offset(QEMUFile *f,
             error_report("Ack, bad migration stream!");
             return NULL;
         }
-
-        return memory_region_get_ram_ptr(block->mr) + offset;
+        return host_with_postcopy_hook(mis, offset, block, rb);
     }
 
     len = qemu_get_byte(f);
@@ -1503,7 +1532,7 @@ static inline void *host_from_stream_offset(QEMUFile *f,
     QTAILQ_FOREACH(block, &ram_list.blocks, next) {
         if (!strncmp(id, block->idstr, sizeof(id)) &&
             block->max_length > offset) {
-            return memory_region_get_ram_ptr(block->mr) + offset;
+            return host_with_postcopy_hook(mis, offset, block, rb);
         }
     }
 
@@ -1537,6 +1566,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     int flags = 0, ret = 0;
     static uint64_t seq_iter;
+    /*
+     * System is running in postcopy mode, page inserts to host memory must be
+     * atomic
+     */
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    bool postcopy_running = postcopy_state_get(mis) >=
+                            POSTCOPY_INCOMING_LISTENING;
+    void *postcopy_host_page = NULL;
+    bool postcopy_place_needed = false;
 
     seq_iter++;
 
@@ -1545,14 +1583,57 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     }
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
+        RAMBlock *rb = 0; /* =0 needed to silence compiler */
         ram_addr_t addr, total_ram_bytes;
-        void *host;
+        void *host = 0;
+        void *page_buffer = 0;
         uint8_t ch;
+        bool all_zero = false;
 
         addr = qemu_get_be64(f);
         flags = addr & ~TARGET_PAGE_MASK;
         addr &= TARGET_PAGE_MASK;
 
+        if (flags & (RAM_SAVE_FLAG_COMPRESS | RAM_SAVE_FLAG_PAGE |
+                     RAM_SAVE_FLAG_XBZRLE)) {
+            host = host_from_stream_offset(f, mis, addr, flags, &rb);
+            if (!host) {
+                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
+                ret = -EINVAL;
+                break;
+            }
+            if (!postcopy_running) {
+                page_buffer = host;
+            } else {
+                /*
+                 * Postcopy requires that we place whole host pages atomically.
+                 * To make it atomic, the data is read into a temporary page
+                 * that's moved into place later.
+                 * The migration protocol uses,  possibly smaller, target-pages
+                 * however the source ensures it always sends all the components
+                 * of a host page in order.
+                 */
+                if (!postcopy_host_page) {
+                    postcopy_host_page = postcopy_get_tmp_page(mis);
+                }
+                page_buffer = postcopy_host_page +
+                              ((uintptr_t)host & ~qemu_host_page_mask);
+                /* If all TP are zero then we can optimise the place */
+                if (!((uintptr_t)host & ~qemu_host_page_mask)) {
+                    all_zero = true;
+                }
+
+                /*
+                 * If it's the last part of a host page then we place the host
+                 * page
+                 */
+                postcopy_place_needed = (((uintptr_t)host + TARGET_PAGE_SIZE) &
+                                         ~qemu_host_page_mask) == 0;
+            }
+        } else {
+            postcopy_place_needed = false;
+        }
+
         switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
         case RAM_SAVE_FLAG_MEM_SIZE:
             /* Synchronize RAM block list */
@@ -1590,32 +1671,27 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
             break;
         case RAM_SAVE_FLAG_COMPRESS:
-            host = host_from_stream_offset(f, addr, flags);
-            if (!host) {
-                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
-                ret = -EINVAL;
-                break;
-            }
-
             ch = qemu_get_byte(f);
-            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
-            break;
-        case RAM_SAVE_FLAG_PAGE:
-            host = host_from_stream_offset(f, addr, flags);
-            if (!host) {
-                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
-                ret = -EINVAL;
-                break;
+            if (!postcopy_running) {
+                ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
+            } else {
+                memset(page_buffer, ch, TARGET_PAGE_SIZE);
+                if (ch) {
+                    all_zero = false;
+                }
             }
+            break;
 
-            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+        case RAM_SAVE_FLAG_PAGE:
+            all_zero = false;
+            qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
             break;
+
         case RAM_SAVE_FLAG_XBZRLE:
-            host = host_from_stream_offset(f, addr, flags);
-            if (!host) {
-                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
-                ret = -EINVAL;
-                break;
+            all_zero = false;
+            if (postcopy_running) {
+                error_report("XBZRLE RAM block in postcopy mode @%zx\n", addr);
+                return -EINVAL;
             }
 
             if (load_xbzrle(f, addr, host) < 0) {
@@ -1637,6 +1713,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 ret = -EINVAL;
             }
         }
+
+        if (postcopy_place_needed) {
+            /* This gets called at the last target page in the host page */
+            ret = postcopy_place_page(mis, host + TARGET_PAGE_SIZE -
+                                           qemu_host_page_size,
+                                      postcopy_host_page,
+                                      (addr + rb->offset) >> TARGET_PAGE_BITS,
+                                      all_zero);
+        }
         if (!ret) {
             ret = qemu_file_get_error(f);
         }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (35 preceding siblings ...)
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-24  4:55   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy Dr. David Alan Gilbert (git)
                   ` (7 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Postcopy sends RAMBlock names and offsets over the wire (since it can't
rely on the order of ramaddr being the same), and it starts out with
HVA fault addresses from the kernel.

qemu_ram_block_from_host translates an HVA into a RAMBlock, an offset
in the RAMBlock, the global ram_addr_t value and its bitmap position.

Rewrite qemu_ram_addr_from_host to use qemu_ram_block_from_host.

Provide qemu_ram_get_idstr since it's the actual name text sent on the
wire.
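
For a feel of how this gets used: a destination fault handler would resolve
the faulting HVA and ask the source for that page, roughly as in this
hypothetical glue (it assumes the migration headers from earlier patches; the
real userfault handler arrives in a later patch of the series):

  #include <unistd.h>
  #include "exec/cpu-common.h"
  #include "migration/migration.h"

  /* Hypothetical: turn a userfault HVA into a page request to the source. */
  static void request_page_for_fault(MigrationIncomingState *mis, void *fault)
  {
      ram_addr_t ram_addr, offset;
      unsigned long bm_index;
      RAMBlock *rb;

      rb = qemu_ram_block_from_host(fault, true /* round to page */,
                                    &ram_addr, &offset, &bm_index);
      if (rb) {
          migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
                                    offset, getpagesize());
      }
  }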

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 exec.c                    | 56 ++++++++++++++++++++++++++++++++++++++++++-----
 include/exec/cpu-common.h |  4 ++++
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/exec.c b/exec.c
index eafd964..0b02464 100644
--- a/exec.c
+++ b/exec.c
@@ -1237,6 +1237,11 @@ static RAMBlock *find_ram_block(ram_addr_t addr)
     return NULL;
 }
 
+const char *qemu_ram_get_idstr(RAMBlock *rb)
+{
+    return rb->idstr;
+}
+
 void qemu_ram_set_idstr(ram_addr_t addr, const char *name, DeviceState *dev)
 {
     RAMBlock *new_block = find_ram_block(addr);
@@ -1669,16 +1674,35 @@ static void *qemu_ram_ptr_length(ram_addr_t addr, hwaddr *size)
     }
 }
 
-/* Some of the softmmu routines need to translate from a host pointer
-   (typically a TLB entry) back to a ram offset.  */
-MemoryRegion *qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr)
+/*
+ * Translates a host ptr back to a RAMBlock, a ram_addr and an offset
+ * in that RAMBlock.
+ *
+ * ptr: Host pointer to look up
+ * round_offset: If true round the result offset down to a page boundary
+ * *ram_addr: set to result ram_addr
+ * *offset: set to result offset within the RAMBlock
+ * *bm_index: bitmap index (i.e. scaled ram_addr for use where the scale
+ *                          isn't available)
+ *
+ * Returns: RAMBlock (or NULL if not found)
+ */
+RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
+                                   ram_addr_t *ram_addr,
+                                   ram_addr_t *offset,
+                                   unsigned long *bm_index)
 {
     RAMBlock *block;
     uint8_t *host = ptr;
 
     if (xen_enabled()) {
         *ram_addr = xen_ram_addr_from_mapcache(ptr);
-        return qemu_get_ram_block(*ram_addr)->mr;
+        block = qemu_get_ram_block(*ram_addr);
+        if (!block) {
+            return NULL;
+        }
+        *offset = (host - block->host);
+        return block;
     }
 
     block = ram_list.mru_block;
@@ -1699,7 +1723,29 @@ MemoryRegion *qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr)
     return NULL;
 
 found:
-    *ram_addr = block->offset + (host - block->host);
+    *offset = (host - block->host);
+    if (round_offset) {
+        *offset &= TARGET_PAGE_MASK;
+    }
+    *ram_addr = block->offset + *offset;
+    *bm_index = *ram_addr >> TARGET_PAGE_BITS;
+    return block;
+}
+
+/* Some of the softmmu routines need to translate from a host pointer
+   (typically a TLB entry) back to a ram offset.  */
+MemoryRegion *qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr)
+{
+    RAMBlock *block;
+    ram_addr_t offset; /* Not used */
+    unsigned long index; /* Not used */
+
+    block = qemu_ram_block_from_host(ptr, false, ram_addr, &offset, &index);
+
+    if (!block) {
+        return NULL;
+    }
+
     return block->mr;
 }
 
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index a31300c..d23a97f 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -61,8 +61,12 @@ typedef uint32_t CPUReadMemoryFunc(void *opaque, hwaddr addr);
 void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should not be used by devices.  */
 MemoryRegion *qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr);
+RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
+                                   ram_addr_t *ram_addr, ram_addr_t *offset,
+                                   unsigned long *bm_index);
 void qemu_ram_set_idstr(ram_addr_t addr, const char *name, DeviceState *dev);
 void qemu_ram_unset_idstr(ram_addr_t addr);
+const char *qemu_ram_get_idstr(RAMBlock *rb);
 
 void cpu_physical_memory_rw(hwaddr addr, uint8_t *buf,
                             int len, int is_write);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (36 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-24  4:58   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps Dr. David Alan Gilbert (git)
                   ` (6 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Once we're in postcopy the source CPUs are stopped and guest memory
shouldn't change any more, so there's no need to resync the dirty
bitmap.

Two notes on this:
  1) If we did resync and a page had changed, it would get sent again,
     which the destination wouldn't allow (since it might already have
     modified the page itself)
  2) Before disabling this I'd seen very rare cases where a page had been
     marked dirty although the memory contents were apparently identical

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 7a1c9ea..9d8ca95 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -1398,7 +1398,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 static int ram_save_complete(QEMUFile *f, void *opaque)
 {
     qemu_mutex_lock_ramlist();
-    migration_bitmap_sync();
+
+    if (!migration_postcopy_phase(migrate_get_current())) {
+        migration_bitmap_sync();
+    }
 
     ram_control_before_iterate(f, RAM_CONTROL_FINISH);
 
@@ -1433,7 +1436,8 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
 
     remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
 
-    if (remaining_size < max_size) {
+    if (!migration_postcopy_phase(migrate_get_current()) &&
+        remaining_size < max_size) {
         qemu_mutex_lock_iothread();
         migration_bitmap_sync();
         qemu_mutex_unlock_iothread();
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (37 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-24  5:23   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests Dr. David Alan Gilbert (git)
                   ` (5 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Prior to the start of postcopy, ensure that everything that will
be transferred later is a whole host-page in size.

This is accomplished by discarding partially transferred host pages
and marking any that are partially dirty as fully dirty.
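
As an illustration of the bitmap arithmetic this uses (not part of the
patch): with a 4kB host page and 1kB target pages, host_bits is 4 and
host_mask is 0xf, so each 32bit chunk of the sent/dirty maps covers eight
host pages.  A standalone sketch of the per-host-page test, assuming those
example values:

    /* Standalone sketch, not part of the patch: detect partially sent /
     * partially dirty host pages within one 32bit chunk of the bitmaps,
     * assuming a 4kB host page and 1kB target pages (host_bits = 4). */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int host_bits = 4;      /* host page size / target page size */
        uint32_t host_mask = (1u << host_bits) - 1;
        uint32_t sdata = 0x00f3000f;     /* example 'sent' bits, 32 target pages */
        uint32_t ddata = 0x0000f5f0;     /* example 'dirty' bits */
        unsigned int hp;

        for (hp = 0; hp < 32; hp += host_bits) {
            uint32_t host_sent  = (sdata >> hp) & host_mask;
            uint32_t host_dirty = (ddata >> hp) & host_mask;

            if (host_sent && host_sent != host_mask) {
                printf("bit %2u: partially sent -> discard and redirty\n", hp);
            } else if (host_dirty && host_dirty != host_mask) {
                printf("bit %2u: partially dirty -> redirty\n", hp);
            }
        }
        return 0;
    }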

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 arch_init.c | 227 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 224 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 9d8ca95..9bc799b 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -788,7 +788,6 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
     int bytes_sent = 0;
     ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
                                  ram_addr_t space */
-    unsigned long hps = sysconf(_SC_PAGESIZE);
 
     if (!block) {
         block = QTAILQ_FIRST(&ram_list.blocks);
@@ -805,7 +804,8 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
          *   b) The last sent item was the last target-page in a host page
          */
         if (last_was_from_queue || !last_sent_block ||
-            ((last_offset & (hps - 1)) == (hps - TARGET_PAGE_SIZE))) {
+            ((last_offset & ~qemu_host_page_mask) ==
+             (qemu_host_page_size - TARGET_PAGE_SIZE))) {
             tmpblock = ram_save_unqueue_page(ms, &tmpoffset, &dirty_ram_abs);
         }
 
@@ -1040,7 +1040,6 @@ static uint32_t get_32bits_map(unsigned long *map, int64_t start)
  * A helper to put 32 bits into a bit map; trivial for HOST_LONG_BITS=32
  * messier for 64; the bitmaps are actually long's that are 32 or 64bit
  */
-__attribute__ (( unused )) /* Until later in patch series */
 static void put_32bits_map(unsigned long *map, int64_t start,
                            uint32_t v)
 {
@@ -1169,15 +1168,237 @@ static int pc_each_ram_discard(MigrationState *ms)
 }
 
 /*
+ * Helper for postcopy_chunk_hostpages where HPS/TPS >=32
+ *
+ * !! Untested !!
+ */
+static int hostpage_big_chunk_helper(const char *block_name, void *host_addr,
+                                     ram_addr_t offset, ram_addr_t length,
+                                     void *opaque)
+{
+    MigrationState *ms = opaque;
+    unsigned int host_len = (qemu_host_page_size / TARGET_PAGE_SIZE) / 32;
+    unsigned long first32, last32, cur32, current_hp;
+    unsigned long first = offset >> TARGET_PAGE_BITS;
+    unsigned long last = (offset + (length - 1)) >> TARGET_PAGE_BITS;
+
+    PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
+                                                           first & 31,
+                                                           block_name);
+    first32 = first / 32;
+    last32 = last / 32;
+
+    /*
+     * I'm assuming RAMBlocks must start at the start of host pages,
+     * but I guess they might not use the whole of the host page
+     */
+
+    /* Work along one host page at a time */
+    for (current_hp = first32; current_hp <= last32; current_hp += host_len) {
+        bool discard = false;
+        bool redirty = false;
+        bool has_some_dirty = false;
+        bool has_some_undirty = false;
+        bool has_some_sent = false;
+        bool has_some_unsent = false;
+
+        /*
+         * Check all the 32bit chunks of mask for this hp, and see if anything
+         * needs updating.
+         */
+        for (cur32 = current_hp; cur32 < (current_hp + host_len); cur32++) {
+            /* a chunk of sent pages */
+            uint32_t sdata = get_32bits_map(ms->sentmap, cur32 * 32);
+            /* a chunk of dirty pages */
+            uint32_t ddata = get_32bits_map(migration_bitmap, cur32 * 32);
+
+            if (sdata) {
+                has_some_sent = true;
+            }
+            if (sdata != 0xfffffffful) {
+                has_some_unsent = true;
+            }
+            if (ddata) {
+                has_some_dirty = true;
+            }
+            if (ddata != 0xfffffffful) {
+                has_some_undirty = true;
+            }
+
+        }
+
+        if (has_some_sent && has_some_unsent) {
+            /* Partially sent host page */
+            discard = true;
+            redirty = true;
+        }
+
+        if (has_some_dirty && has_some_undirty) {
+            /* Partially dirty host page */
+            redirty = true;
+        }
+
+        if (!discard && !redirty) {
+            /* All consistent - next host page */
+            continue;
+        }
+
+
+        /* Now walk the 32bit chunks again, sending discards etc */
+        for (cur32 = current_hp; cur32 < (current_hp + host_len); cur32++) {
+            /* a chunk of sent pages */
+            uint32_t sdata = get_32bits_map(ms->sentmap, cur32 * 32);
+            /* a chunk of dirty pages */
+            uint32_t ddata = get_32bits_map(migration_bitmap, cur32 * 32);
+
+            if (discard && sdata) {
+                /* Tell the destination to discard these pages */
+                postcopy_discard_send_chunk(ms, pds, (cur32-first32) * 32,
+                                            sdata);
+                /* And clear them in the sent data structure */
+                put_32bits_map(ms->sentmap, cur32 * 32, 0);
+            }
+
+            if (redirty) {
+                put_32bits_map(migration_bitmap, cur32 * 32,
+                               0xffffffffu);
+                /* Inc the count of dirty pages */
+                migration_dirty_pages += ctpop32(~ddata);
+            }
+        }
+    }
+
+    postcopy_discard_send_finish(ms, pds);
+
+    return 0;
+}
+
+
+/*
+ * Utility for the outgoing postcopy code.
+ *
+ * Discard any partially sent host-page size chunks, mark any partially
+ * dirty host-page size chunks as all dirty.
+ *
+ * Returns: 0 on success
+ */
+static int postcopy_chunk_hostpages(MigrationState *ms)
+{
+    struct RAMBlock *block;
+    unsigned int host_bits = qemu_host_page_size / TARGET_PAGE_SIZE;
+    uint32_t host_mask;
+
+    assert(is_power_of_2(host_bits));
+
+    if (qemu_host_page_size == TARGET_PAGE_SIZE) {
+        /* Easy case - TPS==HPS - nothing to be done */
+        return 0;
+    }
+
+    /* Easiest way to make sure we don't resume in the middle of a host-page */
+    last_seen_block = NULL;
+    last_sent_block = NULL;
+
+    /*
+     * The current worst known ratio is ARM, which has 1kB target pages and
+     * can have 64kB host pages - inconveniently larger than our 32bit
+     * chunks; but then again the migration bitmaps we're reworking are made
+     * of 'long's, which can be 32bit, so it's also inconvenient to work in
+     * 64bit chunks.
+     */
+    if (host_bits >= 32) {
+        /* Deal with the odd case separately */
+        return qemu_ram_foreach_block(hostpage_big_chunk_helper, ms);
+    } else {
+        host_mask =  (1u << host_bits) - 1;
+    }
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        unsigned long first32, last32, cur32;
+        unsigned long first = block->offset >> TARGET_PAGE_BITS;
+        unsigned long last = (block->offset + (block->used_length - 1))
+                                >> TARGET_PAGE_BITS;
+        PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
+                                                               first & 31,
+                                                               block->idstr);
+
+        first32 = first / 32;
+        last32 = last / 32;
+        for (cur32 = first32; cur32 <= last32; cur32++) {
+            unsigned int current_hp;
+            /* Deal with start/end not on alignment */
+            uint32_t mask = make_32bit_mask(first, last, cur32);
+
+            /* a chunk of sent pages */
+            uint32_t sdata = get_32bits_map(ms->sentmap, cur32 * 32);
+            /* a chunk of dirty pages */
+            uint32_t ddata = get_32bits_map(migration_bitmap, cur32 * 32);
+            uint32_t discard = 0;
+            uint32_t redirty = 0;
+            sdata &= mask;
+            ddata &= mask;
+
+            for (current_hp = 0; current_hp < 32; current_hp += host_bits) {
+                uint32_t host_sent = (sdata >> current_hp) & host_mask;
+                uint32_t host_dirty = (ddata >> current_hp) & host_mask;
+
+                if (host_sent && (host_sent != host_mask)) {
+                    /* Partially sent host page */
+                    redirty |= host_mask << current_hp;
+                    discard |= host_mask << current_hp;
+
+                } else if (host_dirty && (host_dirty != host_mask)) {
+                    /* Partially dirty host page */
+                    redirty |= host_mask << current_hp;
+                }
+            }
+            if (discard) {
+                /* Tell the destination to discard these pages */
+                postcopy_discard_send_chunk(ms, pds, (cur32-first32) * 32,
+                                            discard);
+                /* And clear them in the sent data structure */
+                sdata = get_32bits_map(ms->sentmap, cur32 * 32);
+                put_32bits_map(ms->sentmap, cur32 * 32, sdata & ~discard);
+            }
+            if (redirty) {
+                /*
+                 * Reread the original dirty bits and OR in the ones we are
+                 * setting; we must reread since we might be at the start or
+                 * end of a RAMBlock where the original 'mask' discarded
+                 * some bits.
+                 */
+                ddata = get_32bits_map(migration_bitmap, cur32 * 32);
+                put_32bits_map(migration_bitmap, cur32 * 32,
+                           ddata | redirty);
+                /* Inc the count of dirty pages */
+                migration_dirty_pages += ctpop32(redirty - (ddata & redirty));
+            }
+        }
+
+        postcopy_discard_send_finish(ms, pds);
+    }
+
+    return 0;
+}
+
+/*
  * Transmit the set of pages to be discarded after precopy to the target
  * these are pages that have been sent previously but have been dirtied
  * Hopefully this is pretty sparse
  */
 int ram_postcopy_send_discard_bitmap(MigrationState *ms)
 {
+    int ret;
+
     /* This should be our last sync, the src is now paused */
     migration_bitmap_sync();
 
+    /* Deal with TPS != HPS */
+    ret = postcopy_chunk_hostpages(ms);
+    if (ret) {
+        return ret;
+    }
+
     /*
      * Update the sentmap to be  sentmap&=dirty
      */
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (38 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-24  5:38   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 41/45] Start up a postcopy/listener thread ready for incoming page data Dr. David Alan Gilbert (git)
                   ` (4 subsequent siblings)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

userfaultfd is a Linux syscall that gives us an fd delivering a stream of
notifications when pages registered with it are accessed; the program can
then acknowledge each stall and tell the accessing thread to carry on.
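
For anyone not familiar with the API, a rough sketch of the fault-service
loop (not part of the patch; it assumes the userfault16 kernel ABI this
series targets, where the fd yields raw 64bit addresses, and a
linux/userfaultfd.h header providing UFFDIO_WAKE and struct uffdio_range):

    /* Sketch only: read one faulting address from the userfault fd and,
     * once the missing page has been placed, wake the stalled thread(s). */
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>   /* assumed header for UFFDIO_WAKE etc. */

    static int service_one_fault(int ufd, long pagesize)
    {
        uint64_t hostaddr;           /* the kernel gives us 64 bits, not a pointer */
        struct uffdio_range range;

        if (read(ufd, &hostaddr, sizeof(hostaddr)) != sizeof(hostaddr)) {
            return -1;               /* lost alignment with the kernel */
        }

        /* ... fetch and place the missing page covering hostaddr here ... */

        range.start = hostaddr & ~(uint64_t)(pagesize - 1);
        range.len = pagesize;
        return ioctl(ufd, UFFDIO_WAKE, &range) ? -1 : 0;
    }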

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |   4 +
 migration/postcopy-ram.c      | 217 ++++++++++++++++++++++++++++++++++++++++--
 trace-events                  |  12 +++
 3 files changed, 223 insertions(+), 10 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 139bb1b..cec064f 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -86,11 +86,15 @@ struct MigrationIncomingState {
 
     PostcopyState postcopy_state;
 
+    bool           have_fault_thread;
     QemuThread     fault_thread;
     QemuSemaphore  fault_thread_sem;
 
     /* For the kernel to send us notifications */
     int            userfault_fd;
+    /* To tell the fault_thread to quit */
+    int            userfault_quit_fd;
+
     QEMUFile *return_path;
     QemuMutex      rp_mutex;    /* We send replies from multiple threads */
     PostcopyPMI    postcopy_pmi;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 86fa5a0..abc039e 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -47,6 +47,8 @@ struct PostcopyDiscardState {
  */
 #if defined(__linux__)
 
+#include <poll.h>
+#include <sys/eventfd.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
 #include <sys/types.h>
@@ -264,7 +266,7 @@ void postcopy_pmi_dump(MigrationIncomingState *mis)
 void postcopy_hook_early_receive(MigrationIncomingState *mis,
                                  size_t bitmap_index)
 {
-    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
+    if (postcopy_state_get(mis) == POSTCOPY_INCOMING_ADVISE) {
         /*
          * If we're in precopy-advise mode we need to track received pages even
          * though we don't need to place pages atomically yet.
@@ -489,15 +491,40 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
  */
 int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
 {
-    /* TODO: Join the fault thread once we're sure it will exit */
-    if (qemu_ram_foreach_block(cleanup_area, mis)) {
-        return -1;
+    trace_postcopy_ram_incoming_cleanup_entry();
+
+    if (mis->have_fault_thread) {
+        uint64_t tmp64;
+
+        if (qemu_ram_foreach_block(cleanup_area, mis)) {
+            return -1;
+        }
+        /*
+         * Tell the fault_thread to exit, it's an eventfd that should
+         * currently be at 0, we're going to inc it to 1
+         */
+        tmp64 = 1;
+        if (write(mis->userfault_quit_fd, &tmp64, 8) == 8) {
+            trace_postcopy_ram_incoming_cleanup_join();
+            qemu_thread_join(&mis->fault_thread);
+        } else {
+            /* Not much we can do here, but may as well report it */
+            perror("incing userfault_quit_fd");
+        }
+        trace_postcopy_ram_incoming_cleanup_closeuf();
+        close(mis->userfault_fd);
+        close(mis->userfault_quit_fd);
+        mis->have_fault_thread = false;
     }
 
+    postcopy_state_set(mis, POSTCOPY_INCOMING_END);
+    migrate_send_rp_shut(mis, qemu_file_get_error(mis->file) != 0);
+
     if (mis->postcopy_tmp_page) {
         munmap(mis->postcopy_tmp_page, getpagesize());
         mis->postcopy_tmp_page = NULL;
     }
+    trace_postcopy_ram_incoming_cleanup_exit();
     return 0;
 }
 
@@ -531,36 +558,206 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
 }
 
 /*
+ * Tell the kernel that we've now got some memory it previously asked for.
+ */
+static int ack_userfault(MigrationIncomingState *mis, void *start, size_t len)
+{
+    struct uffdio_range range_struct;
+
+    range_struct.start = (uint64_t)(uintptr_t)start;
+    range_struct.len = (uint64_t)len;
+
+    errno = 0;
+    if (ioctl(mis->userfault_fd, UFFDIO_WAKE, &range_struct)) {
+        int e = errno;
+
+        if (e == ENOENT) {
+            /* Kernel said it wasn't waiting - one case where this can
+             * happen is where two threads triggered the userfault for the
+             * same page; the page arrived and was acked just after we read
+             * the 2nd request, so by the time we ack again nothing is
+             * waiting.  We could optimise this out, but it's rare.
+             */
+            /*fprintf(stderr, "ack_userfault: %p/%zx ENOENT\n", start, len); */
+            return 0;
+        }
+        error_report("postcopy_ram: Failed to notify kernel for %p/%zx (%d)",
+                     start, len, e);
+        return -e;
+    }
+
+    return 0;
+}
+
+/*
  * Handle faults detected by the USERFAULT markings
  */
 static void *postcopy_ram_fault_thread(void *opaque)
 {
     MigrationIncomingState *mis = (MigrationIncomingState *)opaque;
-
-    fprintf(stderr, "postcopy_ram_fault_thread\n");
-    /* TODO: In later patch */
+    uint64_t hostaddr; /* The kernel always gives us 64 bit, not a pointer */
+    int ret;
+    size_t hostpagesize = getpagesize();
+    RAMBlock *rb = NULL;
+    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
+    uint8_t *local_tmp_page;
+
+    trace_postcopy_ram_fault_thread_entry();
     qemu_sem_post(&mis->fault_thread_sem);
-    while (1) {
-        /* TODO: In later patch */
+
+    local_tmp_page = mmap(NULL, getpagesize(),
+                          PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS,
+                          -1, 0);
+    if (local_tmp_page == MAP_FAILED) {
+        perror("mapping local tmp page");
+        return NULL;
     }
+    if (madvise(local_tmp_page, getpagesize(), MADV_DONTFORK)) {
+        munmap(local_tmp_page, getpagesize());
+        perror("postcpy local page DONTFORK");
+        return NULL;
+    }
+
+    while (true) {
+        PostcopyPMIState old_state, tmp_state;
+        ram_addr_t rb_offset;
+        ram_addr_t in_raspace;
+        unsigned long bitmap_index;
+        struct pollfd pfd[2];
+
+        /*
+         * We're mainly waiting for the kernel to give us a faulting HVA,
+         * however we can be told to quit via userfault_quit_fd which is
+         * an eventfd
+         */
+        pfd[0].fd = mis->userfault_fd;
+        pfd[0].events = POLLIN;
+        pfd[0].revents = 0;
+        pfd[1].fd = mis->userfault_quit_fd;
+        pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
+        pfd[1].revents = 0;
+
+        if (poll(pfd, 2, -1 /* Wait forever */) == -1) {
+            perror("userfault poll");
+            break;
+        }
+
+        if (pfd[1].revents) {
+            trace_postcopy_ram_fault_thread_quit();
+            break;
+        }
+
+        ret = read(mis->userfault_fd, &hostaddr, sizeof(hostaddr));
+        if (ret != sizeof(hostaddr)) {
+            if (ret < 0) {
+                perror("Failed to read full userfault hostaddr");
+                break;
+            } else {
+                error_report("%s: Read %d bytes from userfaultfd expected %zd",
+                             __func__, ret, sizeof(hostaddr));
+                break; /* Lost alignment, don't know what we'd read next */
+            }
+        }
+
+        rb = qemu_ram_block_from_host((void *)(uintptr_t)hostaddr, true,
+                                      &in_raspace, &rb_offset, &bitmap_index);
+        if (!rb) {
+            error_report("postcopy_ram_fault_thread: Fault outside guest: %"
+                         PRIx64, hostaddr);
+            break;
+        }
 
+        trace_postcopy_ram_fault_thread_request(hostaddr, bitmap_index,
+                                                qemu_ram_get_idstr(rb),
+                                                rb_offset);
+
+        tmp_state = postcopy_pmi_get_state(mis, bitmap_index);
+        do {
+            old_state = tmp_state;
+
+            switch (old_state) {
+            case POSTCOPY_PMI_REQUESTED:
+                /* Do nothing - it's already requested */
+                break;
+
+            case POSTCOPY_PMI_RECEIVED:
+                /* Already arrived - no state change, just kick the kernel */
+                trace_postcopy_ram_fault_thread_notify_pre(hostaddr);
+                if (ack_userfault(mis,
+                                  (void *)((uintptr_t)hostaddr
+                                           & ~(hostpagesize - 1)),
+                                  hostpagesize)) {
+                    assert(0);
+                }
+                break;
+
+            case POSTCOPY_PMI_MISSING:
+                tmp_state = postcopy_pmi_change_state(mis, bitmap_index,
+                                           old_state, POSTCOPY_PMI_REQUESTED);
+                if (tmp_state == POSTCOPY_PMI_MISSING) {
+                    /*
+                     * Send the request to the source - we want to request one
+                     * of our host page sizes (which is >= TPS)
+                     */
+                    if (rb != last_rb) {
+                        last_rb = rb;
+                        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
+                                                 rb_offset, hostpagesize);
+                    } else {
+                        /* Save some space */
+                        migrate_send_rp_req_pages(mis, NULL,
+                                                 rb_offset, hostpagesize);
+                    }
+                } /* else it just arrived from the source and the kernel will
+                     be kicked during the receive */
+                break;
+           }
+        } while (tmp_state != old_state);
+    }
+    munmap(local_tmp_page, getpagesize());
+    trace_postcopy_ram_fault_thread_exit();
     return NULL;
 }
 
 int postcopy_ram_enable_notify(MigrationIncomingState *mis)
 {
-    /* Create the fault handler thread and wait for it to be ready */
+    /* Open the fd for the kernel to give us userfaults */
+    mis->userfault_fd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    if (mis->userfault_fd == -1) {
+        perror("Failed to open userfault fd");
+        return -1;
+    }
+
+    /*
+     * Although the host check already tested the API, we need to
+     * do the check again as an ABI handshake on the new fd.
+     */
+    if (!ufd_version_check(mis->userfault_fd)) {
+        return -1;
+    }
+
+    /* Now an eventfd we use to tell the fault-thread to quit */
+    mis->userfault_quit_fd = eventfd(0, EFD_CLOEXEC);
+    if (mis->userfault_quit_fd == -1) {
+        perror("Opening userfault_quit_fd");
+        close(mis->userfault_fd);
+        return -1;
+    }
+
     qemu_sem_init(&mis->fault_thread_sem, 0);
     qemu_thread_create(&mis->fault_thread, "postcopy/fault",
                        postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
     qemu_sem_wait(&mis->fault_thread_sem);
     qemu_sem_destroy(&mis->fault_thread_sem);
+    mis->have_fault_thread = true;
 
     /* Mark so that we get notified of accesses to unwritten areas */
     if (qemu_ram_foreach_block(ram_block_enable_notify, mis)) {
         return -1;
     }
 
+    trace_postcopy_ram_enable_notify();
+
     return 0;
 }
 
diff --git a/trace-events b/trace-events
index 16a91d9..d955a28 100644
--- a/trace-events
+++ b/trace-events
@@ -1498,6 +1498,18 @@ postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s ma
 postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
 postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
+postcopy_ram_enable_notify(void) ""
+postcopy_ram_fault_thread_entry(void) ""
+postcopy_ram_fault_thread_exit(void) ""
+postcopy_ram_fault_thread_quit(void) ""
+postcopy_ram_fault_thread_request(uint64_t hostaddr, unsigned long index, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " index=%lx rb=%s offset=%zx"
+postcopy_ram_fault_thread_notify_pre(uint64_t hostaddr) "%" PRIx64
+postcopy_ram_fault_thread_notify_zero(void *hostaddr) "%p"
+postcopy_ram_fault_thread_notify_zero_ack(void *hostaddr, unsigned long bitmap_index) "%p %lx"
+postcopy_ram_incoming_cleanup_closeuf(void) ""
+postcopy_ram_incoming_cleanup_entry(void) ""
+postcopy_ram_incoming_cleanup_exit(void) ""
+postcopy_ram_incoming_cleanup_join(void) ""
 
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 41/45] Start up a postcopy/listener thread ready for incoming page data
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (39 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 42/45] postcopy: Wire up loadvm_postcopy_handle_{run, end} commands Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Loading of device state (during postcopy) may access guest memory that's
still on the source machine and thus might need a page fill; split off a
separate thread that handles the incoming page data so that the original
incoming-migration code can finish loading the device data.
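
The hand-off itself is a small pattern: create the listener, wait until it
signals that it is running, and from then on leave the stream alone.  A
rough sketch of that pattern using plain POSIX threads (the patch itself
uses QEMU's qemu_thread/qemu_sem wrappers):

    /* Sketch only: hand the migration stream over to a listener thread. */
    #include <pthread.h>
    #include <semaphore.h>

    static sem_t listen_ready;

    static void *listen_thread(void *stream)
    {
        sem_post(&listen_ready);     /* tell the creator we own the stream now */
        /* ... keep reading incoming page data until migration ends ... */
        return NULL;
    }

    static int start_listen_thread(void *stream)
    {
        pthread_t tid;

        sem_init(&listen_ready, 0, 0);
        if (pthread_create(&tid, NULL, listen_thread, stream)) {
            return -1;
        }
        sem_wait(&listen_ready);     /* don't return until the stream has an owner */
        sem_destroy(&listen_ready);
        return 0;                    /* the caller must now stop reading the stream */
    }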

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  4 +++
 migration/migration.c         |  6 ++++
 savevm.c                      | 71 +++++++++++++++++++++++++++++++++++++++++--
 trace-events                  |  2 ++
 4 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index cec064f..c2af2ef 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -90,6 +90,10 @@ struct MigrationIncomingState {
     QemuThread     fault_thread;
     QemuSemaphore  fault_thread_sem;
 
+    bool           have_listen_thread;
+    QemuThread     listen_thread;
+    QemuSemaphore  listen_thread_sem;
+
     /* For the kernel to send us notifications */
     int            userfault_fd;
     /* To tell the fault_thread to quit */
diff --git a/migration/migration.c b/migration/migration.c
index 939f426..c108851 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1067,6 +1067,12 @@ static int postcopy_start(MigrationState *ms, bool *old_vm_running)
         goto fail;
     }
 
+    /*
+     * Make sure the receiver can get incoming pages before we send the rest
+     * of the state
+     */
+    qemu_savevm_send_postcopy_listen(fb);
+
     qemu_savevm_state_complete(fb);
     qemu_savevm_send_ping(fb, 3);
 
diff --git a/savevm.c b/savevm.c
index 014ba08..eb22410 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1307,6 +1307,51 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
     return 0;
 }
 
+/*
+ * Triggered by a postcopy_listen command; this thread takes over reading
+ * the input stream, leaving the main thread free to carry on loading the rest
+ * of the device state (from RAM).
+ * (TODO:This could do with being in a postcopy file - but there again it's
+ * just another input loop, not that postcopy specific)
+ */
+static void *postcopy_ram_listen_thread(void *opaque)
+{
+    QEMUFile *f = opaque;
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    int load_res;
+
+    qemu_sem_post(&mis->listen_thread_sem);
+    trace_postcopy_ram_listen_thread_start();
+
+    load_res = qemu_loadvm_state_main(f, mis);
+
+    trace_postcopy_ram_listen_thread_exit();
+    if (load_res < 0) {
+        error_report("%s: loadvm failed: %d", __func__, load_res);
+        qemu_file_set_error(f, load_res);
+    }
+    postcopy_ram_incoming_cleanup(mis);
+    /*
+     * If everything has worked fine, then the main thread has waited
+     * for us to start, and we're the last use of the mis.
+     * (If something broke then qemu will have to exit anyway since it's
+     * got a bad migration state).
+     */
+    migration_incoming_state_destroy();
+
+    if (load_res < 0) {
+        /*
+         * If something went wrong then we have a bad state so exit;
+         * depending how far we got it might be possible at this point
+         * to leave the guest running and fire MCEs for pages that never
+         * arrived as a desperate recovery step.
+         */
+        exit(EXIT_FAILURE);
+    }
+
+    return NULL;
+}
+
 /* After this message we must be able to immediately receive postcopy data */
 static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 {
@@ -1326,8 +1371,25 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         return -1;
     }
 
-    /* TODO start up the postcopy listening thread */
-    return 0;
+    if (mis->have_listen_thread) {
+        error_report("CMD_POSTCOPY_RAM_LISTEN already has a listen thread");
+        return -1;
+    }
+
+    mis->have_listen_thread = true;
+    /* Start up the listening thread and wait for it to signal ready */
+    qemu_sem_init(&mis->listen_thread_sem, 0);
+    qemu_thread_create(&mis->listen_thread, "postcopy/listen",
+                       postcopy_ram_listen_thread, mis->file,
+                       QEMU_THREAD_JOINABLE);
+    qemu_sem_wait(&mis->listen_thread_sem);
+    qemu_sem_destroy(&mis->listen_thread_sem);
+
+    /*
+     * all good - cause the loop that handled this command to exit because
+     * the new thread is taking over
+     */
+    return LOADVM_QUIT_PARENT;
 }
 
 /* After all discards we can start running and asking for pages */
@@ -1670,6 +1732,11 @@ int qemu_loadvm_state(QEMUFile *f)
     ret = qemu_loadvm_state_main(f, mis);
 
     trace_qemu_loadvm_state_post_main(ret);
+    if (mis->have_listen_thread) {
+        /* Listen thread still going, can't clean up yet */
+        return ret;
+    }
+
     if (ret == 0) {
         cpu_synchronize_all_post_init();
         ret = qemu_file_get_error(f);
diff --git a/trace-events b/trace-events
index d955a28..82b3631 100644
--- a/trace-events
+++ b/trace-events
@@ -1183,6 +1183,8 @@ loadvm_postcopy_ram_handle_discard(void) ""
 loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
 loadvm_process_command_ping(uint32_t val) "%x"
+postcopy_ram_listen_thread_exit(void) ""
+postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
 qemu_savevm_send_postcopy_ram_discard(void) ""
 qemu_savevm_state_complete_skip_for_postcopy(const char *section) "skipping: %s"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 42/45] postcopy: Wire up loadvm_postcopy_handle_{run, end} commands
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (40 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 41/45] Start up a postcopy/listener thread ready for incoming page data Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 43/45] End of migration for postcopy Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up more of the handlers for the commands on the destination side;
in particular, loadvm_postcopy_handle_run now has enough to start the
guest running.

handle_end has to wait for the main thread to finish.
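
That wait is just an event which the main loader sets once it has finished
and which the END handler blocks on; a sketch of the equivalent with a
mutex/condvar pair (the patch itself uses QemuEvent):

    /* Sketch only: the END handler must not complete until the main thread
     * has finished loading device state. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t load_done_cond = PTHREAD_COND_INITIALIZER;
    static bool load_done;

    static void main_thread_finished_load(void)   /* main thread, after loading */
    {
        pthread_mutex_lock(&load_lock);
        load_done = true;
        pthread_cond_broadcast(&load_done_cond);
        pthread_mutex_unlock(&load_lock);
    }

    static void end_handler_wait_for_main(void)   /* listen thread, END handler */
    {
        pthread_mutex_lock(&load_lock);
        while (!load_done) {
            pthread_cond_wait(&load_done_cond, &load_lock);
        }
        pthread_mutex_unlock(&load_lock);
    }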

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/migration/migration.h |  6 +++++
 migration/migration.c         |  2 ++
 savevm.c                      | 56 ++++++++++++++++++++++++++++++++++++++-----
 trace-events                  |  4 +++-
 4 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index c2af2ef..02376df 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -86,6 +86,12 @@ struct MigrationIncomingState {
 
     PostcopyState postcopy_state;
 
+    /*
+     * Free at the start of the main state load, set as the main thread finishes
+     * loading state.
+     */
+    QemuEvent      main_thread_load_event;
+
     bool           have_fault_thread;
     QemuThread     fault_thread;
     QemuSemaphore  fault_thread_sem;
diff --git a/migration/migration.c b/migration/migration.c
index c108851..97b86cc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -84,6 +84,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
     mis_current->file = f;
     QLIST_INIT(&mis_current->loadvm_handlers);
     qemu_mutex_init(&mis_current->rp_mutex);
+    qemu_event_init(&mis_current->main_thread_load_event, false);
 
     return mis_current;
 }
@@ -91,6 +92,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
 void migration_incoming_state_destroy(void)
 {
     postcopy_pmi_destroy(mis_current);
+    qemu_event_destroy(&mis_current->main_thread_load_event);
     loadvm_free_handlers(mis_current);
     g_free(mis_current);
     mis_current = NULL;
diff --git a/savevm.c b/savevm.c
index eb22410..45d46db 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1396,12 +1396,34 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
 {
     PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_RUNNING);
+    Error *local_err = NULL;
+
     trace_loadvm_postcopy_handle_run();
     if (ps != POSTCOPY_INCOMING_LISTENING) {
         error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
         return -1;
     }
 
+    /* TODO we should move all of this lot into postcopy_ram.c or shared code
+     * in migration.c
+     */
+    cpu_synchronize_all_post_init();
+
+    qemu_announce_self();
+
+    /* Make sure all file formats flush their mutable metadata */
+    bdrv_invalidate_cache_all(&local_err);
+    if (local_err) {
+        qerror_report_err(local_err);
+        error_free(local_err);
+        return -1;
+    }
+
+    trace_loadvm_postcopy_handle_run_cpu_sync();
+    cpu_synchronize_all_post_init();
+
+    trace_loadvm_postcopy_handle_run_vmstart();
+
     if (autostart) {
         /* Hold onto your hats, starting the CPU */
         vm_start();
@@ -1410,19 +1432,40 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
         runstate_set(RUN_STATE_PAUSED);
     }
 
-    return 0;
+    return LOADVM_QUIT_LOOP;
 }
 
-/* The end - with a byte from the source which can tell us to fail. */
-static int loadvm_postcopy_handle_end(MigrationIncomingState *mis)
+/* The end - with a byte from the source which can tell us to fail.
+ * The source sends this either if there is a failure, or if it believes it's
+ * sent everything.
+ */
+static int loadvm_postcopy_handle_end(MigrationIncomingState *mis,
+                                          uint8_t status)
 {
     PostcopyState ps = postcopy_state_get(mis);
-    trace_loadvm_postcopy_handle_end();
+    trace_loadvm_postcopy_handle_end(status);
     if (ps == POSTCOPY_INCOMING_NONE) {
         error_report("CMD_POSTCOPY_END in wrong postcopy state (%d)", ps);
         return -1;
     }
-    return -1; /* TODO - expecting 1 byte good/fail */
+
+    if (!status) {
+        /*
+         * This looks good, but it's possible that the device loading in the
+         * main thread hasn't finished yet, and so we might not be in 'RUN'
+         * state yet; wait for the end of the main thread.
+         */
+        qemu_event_wait(&mis->main_thread_load_event);
+    }
+
+    if (status) {
+        error_report("CMD_POSTCOPY_END: error on source host (%d)",
+                     status);
+        qemu_file_set_error(mis->file, -EPIPE);
+    }
+
+    /* This will cause the listen thread to exit and call cleanup */
+    return LOADVM_QUIT_LOOP;
 }
 
 static int loadvm_process_command_simple_lencheck(const char *name,
@@ -1566,7 +1609,7 @@ static int loadvm_process_command(QEMUFile *f)
                                                    len, 1)) {
             return -1;
         }
-        return loadvm_postcopy_handle_end(mis);
+        return loadvm_postcopy_handle_end(mis, qemu_get_byte(f));
 
     case MIG_CMD_POSTCOPY_RAM_DISCARD:
         return loadvm_postcopy_ram_handle_discard(mis, len);
@@ -1730,6 +1773,7 @@ int qemu_loadvm_state(QEMUFile *f)
     }
 
     ret = qemu_loadvm_state_main(f, mis);
+    qemu_event_set(&mis->main_thread_load_event);
 
     trace_qemu_loadvm_state_post_main(ret);
     if (mis->have_listen_thread) {
diff --git a/trace-events b/trace-events
index 82b3631..869988b 100644
--- a/trace-events
+++ b/trace-events
@@ -1176,9 +1176,11 @@ loadvm_handle_cmd_packaged(unsigned int length) "%u"
 loadvm_handle_cmd_packaged_main(int ret) "%d"
 loadvm_handle_cmd_packaged_received(int ret) "%d"
 loadvm_postcopy_handle_advise(void) ""
-loadvm_postcopy_handle_end(void) ""
+loadvm_postcopy_handle_end(uint8_t status) "%d"
 loadvm_postcopy_handle_listen(void) ""
 loadvm_postcopy_handle_run(void) ""
+loadvm_postcopy_handle_run_cpu_sync(void) ""
+loadvm_postcopy_handle_run_vmstart(void) ""
 loadvm_postcopy_ram_handle_discard(void) ""
 loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 43/45] End of migration for postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (41 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 42/45] postcopy: Wire up loadvm_postcopy_handle_{run, end} commands Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy Dr. David Alan Gilbert (git)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy Dr. David Alan Gilbert (git)
  44 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Tweak the end-of-migration cleanup; we don't want to close everything down
at the end of the main stream, since postcopy is still transferring pages
on the other thread.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c | 25 ++++++++++++++++++++++++-
 trace-events          |  2 ++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 97b86cc..a137444 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -199,12 +199,35 @@ static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
     Error *local_err = NULL;
+    MigrationIncomingState *mis;
+    PostcopyState ps;
     int ret;
 
-    migration_incoming_state_new(f);
+    mis = migration_incoming_state_new(f);
 
     ret = qemu_loadvm_state(f);
 
+    ps = postcopy_state_get(mis);
+    trace_process_incoming_migration_co_end(ret, ps);
+    if (ps != POSTCOPY_INCOMING_NONE) {
+        if (ps == POSTCOPY_INCOMING_ADVISE) {
+            /*
+             * Where a migration had postcopy enabled (and thus went to advise)
+             * but managed to complete within the precopy period, we can use
+             * the normal exit.
+             */
+            postcopy_ram_incoming_cleanup(mis);
+        } else if (ret >= 0) {
+            /*
+             * Postcopy was started, cleanup should happen at the end of the
+             * postcopy thread.
+             */
+            trace_process_incoming_migration_co_postcopy_end_main();
+            return;
+        }
+        /* Else if something went wrong then just fall out of the normal exit */
+    }
+
     qemu_fclose(f);
     free_xbzrle_decoded_buf();
     migration_incoming_state_destroy();
diff --git a/trace-events b/trace-events
index 869988b..cbf4f88 100644
--- a/trace-events
+++ b/trace-events
@@ -1434,6 +1434,8 @@ source_return_path_thread_loop_top(void) ""
 source_return_path_thread_pong(uint32_t val) "%x"
 source_return_path_thread_shut(uint32_t val) "%x"
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
+process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
+process_incoming_migration_co_postcopy_end_main(void) ""
 
 # migration/rdma.c
 __qemu_rdma_add_block(int block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Added Block: %d, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (42 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 43/45] End of migration for postcopy Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-23  4:33   ` David Gibson
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy Dr. David Alan Gilbert (git)
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Userfault doesn't work with mlock; mlock is designed to nail down pages
so they don't move, while userfault is designed to tell you when they're
not there.

munlock the pages we userfault protect before postcopy.
mlock everything again at the end if mlock is enabled.
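
The locking is simply bracketed around the postcopy phase; a rough sketch
of the shape of it (not the patch itself - mlockall() here stands in for
QEMU's os_mlock(), and 'mlock_was_enabled' for the global enable_mlock
flag):

    /* Sketch only: userfault and mlock don't mix, so drop locked memory
     * for the duration of postcopy and re-lock afterwards if wanted. */
    #include <stdio.h>
    #include <sys/mman.h>

    static int postcopy_begin(void)
    {
        if (munlockall()) {              /* nothing may stay pinned */
            perror("munlockall");
            return -1;
        }
        /* ... register RAM with userfaultfd and run the postcopy phase ... */
        return 0;
    }

    static void postcopy_end(int mlock_was_enabled)
    {
        if (mlock_was_enabled && mlockall(MCL_CURRENT | MCL_FUTURE)) {
            /* Not fatal: we have a valid VM, we've just lost the pinning */
            perror("mlockall");
        }
    }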

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/sysemu/sysemu.h  |  1 +
 migration/postcopy-ram.c | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 0f2e4ed..25cc791 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -176,6 +176,7 @@ extern int boot_menu;
 extern bool boot_strict;
 extern uint8_t *boot_splash_filedata;
 extern size_t boot_splash_filedata_size;
+extern bool enable_mlock;
 extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClockType rtc_clock;
 extern const char *mem_path;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index abc039e..d8f5ccd 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -299,6 +299,11 @@ static bool ufd_version_check(int ufd)
 }
 
 
+/*
+ * Note: This has the side effect of munlock'ing all of RAM; that's
+ * normally fine since, if the postcopy succeeds, mlock gets turned back
+ * on at the end.
+ */
 bool postcopy_ram_supported_by_host(void)
 {
     long pagesize = getpagesize();
@@ -327,6 +332,15 @@ bool postcopy_ram_supported_by_host(void)
     }
 
     /*
+     * userfault and mlock don't go together; we'll put it back later if
+     * it was enabled.
+     */
+    if (munlockall()) {
+        perror("postcopy_ram_incoming_init: munlockall");
+        return -1;
+    }
+
+    /*
      *  We need to check that the ops we need are supported on anon memory
      *  To do that we need to register a chunk and see the flags that
      *  are returned.
@@ -517,6 +531,16 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         mis->have_fault_thread = false;
     }
 
+    if (enable_mlock) {
+        if (os_mlock() < 0) {
+            error_report("mlock: %s", strerror(errno));
+            /*
+             * It doesn't feel right to fail at this point, we have a valid
+             * VM state.
+             */
+        }
+    }
+
     postcopy_state_set(mis, POSTCOPY_INCOMING_END);
     migrate_send_rp_shut(mis, qemu_file_get_error(mis->file) != 0);
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy
  2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
                   ` (43 preceding siblings ...)
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy Dr. David Alan Gilbert (git)
@ 2015-02-25 16:52 ` Dr. David Alan Gilbert (git)
  2015-03-23  4:32   ` David Gibson
  44 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-02-25 16:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The userfault mechanism used for postcopy generates faults for us on
pages that are 'not present'; inflating a balloon in the guest causes
host pages to be marked as 'not present'.  Doing this during a postcopy,
while potentially the same pages are being received from the source,
would confuse the state of the received pages, so disable ballooning
during postcopy.

When disabled we drop balloon requests from the guest.  Since ballooning
is generally initiated by the host, the management system should avoid
initiating any balloon instructions to the guest during migration,
although it's not possible to know how long it would take a guest to
process a request made prior to the start of migration.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 balloon.c                  | 11 +++++++++++
 hw/virtio/virtio-balloon.c |  4 +++-
 include/sysemu/balloon.h   |  2 ++
 migration/postcopy-ram.c   |  9 +++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/balloon.c b/balloon.c
index dea19a4..faedb60 100644
--- a/balloon.c
+++ b/balloon.c
@@ -35,6 +35,17 @@
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
+static bool balloon_inhibited;
+
+bool qemu_balloon_is_inhibited(void)
+{
+    return balloon_inhibited;
+}
+
+void qemu_balloon_inhibit(bool state)
+{
+    balloon_inhibited = state;
+}
 
 static bool have_ballon(Error **errp)
 {
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 7bfbb75..b0e94ee 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -36,9 +36,11 @@
 static void balloon_page(void *addr, int deflate)
 {
 #if defined(__linux__)
-    if (!kvm_enabled() || kvm_has_sync_mmu())
+    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
+                                         kvm_has_sync_mmu())) {
         qemu_madvise(addr, TARGET_PAGE_SIZE,
                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+    }
 #endif
 }
 
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index 0345e01..6851d99 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -23,5 +23,7 @@ typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
 			     QEMUBalloonStatus *stat_func, void *opaque);
 void qemu_remove_balloon_handler(void *opaque);
+bool qemu_balloon_is_inhibited(void);
+void qemu_balloon_inhibit(bool state);
 
 #endif
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d8f5ccd..b9f5848 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -24,6 +24,7 @@
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/balloon.h"
 #include "qemu/bitmap.h"
 #include "qemu/error-report.h"
 #include "trace.h"
@@ -531,6 +532,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         mis->have_fault_thread = false;
     }
 
+    qemu_balloon_inhibit(false);
+
     if (enable_mlock) {
         if (os_mlock() < 0) {
             error_report("mlock: %s", strerror(errno));
@@ -780,6 +783,12 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
         return -1;
     }
 
+    /*
+     * Ballooning can mark pages as absent while we're postcopying
+     * that would cause false userfaults.
+     */
+    qemu_balloon_inhibit(true);
+
     trace_postcopy_ram_enable_notify();
 
     return 0;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works Dr. David Alan Gilbert (git)
@ 2015-03-05  3:21   ` David Gibson
  2015-03-05  9:21     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-05  3:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 11070 bytes --]

On Wed, Feb 25, 2015 at 04:51:24PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/migration.txt | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 189 insertions(+)
> 
> diff --git a/docs/migration.txt b/docs/migration.txt
> index 0492a45..c6c3798 100644
> --- a/docs/migration.txt
> +++ b/docs/migration.txt
> @@ -294,3 +294,192 @@ save/send this state when we are in the middle of a pio operation
>  (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
>  not enabled, the values on that fields are garbage and don't need to
>  be sent.
> +
> += Return path =
> +
> +In most migration scenarios there is only a single data path that runs
> +from the source VM to the destination, typically along a single fd (although
> +possibly with another fd or similar for some fast way of throwing pages across).
> +
> +However, some uses need two way communication; in particular the Postcopy destination
> +needs to be able to request pages on demand from the source.
> +
> +For these scenarios there is a 'return path' from the destination to the source;
> +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
> +path.
> +
> +  Source side
> +     Forward path - written by migration thread
> +     Return path  - opened by main thread, read by return-path thread
> +
> +  Destination side
> +     Forward path - read by main thread
> +     Return path  - opened by main thread, written by main thread AND postcopy
> +                    thread (protected by rp_mutex)
> +
> += Postcopy =
> +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> +its plus side is that there is an upper bound on the amount of migration traffic
> +and time it takes, the down side is that during the postcopy phase, a failure of
> +*either* side or the network connection causes the guest to be lost.
> +
> +In postcopy the destination CPUs are started before all the memory has been
> +transferred, and accesses to pages that are yet to be transferred cause
> +a fault that's translated by QEMU into a request to the source QEMU.
> +
> +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> +doesn't finish in a given time the switch is made to postcopy.
> +
> +=== Enabling postcopy ===
> +
> +To enable postcopy (prior to the start of migration):
> +
> +migrate_set_capability x-postcopy-ram on
> +
> +The migration will still start in precopy mode, however issuing:
> +
> +migrate_start_postcopy
> +
> +will now cause the transition from precopy to postcopy.
> +It can be issued immediately after migration is started or any
> +time later on.  Issuing it after the end of a migration is harmless.

It's not quite clear to me what this means.  Does
"migrate_start_postcopy" mean it will immediately transfer execution
and transfer any remaining pages postcopy, or does it just mean it
will start postcopying once the remaining data to transfer is small
enough?

What's the reason for this rather awkward two stage activation of
postcopy?

> +=== Postcopy device transfer ===
> +
> +Loading of device data may cause the device emulation to access guest RAM
> +that may trigger faults that have to be resolved by the source, as such
> +the migration stream has to be able to respond with page data *during* the
> +device load, and hence the device data has to be read from the stream completely
> +before the device load begins to free the stream up.  This is achieved by
> +'packaging' the device data into a blob that's read in one go.
> +
> +Source behaviour
> +
> +Until postcopy is entered the migration stream is identical to normal
> +precopy, except for the addition of a 'postcopy advise' command at
> +the beginning, to tell the destination that postcopy might happen.
> +When postcopy starts the source sends the page discard data and then
> +forms the 'package' containing:
> +
> +   Command: 'postcopy ram listen'
> +   The device state
> +      A series of sections, identical to the precopy streams device state stream
> +      containing everything except postcopiable devices (i.e. RAM)
> +   Command: 'postcopy ram run'
> +
> +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> +contents are formatted in the same way as the main migration stream.

It seems to me the "ram listen", "ram run" and CMD_PACKAGED really
have to be used in conjuction this way, they don't really have any use
on their own.  So why not make it all CMD_POSTCOPY_TRANSITION and have
the "listen" and "run" take effect implicitly at the beginning and end
of the device data.

> +Destination behaviour
> +
> +Initially the destination looks the same as precopy, with a single thread
> +reading the migration stream; the 'postcopy advise' and 'discard' commands
> +are processed to change the way RAM is managed, but don't affect the stream
> +processing.
> +
> +------------------------------------------------------------------------------
> +                        1      2   3     4 5                      6   7
> +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> +thread                             |       |
> +                                   |     (page request)
> +                                   |        \___
> +                                   v            \
> +listen thread:                     --- page -- page -- page -- page -- page --
> +
> +                                   a   b        c
> +------------------------------------------------------------------------------
> +
> +On receipt of CMD_PACKAGED (1)
> +   All the data associated with the package - the ( ... ) section in the
> +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> +which contains commands (3,6) and devices (4...)
> +
> +On receipt of 'postcopy ram listen' - 3 - (i.e. the 1st command in the package)
> +a new thread (a) is started that takes over servicing the migration stream,
> +while the main thread carries on loading the package.   The listen thread loads
> +normal background page data (b), but if a fault happens during a device load (5)
> +the returned page (c) is also loaded by the listen thread, allowing the main
> +thread's device load to carry on.
> +
> +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> +CPUs start running.
> +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> +and is no longer used by migration, while the listen thread carries
> +on servicing page data until the end of migration.
> +
> +=== Postcopy states ===
> +
> +Postcopy moves through a series of states (see postcopy_state) from
> +ADVISE->LISTEN->RUNNING->END
> +
> +  Advise: Set at the start of migration if postcopy is enabled, even
> +          if it hasn't had the start command; here the destination
> +          checks that its OS has the support needed for postcopy, and performs
> +          setup to ensure the RAM mappings are suitable for later postcopy.
> +          (Triggered by reception of POSTCOPY_ADVISE command)
> +
> +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
> +          the destination state to Listen, and starts a new thread
> +          (the 'listen thread') which takes over the job of receiving
> +          pages off the migration stream, while the main thread carries
> +          on processing the blob.  With this thread able to process page
> +          reception, the destination now 'sensitises' the RAM to detect
> +          any access to missing pages (on Linux using the 'userfault'
> +          system).
> +
> +  Running: POSTCOPY_RUN causes the destination to synchronise all
> +          state and start the CPUs and IO devices running.  The main
> +          thread now finishes processing the migration package and
> +          carries on as it would for normal precopy migration
> +          (although it can't do the cleanup it would do as it
> +          finishes a normal migration).
> +
> +  End: The listen thread can now quit and perform the cleanup of migration
> +          state; the migration is now complete.
> +
> +=== Source side page maps ===
> +
> +The source side keeps two bitmaps during postcopy; 'the migration bitmap'
> +and 'sent map'.  The 'migration bitmap' is basically the same as in
> +the precopy case, and holds a bit to indicate that a page is 'dirty' -
> +i.e. needs sending.  During the precopy phase this is updated as the CPU
> +dirties pages, however during postcopy the CPUs are stopped and nothing
> +should dirty anything any more.
> +
> +The 'sent map' is used for the transition to postcopy. It is a bitmap that
> +has a bit set whenever a page is sent to the destination, however during
> +the transition to postcopy mode it is masked against the migration bitmap
> +(sentmap &= migrationbitmap) to generate a bitmap recording pages that
> +have previously been sent but are now dirty again.  This masked
> +sentmap is sent to the destination, which discards those now-dirty pages
> +before starting the CPUs.
> +
> +Note that once in postcopy mode, the sent map is still updated; however,
> +its contents are not necessarily consistent with the pages already sent
> +due to the masking with the migration bitmap.
> +
> +=== Destination side page maps ===
> +
> +(Needs to be changed so we can update both easily - at the moment updates are done
> + with a lock)
> +The destination keeps a state for each page which is 'missing', 'received'
> +or 'requested'; these three states are encoded in a 2 bit state array.
> +Incoming requests from the kernel cause the state to transition from 'missing'
> +to 'requested'.   Received pages cause a transition from either 'missing' or
> +'requested' to 'received'; the kernel is notified on reception to wake up
> +any threads that were waiting for the page.
> +If the kernel requests a page that has already been 'received' the kernel is
> +notified without re-requesting.
> +
> +This leads to three valid page states:
> +page states:
> +    missing        - page not yet received or requested
> +    received       - page received
> +    requested      - page requested but not yet received
> +
> +state transitions:
> +      received -> missing   (only during setup/discard)
> +      missing -> received   (normal incoming page)
> +      requested -> received (incoming page previously requested)
> +      missing -> requested  (userfault request)
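
For what it's worth, a minimal sketch of how a two-bit-per-page state array
might be packed (the names and helpers below are illustrative assumptions,
not code from the series):

    /* Illustrative only: 2 bits of state per target page, packed into longs. */
    enum dest_page_state {
        POSTCOPY_PAGE_MISSING   = 0,  /* not yet received or requested */
        POSTCOPY_PAGE_REQUESTED = 1,  /* requested from the source, not yet here */
        POSTCOPY_PAGE_RECEIVED  = 2,  /* page data has arrived */
    };

    #define PAGE_STATE_BITS  2
    #define PAGES_PER_LONG   (sizeof(unsigned long) * 8 / PAGE_STATE_BITS)

    static enum dest_page_state get_page_state(const unsigned long *map,
                                               size_t page)
    {
        size_t shift = (page % PAGES_PER_LONG) * PAGE_STATE_BITS;

        return (map[page / PAGES_PER_LONG] >> shift) & 0x3;
    }

    static void set_page_state(unsigned long *map, size_t page,
                               enum dest_page_state state)
    {
        size_t word  = page / PAGES_PER_LONG;
        size_t shift = (page % PAGES_PER_LONG) * PAGE_STATE_BITS;

        map[word] &= ~(0x3UL << shift);              /* clear the old state */
        map[word] |= (unsigned long)state << shift;  /* record the new one  */
    }

The patch itself serialises these updates with a lock, as the note above says;
the sketch only shows the three states packed into two bits.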

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
  2015-03-05  3:21   ` David Gibson
@ 2015-03-05  9:21     ` Dr. David Alan Gilbert
  2015-03-10  1:04       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-05  9:21 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:24PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  docs/migration.txt | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 189 insertions(+)
> > 
> > diff --git a/docs/migration.txt b/docs/migration.txt
> > index 0492a45..c6c3798 100644
> > --- a/docs/migration.txt
> > +++ b/docs/migration.txt
> > @@ -294,3 +294,192 @@ save/send this state when we are in the middle of a pio operation
> >  (that is what ide_drive_pio_state_needed() checks).  If DRQ_STAT is
> >  not enabled, the values on that fields are garbage and don't need to
> >  be sent.
> > +
> > += Return path =
> > +
> > +In most migration scenarios there is only a single data path that runs
> > +from the source VM to the destination, typically along a single fd (although
> > +possibly with another fd or similar for some fast way of throwing pages across).
> > +
> > +However, some uses need two way communication; in particular the Postcopy destination
> > +needs to be able to request pages on demand from the source.
> > +
> > +For these scenarios there is a 'return path' from the destination to the source;
> > +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
> > +path.
> > +
> > +  Source side
> > +     Forward path - written by migration thread
> > +     Return path  - opened by main thread, read by return-path thread
> > +
> > +  Destination side
> > +     Forward path - read by main thread
> > +     Return path  - opened by main thread, written by main thread AND postcopy
> > +                    thread (protected by rp_mutex)
> > +
> > += Postcopy =
> > +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> > +its plus side is that there is an upper bound on the amount of migration traffic
> > +and the time it takes, while the downside is that during the postcopy phase, a
> > +failure of *either* side or of the network connection causes the guest to be lost.
> > +
> > +In postcopy the destination CPUs are started before all the memory has been
> > +transferred, and accesses to pages that are yet to be transferred cause
> > +a fault that's translated by QEMU into a request to the source QEMU.
> > +
> > +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> > +doesn't finish in a given time the switch is made to postcopy.
> > +
> > +=== Enabling postcopy ===
> > +
> > +To enable postcopy (prior to the start of migration):
> > +
> > +migrate_set_capability x-postcopy-ram on
> > +
> > +The migration will still start in precopy mode, however issuing:
> > +
> > +migrate_start_postcopy
> > +
> > +will now cause the transition from precopy to postcopy.
> > +It can be issued immediately after migration is started or any
> > +time later on.  Issuing it after the end of a migration is harmless.
> 
> It's not quite clear to me what this means.  Does
> "migrate_start_postcopy" mean it will immediately transfer execution
> and transfer any remaining pages postcopy, or does it just mean it
> will start postcopying once the remaining data to transfer is small
> enough?

Yes; it will flip into postcopy soon after issuing that command irrespective
of the amount of data remaining.

> What's the reason for this rather awkward two stage activation of
> postcopy?

We need to keep track of the pages that are received during the precopy phase,
and do some madvise and other setup on the destination RAM area before precopy
starts; so we need to know early on that we might want to do postcopy, which is
why we have to be told in advance.  In the earliest posted version of my patches
I had a time-limit setting, and after the time limit expired QEMU would switch
into the second phase of postcopy by itself, but Paolo suggested migrate_start_postcopy:

https://lists.nongnu.org/archive/html/qemu-devel/2014-07/msg00943.html

and it works out simpler anyway.

> > +=== Postcopy device transfer ===
> > +
> > +Loading of device data may cause the device emulation to access guest RAM,
> > +which may trigger faults that have to be resolved by the source; as such
> > +the migration stream has to be able to respond with page data *during* the
> > +device load.  Hence the device data has to be read from the stream completely,
> > +to free the stream up, before the device load begins.  This is achieved by
> > +'packaging' the device data into a blob that's read in one go.
> > +
> > +Source behaviour
> > +
> > +Until postcopy is entered the migration stream is identical to normal
> > +precopy, except for the addition of a 'postcopy advise' command at
> > +the beginning, to tell the destination that postcopy might happen.
> > +When postcopy starts the source sends the page discard data and then
> > +forms the 'package' containing:
> > +
> > +   Command: 'postcopy ram listen'
> > +   The device state
> > +      A series of sections, identical to the precopy stream's device state stream
> > +      containing everything except postcopiable devices (i.e. RAM)
> > +   Command: 'postcopy ram run'
> > +
> > +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> > +contents are formatted in the same way as the main migration stream.
> 
> It seems to me the "ram listen", "ram run" and CMD_PACKAGED really
> have to be used in conjunction this way; they don't really have any use
> on their own.  So why not make it all CMD_POSTCOPY_TRANSITION and have
> the "listen" and "run" take effect implicitly at the beginning and end
> of the device data?

CMD_PACKAGED seems like something that is generally useful; it's fairly
complicated on its own, and so it seemed best to keep it separate.

(Reading your comment here I notice I've still got it as 'postcopy ram listen'
when I removed the 'ram' based on previous review comments; I've fixed
that locally).

Dave

> > +Destination behaviour
> > +
> > +Initially the destination looks the same as precopy, with a single thread
> > +reading the migration stream; the 'postcopy advise' and 'discard' commands
> > +are processed to change the way RAM is managed, but don't affect the stream
> > +processing.
> > +
> > +------------------------------------------------------------------------------
> > +                        1      2   3     4 5                      6   7
> > +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> > +thread                             |       |
> > +                                   |     (page request)
> > +                                   |        \___
> > +                                   v            \
> > +listen thread:                     --- page -- page -- page -- page -- page --
> > +
> > +                                   a   b        c
> > +------------------------------------------------------------------------------
> > +
> > +On receipt of CMD_PACKAGED (1)
> > +   All the data associated with the package - the ( ... ) section in the
> > +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> > +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> > +which contains commands (3,6) and devices (4...)
> > +
> > +On receipt of 'postcopy ram listen' - 3 - (i.e. the 1st command in the package)
> > +a new thread (a) is started that takes over servicing the migration stream,
> > +while the main thread carries on loading the package.   The listen thread loads
> > +normal background page data (b), but if a fault happens during a device load (5)
> > +the returned page (c) is also loaded by the listen thread, allowing the main
> > +thread's device load to carry on.
> > +
> > +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> > +CPUs start running.
> > +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> > +and is no longer used by migration, while the listen thread carries
> > +on servicing page data until the end of migration.
> > +
> > +=== Postcopy states ===
> > +
> > +Postcopy moves through a series of states (see postcopy_state) from
> > +ADVISE->LISTEN->RUNNING->END
> > +
> > +  Advise: Set at the start of migration if postcopy is enabled, even
> > +          if it hasn't had the start command; here the destination
> > +          checks that its OS has the support needed for postcopy, and performs
> > +          setup to ensure the RAM mappings are suitable for later postcopy.
> > +          (Triggered by reception of POSTCOPY_ADVISE command)
> > +
> > +  Listen: The first command in the package, POSTCOPY_LISTEN, switches
> > +          the destination state to Listen, and starts a new thread
> > +          (the 'listen thread') which takes over the job of receiving
> > +          pages off the migration stream, while the main thread carries
> > +          on processing the blob.  With this thread able to process page
> > +          reception, the destination now 'sensitises' the RAM to detect
> > +          any access to missing pages (on Linux using the 'userfault'
> > +          system).
> > +
> > +  Running: POSTCOPY_RUN causes the destination to synchronise all
> > +          state and start the CPUs and IO devices running.  The main
> > +          thread now finishes processing the migration package and
> > +          carries on as it would for normal precopy migration
> > +          (although it can't do the cleanup it would do as it
> > +          finishes a normal migration).
> > +
> > +  End: The listen thread can now quit and perform the cleanup of migration
> > +          state; the migration is now complete.
> > +
> > +=== Source side page maps ===
> > +
> > +The source side keeps two bitmaps during postcopy; 'the migration bitmap'
> > +and 'sent map'.  The 'migration bitmap' is basically the same as in
> > +the precopy case, and holds a bit to indicate that a page is 'dirty' -
> > +i.e. needs sending.  During the precopy phase this is updated as the CPU
> > +dirties pages, however during postcopy the CPUs are stopped and nothing
> > +should dirty anything any more.
> > +
> > +The 'sent map' is used for the transition to postcopy. It is a bitmap that
> > +has a bit set whenever a page is sent to the destination, however during
> > +the transition to postcopy mode it is masked against the migration bitmap
> > +(sentmap &= migrationbitmap) to generate a bitmap recording pages that
> > +have previously been sent but are now dirty again.  This masked
> > +sentmap is sent to the destination, which discards those now-dirty pages
> > +before starting the CPUs.
> > +
> > +Note that once in postcopy mode, the sent map is still updated; however,
> > +its contents are not necessarily consistent with the pages already sent
> > +due to the masking with the migration bitmap.
> > +
> > +=== Destination side page maps ===
> > +
> > +(Needs to be changed so we can update both easily - at the moment updates are done
> > + with a lock)
> > +The destination keeps a state for each page which is 'missing', 'received'
> > +or 'requested'; these three states are encoded in a 2 bit state array.
> > +Incoming requests from the kernel cause the state to transition from 'missing'
> > +to 'requested'.   Received pages cause a transition from either 'missing' or
> > +'requested' to 'received'; the kernel is notified on reception to wake up
> > +any threads that were waiting for the page.
> > +If the kernel requests a page that has already been 'received' the kernel is
> > +notified without re-requesting.
> > +
> > +This leads to three valid page states:
> > +page states:
> > +    missing        - page not yet received or requested
> > +    received       - page received
> > +    requested      - page requested but not yet received
> > +
> > +state transitions:
> > +      received -> missing   (only during setup/discard)
> > +      missing -> received   (normal incoming page)
> > +      requested -> received (incoming page previously requested)
> > +      missing -> requested  (userfault request)
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
  2015-03-05  9:21     ` Dr. David Alan Gilbert
@ 2015-03-10  1:04       ` David Gibson
  2015-03-13 13:07         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  1:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Thu, Mar 05, 2015 at 09:21:39AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:24PM +0000, Dr. David Alan Gilbert
> (git) wrote:
[snip]
> > > +=== Enabling postcopy ===
> > > +
> > > +To enable postcopy (prior to the start of migration):
> > > +
> > > +migrate_set_capability x-postcopy-ram on
> > > +
> > > +The migration will still start in precopy mode, however issuing:
> > > +
> > > +migrate_start_postcopy
> > > +
> > > +will now cause the transition from precopy to postcopy.
> > > +It can be issued immediately after migration is started or any
> > > +time later on.  Issuing it after the end of a migration is harmless.
> > 
> > It's not quite clear to me what this means.  Does
> > "migrate_start_postcopy" mean it will immediately transfer execution
> > and transfer any remaining pages postcopy, or does it just mean it
> > will start postcopying once the remaining data to transfer is small
> > enough?
> 
> Yes; it will flip into postcopy soon after issuing that command irrespective
> of the amount of data remaining.
> 
> > What's the reason for this rather awkward two stage activation of
> > postcopy?
> 
> We need to keep track of the pages that are received during the precopy phase,
> and do some madvise and other setup on the destination RAM area before precopy
> starts; so we need to know early on that we might want to do postcopy, which is
> why we have to be told in advance.  In the earliest posted version of my patches
> I had a time-limit setting, and after the time limit expired QEMU would switch
> into the second phase of postcopy by itself, but Paolo suggested migrate_start_postcopy:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2014-07/msg00943.html
> 
> and it works out simpler anyway.

Ok, that makes sense.

> > > +=== Postcopy device transfer ===
> > > +
> > > +Loading of device data may cause the device emulation to access guest RAM,
> > > +which may trigger faults that have to be resolved by the source; as such
> > > +the migration stream has to be able to respond with page data *during* the
> > > +device load.  Hence the device data has to be read from the stream completely,
> > > +to free the stream up, before the device load begins.  This is achieved by
> > > +'packaging' the device data into a blob that's read in one go.
> > > +
> > > +Source behaviour
> > > +
> > > +Until postcopy is entered the migration stream is identical to normal
> > > +precopy, except for the addition of a 'postcopy advise' command at
> > > +the beginning, to tell the destination that postcopy might happen.
> > > +When postcopy starts the source sends the page discard data and then
> > > +forms the 'package' containing:
> > > +
> > > +   Command: 'postcopy ram listen'
> > > +   The device state
> > > +      A series of sections, identical to the precopy stream's device state stream
> > > +      containing everything except postcopiable devices (i.e. RAM)
> > > +   Command: 'postcopy ram run'
> > > +
> > > +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> > > +contents are formatted in the same way as the main migration stream.
> > 
> > It seems to me the "ram listen", "ram run" and CMD_PACKAGED really
> > have to be used in conjunction this way; they don't really have any use
> > on their own.  So why not make it all CMD_POSTCOPY_TRANSITION and have
> > the "listen" and "run" take effect implicitly at the beginning and end
> > of the device data?
> 
> CMD_PACKAGED seems like something that is generally useful; it's fairly
> complicated on its own, and so it seemed best to keep it separate.

And can you actually think of another use case for it?

The thing that bothers me is that the "listen" and "run" operations
will not work correctly anywhere other than at the beginning and end
of the packaged blob.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin Dr. David Alan Gilbert (git)
@ 2015-03-10  1:05   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  1:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:25PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Split qemu_save_state_begin to:
>   qemu_save_state_header   That writes the initial file header.
>   qemu_save_state_begin    That sets up devices and does the first
>                            device pass.
> 
> Used later in postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte Dr. David Alan Gilbert (git)
@ 2015-03-10  1:12   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  1:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:27PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> and use it in loadvm_state and ram_load.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Pure cleanup, no change to migration stream.
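
(For anyone skimming the thread: the wire format here is just a single count
byte followed by that many characters.  A rough sketch, with an assumed
signature rather than the one in the patch:)

    /* Illustrative sketch: read a string preceded by a one-byte length.
     * qemu_get_byte()/qemu_get_buffer() are the existing QEMUFile accessors. */
    static unsigned int get_counted_string(QEMUFile *f, char buf[256])
    {
        unsigned int len = qemu_get_byte(f);        /* count byte: 0..255 */

        qemu_get_buffer(f, (uint8_t *)buf, len);    /* body, no NUL on the wire */
        buf[len] = '\0';

        return len;
    }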

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState Dr. David Alan Gilbert (git)
@ 2015-03-10  2:37   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  2:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:28PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> There are currently lots of pieces of incoming migration state scattered
> around, and postcopy is adding more, and it seems better to try and keep
> it together.
> 
> allocate MIS in process_incoming_migration_co
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information Dr. David Alan Gilbert (git)
@ 2015-03-10  2:38   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  2:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:29PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The migration code generally is built target-independent, however
> there are a few places where knowing the target page size would
> avoid artificially moving stuff into arch_init.
> 
> Provide 'qemu_target_page_bits()' that returns TARGET_PAGE_BITS
> to other bits of code so that they can stay target-independent.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets Dr. David Alan Gilbert (git)
@ 2015-03-10  2:49   ` David Gibson
  2015-03-13 13:14     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  2:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:30PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Postcopy needs a method to send messages from the destination back to
> the source, this is the 'return path'.
> 
> Wire it up for 'socket' QEMUFile's using a dup'd fd.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/qemu-file.h  |  7 +++++
>  migration/qemu-file-internal.h |  2 ++
>  migration/qemu-file-unix.c     | 58 +++++++++++++++++++++++++++++++++++-------
>  migration/qemu-file.c          | 12 +++++++++
>  4 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
> index 6ae0b03..3c38963 100644
> --- a/include/migration/qemu-file.h
> +++ b/include/migration/qemu-file.h
> @@ -85,6 +85,11 @@ typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
>                                 int *bytes_sent);
>  
>  /*
> + * Return a QEMUFile for comms in the opposite direction
> + */
> +typedef QEMUFile *(QEMURetPathFunc)(void *opaque);
> +
> +/*
>   * Stop any read or write (depending on flags) on the underlying
>   * transport on the QEMUFile.
>   * Existing blocking reads/writes must be woken
> @@ -102,6 +107,7 @@ typedef struct QEMUFileOps {
>      QEMURamHookFunc *after_ram_iterate;
>      QEMURamHookFunc *hook_ram_load;
>      QEMURamSaveFunc *save_page;
> +    QEMURetPathFunc *get_return_path;
>      QEMUFileShutdownFunc *shut_down;
>  } QEMUFileOps;
>  
> @@ -188,6 +194,7 @@ int64_t qemu_file_get_rate_limit(QEMUFile *f);
>  int qemu_file_get_error(QEMUFile *f);
>  void qemu_file_set_error(QEMUFile *f, int ret);
>  int qemu_file_shutdown(QEMUFile *f);
> +QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>  void qemu_fflush(QEMUFile *f);
>  
>  static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
> diff --git a/migration/qemu-file-internal.h b/migration/qemu-file-internal.h
> index d95e853..a39b8e3 100644
> --- a/migration/qemu-file-internal.h
> +++ b/migration/qemu-file-internal.h
> @@ -48,6 +48,8 @@ struct QEMUFile {
>      unsigned int iovcnt;
>  
>      int last_error;
> +
> +    struct QEMUFile *return_path;

AFAICT, the only thing this field is used for is an assert, which
seems a bit pointless.  I'd suggest either getting rid of it, or
making qemu_file_get_return_path() safely idempotent by having it only
call the FileOps pointer if QEMUFile::return_path is still NULL,
and otherwise just return the existing return_path.

Setting the field probably belongs better in the wrapper than in the
socket-specific callback, too, since there's nothing inherently
related to the socket implementation about it.
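
Something along these lines is what's being suggested - just a sketch against
the structures quoted above, assuming the usual ops/opaque fields in QEMUFile;
untested:

    QEMUFile *qemu_file_get_return_path(QEMUFile *f)
    {
        if (f->return_path) {
            /* Already opened once: hand back the same QEMUFile. */
            return f->return_path;
        }
        if (!f->ops->get_return_path) {
            return NULL;
        }
        f->return_path = f->ops->get_return_path(f->opaque);
        return f->return_path;
    }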

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's Dr. David Alan Gilbert (git)
@ 2015-03-10  2:56   ` David Gibson
  2015-03-10 13:35     ` Dr. David Alan Gilbert
  2015-03-28 15:30   ` Paolo Bonzini
  1 sibling, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  2:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:31PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The return path uses a non-blocking fd so as not to block waiting
> for the (possibly broken) destination to finish returning a message;
> however, we still want outbound data to behave in the same way and block.

It's not clear to me from this description exactly where the situation
arises in which you need to write to the non-blocking socket.  Is it on the
source or the destination?  If the source, why are you writing to the
return path?  If the destination, why are you marking the outgoing
return path as non-blocking?

> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/qemu-file-unix.c | 41 ++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 36 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/qemu-file-unix.c b/migration/qemu-file-unix.c
> index 50291cf..218dbd0 100644
> --- a/migration/qemu-file-unix.c
> +++ b/migration/qemu-file-unix.c
> @@ -39,12 +39,43 @@ static ssize_t socket_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
>      QEMUFileSocket *s = opaque;
>      ssize_t len;
>      ssize_t size = iov_size(iov, iovcnt);
> +    ssize_t offset = 0;
> +    int     err;
>  
> -    len = iov_send(s->fd, iov, iovcnt, 0, size);
> -    if (len < size) {
> -        len = -socket_error();
> -    }
> -    return len;
> +    while (size > 0) {
> +        len = iov_send(s->fd, iov, iovcnt, offset, size);
> +
> +        if (len > 0) {
> +            size -= len;
> +            offset += len;
> +        }
> +
> +        if (size > 0) {
> +            err = socket_error();
> +
> +            if (err != EAGAIN) {
> +                error_report("socket_writev_buffer: Got err=%d for (%zd/%zd)",
> +                             err, size, len);
> +                /*
> +                 * If I've already sent some but only just got the error, I
> +                 * could return the amount validly sent so far and wait for the
> +                 * next call to report the error, but I'd rather flag the error
> +                 * immediately.
> +                 */
> +                return -err;
> +            }
> +
> +            /* Emulate blocking */
> +            GPollFD pfd;
> +
> +            pfd.fd = s->fd;
> +            pfd.events = G_IO_OUT | G_IO_ERR;
> +            pfd.revents = 0;
> +            g_poll(&pfd, 1 /* 1 fd */, -1 /* no timeout */);
> +        }
> +     }
> +
> +    return offset;
>  }
>  
>  static int socket_get_fd(void *opaque)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 09/45] Migration commands
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 09/45] Migration commands Dr. David Alan Gilbert (git)
@ 2015-03-10  4:58   ` David Gibson
  2015-03-10 11:04     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  4:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:32PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Create QEMU_VM_COMMAND section type for sending commands from
> source to destination.  These commands are not intended to convey
> guest state but to control the migration process.
> 
> For use in postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

[snip]
> +/* Send a 'QEMU_VM_COMMAND' type element with the command
> + * and associated data.
> + */
> +void qemu_savevm_command_send(QEMUFile *f,
> +                              enum qemu_vm_cmd command,
> +                              uint16_t len,
> +                              uint8_t *data)
> +{
> +    uint32_t tmp = (uint16_t)command;

Erm.. cast to u16, assign to u32, then send as u16?  What's up with
that?

> +    qemu_put_byte(f, QEMU_VM_COMMAND);
> +    qemu_put_be16(f, tmp);
> +    qemu_put_be16(f, len);
> +    if (len) {
> +        qemu_put_buffer(f, data, len);
> +    }
> +    qemu_fflush(f);
> +}
> +
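
(Coming back to the cast above: presumably keeping the value as 16 bits
throughout is all that's needed - a sketch only:)

    uint16_t tmp = command;     /* the command number already fits in 16 bits */

    qemu_put_byte(f, QEMU_VM_COMMAND);
    qemu_put_be16(f, tmp);
    qemu_put_be16(f, len);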

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 10/45] Return path: Control commands
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 10/45] Return path: Control commands Dr. David Alan Gilbert (git)
@ 2015-03-10  5:40   ` David Gibson
  2015-03-28 15:32   ` Paolo Bonzini
  1 sibling, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  5:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:33PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add two src->dest commands:
>    * OPEN_RETURN_PATH - To request that the destination open the return path
>    * SEND_PING - Request an acknowledge from the destination
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source Dr. David Alan Gilbert (git)
@ 2015-03-10  5:47   ` David Gibson
  2015-03-10 14:34     ` Dr. David Alan Gilbert
  2015-03-28 15:34     ` Paolo Bonzini
  0 siblings, 2 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10  5:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:34PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add migrate_send_rp_message to send a message from destination to source along the return path.
>   (It uses a mutex to let it be called from multiple threads)
> Add migrate_send_rp_shut to send a 'shut' message to indicate
>   the destination is finished with the RP.
> Add migrate_send_rp_ack to send a 'PONG' message in response to a PING
>   Use it in the CMD_PING handler
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h | 17 ++++++++++++++++
>  migration/migration.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
>  savevm.c                      |  2 +-
>  trace-events                  |  1 +
>  4 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index c514dd4..6775747 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -41,6 +41,13 @@ struct MigrationParams {
>      bool shared;
>  };
>  
> +/* Commands sent on the return path from destination to source*/
> +enum mig_rpcomm_cmd {

"command" doesn't seem like quite the right description for these rp
messages.

> +    MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
> +    MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
> +    MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> +};
> +
>  typedef struct MigrationState MigrationState;
>  
>  /* State for the incoming migration */
> @@ -48,6 +55,7 @@ struct MigrationIncomingState {
>      QEMUFile *file;
>  
>      QEMUFile *return_path;
> +    QemuMutex      rp_mutex;    /* We send replies from multiple threads */
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -169,6 +177,15 @@ int64_t migrate_xbzrle_cache_size(void);
>  
>  int64_t xbzrle_cache_resize(int64_t new_size);
>  
> +/* Sending on the return path - generic and then for each message type */
> +void migrate_send_rp_message(MigrationIncomingState *mis,
> +                             enum mig_rpcomm_cmd cmd,
> +                             uint16_t len, uint8_t *data);
> +void migrate_send_rp_shut(MigrationIncomingState *mis,
> +                          uint32_t value);
> +void migrate_send_rp_pong(MigrationIncomingState *mis,
> +                          uint32_t value);
> +
>  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_load_hook(QEMUFile *f, uint64_t flags);
> diff --git a/migration/migration.c b/migration/migration.c
> index a36ea65..80d234c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
>  {
>      mis_current = g_malloc0(sizeof(MigrationIncomingState));
>      mis_current->file = f;
> +    qemu_mutex_init(&mis_current->rp_mutex);
>  
>      return mis_current;
>  }
> @@ -88,6 +89,50 @@ void migration_incoming_state_destroy(void)
>      mis_current = NULL;
>  }
>  
> +/*
> + * Send a message on the return channel back to the source
> + * of the migration.
> + */
> +void migrate_send_rp_message(MigrationIncomingState *mis,
> +                             enum mig_rpcomm_cmd cmd,
> +                             uint16_t len, uint8_t *data)

Using (void *) for data would avoid casts in a bunch of the callers.

> +{
> +    trace_migrate_send_rp_message((int)cmd, len);
> +    qemu_mutex_lock(&mis->rp_mutex);
> +    qemu_put_be16(mis->return_path, (unsigned int)cmd);
> +    qemu_put_be16(mis->return_path, len);
> +    qemu_put_buffer(mis->return_path, data, len);
> +    qemu_fflush(mis->return_path);
> +    qemu_mutex_unlock(&mis->rp_mutex);
> +}
> +
> +/*
> + * Send a 'SHUT' message on the return channel with the given value
> + * to indicate that we've finished with the RP.  A non-zero value indicates
> + * error.
> + */
> +void migrate_send_rp_shut(MigrationIncomingState *mis,
> +                          uint32_t value)
> +{
> +    uint32_t buf;
> +
> +    buf = cpu_to_be32(value);
> +    migrate_send_rp_message(mis, MIG_RP_CMD_SHUT, 4, (uint8_t *)&buf);

                                                     ^ sizeof(buf)
						     would be safer

> +}
> +
> +/*
> + * Send a 'PONG' message on the return channel with the given value
> + * (normally in response to a 'PING')
> + */
> +void migrate_send_rp_pong(MigrationIncomingState *mis,
> +                          uint32_t value)
> +{
> +    uint32_t buf;
> +
> +    buf = cpu_to_be32(value);
> +    migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);

It occurs to me that you could define PONG as returning the whole
buffer that PING sends, instead of just 4 bytes.  That might allow for some
more testing of variable-sized messages.

> +}
> +
>  void qemu_start_incoming_migration(const char *uri, Error **errp)
>  {
>      const char *p;
> diff --git a/savevm.c b/savevm.c
> index d082738..7084d07 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1008,7 +1008,7 @@ static int loadvm_process_command(QEMUFile *f)
>                           tmp32);
>              return -1;
>          }
> -        /* migrate_send_rp_pong(mis, tmp32); TODO: gets added later */
> +        migrate_send_rp_pong(mis, tmp32);
>          break;
>  
>      default:
> diff --git a/trace-events b/trace-events
> index 99e00b5..4f3eff8 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1379,6 +1379,7 @@ migrate_fd_cleanup(void) ""
>  migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
>  migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
> +migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
>  
>  # migration/rdma.c

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path Dr. David Alan Gilbert (git)
@ 2015-03-10  6:08   ` David Gibson
  2015-03-20 18:17     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  6:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:35PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Open a return path, and handle messages that are received upon it.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |   8 ++
>  migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
>  trace-events                  |  13 +++
>  3 files changed, 198 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 6775747..5242ead 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -73,6 +73,14 @@ struct MigrationState
>  
>      int state;
>      MigrationParams params;
> +
> +    /* State related to return path */
> +    struct {
> +        QEMUFile     *file;
> +        QemuThread    rp_thread;
> +        bool          error;
> +    } rp_state;
> +
>      double mbps;
>      int64_t total_time;
>      int64_t downtime;
> diff --git a/migration/migration.c b/migration/migration.c
> index 80d234c..34cd4fe 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
>      return head;
>  }
>  
> +/*
> + * Return true if we're already in the middle of a migration
> + * (i.e. any of the active or setup states)
> + */
> +static bool migration_already_active(MigrationState *ms)
> +{
> +    switch (ms->state) {
> +    case MIG_STATE_ACTIVE:
> +    case MIG_STATE_SETUP:
> +        return true;
> +
> +    default:
> +        return false;
> +
> +    }
> +}
> +
>  static void get_xbzrle_cache_stats(MigrationInfo *info)
>  {
>      if (migrate_use_xbzrle()) {
> @@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
>      }
>  }
>  
> +static void migrate_fd_cleanup_src_rp(MigrationState *ms)
> +{
> +    QEMUFile *rp = ms->rp_state.file;
> +
> +    /*
> +     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
> +     * cleaned up from a few threads; make sure not to do it twice in parallel
> +     */
> +    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);

A cmpxchg seems dangerously subtle for such a basic and infrequent
operation, but ok.

> +    if (rp) {
> +        trace_migrate_fd_cleanup_src_rp();
> +        qemu_fclose(rp);
> +    }
> +}
> +
>  static void migrate_fd_cleanup(void *opaque)
>  {
>      MigrationState *s = opaque;
> @@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
>      qemu_bh_delete(s->cleanup_bh);
>      s->cleanup_bh = NULL;
>  
> +    migrate_fd_cleanup_src_rp(s);
> +
>      if (s->file) {
>          trace_migrate_fd_cleanup();
>          qemu_mutex_unlock_iothread();
> @@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
>      QEMUFile *f = migrate_get_current()->file;
>      trace_migrate_fd_cancel();
>  
> +    if (s->rp_state.file) {
> +        /* shutdown the rp socket, so causing the rp thread to shutdown */
> +        qemu_file_shutdown(s->rp_state.file);

I missed where qemu_file_shutdown() was implemented.  Does this
introduce a leftover socket dependency?

> +    }
> +
>      do {
>          old_state = s->state;
>          if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
> @@ -658,8 +697,145 @@ int64_t migrate_xbzrle_cache_size(void)
>      return s->xbzrle_cache_size;
>  }
>  
> -/* migration thread support */
> +/*
> + * Something bad happened to the RP stream, mark an error
> + * The caller shall print something to indicate why
> + */
> +static void source_return_path_bad(MigrationState *s)
> +{
> +    s->rp_state.error = true;
> +    migrate_fd_cleanup_src_rp(s);
> +}
> +
> +/*
> + * Handles messages sent on the return path towards the source VM
> + *
> + */
> +static void *source_return_path_thread(void *opaque)
> +{
> +    MigrationState *ms = opaque;
> +    QEMUFile *rp = ms->rp_state.file;
> +    uint16_t expected_len, header_len, header_com;
> +    const int max_len = 512;
> +    uint8_t buf[max_len];
> +    uint32_t tmp32;
> +    int res;
> +
> +    trace_source_return_path_thread_entry();
> +    while (rp && !qemu_file_get_error(rp) &&
> +        migration_already_active(ms)) {
> +        trace_source_return_path_thread_loop_top();
> +        header_com = qemu_get_be16(rp);
> +        header_len = qemu_get_be16(rp);
> +
> +        switch (header_com) {
> +        case MIG_RP_CMD_SHUT:
> +        case MIG_RP_CMD_PONG:
> +            expected_len = 4;

Could the knowledge of expected lengths be folded into the switch
below?  Switching twice on the same thing is a bit icky.
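
Something like this, perhaps - a sketch only, untested: do one bounded read of
the data, then validate the length and handle the message in a single switch
(the error messages are placeholders):

    if (header_len > max_len) {
        error_report("RP: cmd 0x%04x, length 0x%04x too long",
                     header_com, header_len);
        source_return_path_bad(ms);
        goto out;
    }
    qemu_get_buffer(rp, buf, header_len);

    switch (header_com) {
    case MIG_RP_CMD_SHUT:
        if (header_len != 4) {
            goto bad_len;
        }
        tmp32 = be32_to_cpup((uint32_t *)buf);
        /* ... existing SHUT handling ... */
        goto out;

    case MIG_RP_CMD_PONG:
        if (header_len != 4) {
            goto bad_len;
        }
        tmp32 = be32_to_cpup((uint32_t *)buf);
        /* ... existing PONG handling ... */
        break;

    default:
    bad_len:
        error_report("RP: invalid cmd 0x%04x or length 0x%04x",
                     header_com, header_len);
        source_return_path_bad(ms);
        goto out;
    }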

> +            break;
> +
> +        default:
> +            error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> +                    header_com, header_len);
> +            source_return_path_bad(ms);
> +            goto out;
> +        }
>  
> +        if (header_len > expected_len) {
> +            error_report("RP: Received command 0x%04x with"
> +                    "incorrect length %d expecting %d",
> +                    header_com, header_len,
> +                    expected_len);
> +            source_return_path_bad(ms);
> +            goto out;
> +        }
> +
> +        /* We know we've got a valid header by this point */
> +        res = qemu_get_buffer(rp, buf, header_len);
> +        if (res != header_len) {
> +            trace_source_return_path_thread_failed_read_cmd_data();
> +            source_return_path_bad(ms);
> +            goto out;
> +        }
> +
> +        /* OK, we have the command and the data */
> +        switch (header_com) {
> +        case MIG_RP_CMD_SHUT:
> +            tmp32 = be32_to_cpup((uint32_t *)buf);
> +            trace_source_return_path_thread_shut(tmp32);
> +            if (tmp32) {
> +                error_report("RP: Sibling indicated error %d", tmp32);
> +                source_return_path_bad(ms);
> +            }
> +            /*
> +             * We'll let the main thread deal with closing the RP
> +             * we could do a shutdown(2) on it, but we're the only user
> +             * anyway, so there's nothing gained.
> +             */
> +            goto out;
> +
> +        case MIG_RP_CMD_PONG:
> +            tmp32 = be32_to_cpup((uint32_t *)buf);
> +            trace_source_return_path_thread_pong(tmp32);
> +            break;
> +
> +        default:
> +            /* This shouldn't happen because we should catch this above */
> +            trace_source_return_path_bad_header_com();
> +        }
> +        /* Latest command processed, now leave a gap for the next one */
> +        header_com = MIG_RP_CMD_INVALID;

This assignment will always get overwritten.

> +    }
> +    if (rp && qemu_file_get_error(rp)) {
> +        trace_source_return_path_thread_bad_end();
> +        source_return_path_bad(ms);
> +    }
> +
> +    trace_source_return_path_thread_end();
> +out:
> +    return NULL;
> +}
> +
> +__attribute__ (( unused )) /* Until later in patch series */
> +static int open_outgoing_return_path(MigrationState *ms)

Uh.. surely this should be open_incoming_return_path(); it's designed
to be used on the source side, AFAICT.

> +{
> +
> +    ms->rp_state.file = qemu_file_get_return_path(ms->file);
> +    if (!ms->rp_state.file) {
> +        return -1;
> +    }
> +
> +    trace_open_outgoing_return_path();
> +    qemu_thread_create(&ms->rp_state.rp_thread, "return path",
> +                       source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
> +
> +    trace_open_outgoing_return_path_continue();
> +
> +    return 0;
> +}
> +
> +__attribute__ (( unused )) /* Until later in patch series */
> +static void await_outgoing_return_path_close(MigrationState *ms)

Likewise "incoming" here, surely.

> +{
> +    /*
> +     * If this is a normal exit then the destination will send a SHUT and the
> +     * rp_thread will exit, however if there's an error we need to cause
> +     * it to exit, which we can do by a shutdown.
> +     * (canceling must also shutdown to stop us getting stuck here if
> +     * the destination died at just the wrong place)
> +     */
> +    if (qemu_file_get_error(ms->file) && ms->rp_state.file) {
> +        qemu_file_shutdown(ms->rp_state.file);
> +    }
> +    trace_await_outgoing_return_path_joining();
> +    qemu_thread_join(&ms->rp_state.rp_thread);
> +    trace_await_outgoing_return_path_close();
> +}
> +
> +/*
> + * Master migration thread on the source VM.
> + * It drives the migration and pumps the data down the outgoing channel.
> + */
>  static void *migration_thread(void *opaque)
>  {
>      MigrationState *s = opaque;
> diff --git a/trace-events b/trace-events
> index 4f3eff8..1951b25 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1374,12 +1374,25 @@ flic_no_device_api(int err) "flic: no Device Contral API support %d"
>  flic_reset_failed(int err) "flic: reset failed %d"
>  
>  # migration.c
> +await_outgoing_return_path_close(void) ""
> +await_outgoing_return_path_joining(void) ""
>  migrate_set_state(int new_state) "new state %d"
>  migrate_fd_cleanup(void) ""
> +migrate_fd_cleanup_src_rp(void) ""
>  migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
>  migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
>  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> +open_outgoing_return_path(void) ""
> +open_outgoing_return_path_continue(void) ""
> +source_return_path_thread_bad_end(void) ""
> +source_return_path_bad_header_com(void) ""
> +source_return_path_thread_end(void) ""
> +source_return_path_thread_entry(void) ""
> +source_return_path_thread_failed_read_cmd_data(void) ""
> +source_return_path_thread_loop_top(void) ""
> +source_return_path_thread_pong(uint32_t val) "%x"
> +source_return_path_thread_shut(uint32_t val) "%x"
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
>  
>  # migration/rdma.c

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text Dr. David Alan Gilbert (git)
@ 2015-03-10  6:11   ` David Gibson
  2015-03-20 18:48     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  6:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Wed, Feb 25, 2015 at 04:51:36PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Misses out lines that are all the expected value so the output
> can be quite compact depending on the circumstance.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c                   | 39 +++++++++++++++++++++++++++++++++++++++
>  include/migration/migration.h |  1 +
>  2 files changed, 40 insertions(+)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 91645cc..fe0df0d 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -776,6 +776,45 @@ static void reset_ram_globals(void)
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
>  
> +/*
> + * 'expected' is the value you expect the bitmap mostly to be full
> + * of and it won't bother printing lines that are all this value
> + * if 'todump' is null the migration bitmap is dumped.
> + */
> +void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
> +{
> +    int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
> +
> +    int64_t cur;
> +    int64_t linelen = 128;
> +    char linebuf[129];
> +
> +    if (!todump) {
> +        todump = migration_bitmap;

Any reason not to just have the caller pass migration_bitmap, if
that's what they want?

> +    }
> +
> +    for (cur = 0; cur < ram_pages; cur += linelen) {
> +        int64_t curb;
> +        bool found = false;
> +        /*
> +         * Last line; catch the case where the line length
> +         * is longer than remaining ram
> +         */
> +        if (cur+linelen > ram_pages) {
> +            linelen = ram_pages - cur;
> +        }
> +        for (curb = 0; curb < linelen; curb++) {
> +            bool thisbit = test_bit(cur+curb, todump);
> +            linebuf[curb] = thisbit ? '1' : '.';
> +            found = found || (thisbit != expected);
> +        }
> +        if (found) {
> +            linebuf[curb] = '\0';
> +            fprintf(stderr,  "0x%08" PRIx64 " : %s\n", cur, linebuf);

Might be slightly more readable if it printed GPAs instead of page
numbers.
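
E.g. something like this (hypothetical, untested) - shifting the page
index up by TARGET_PAGE_BITS so each line is labelled with the ram
address of its first page rather than the raw page number:

    fprintf(stderr, "0x%08" PRIx64 " : %s\n",
            (uint64_t)cur << TARGET_PAGE_BITS, linebuf);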

> +        }
> +    }
> +}
> +
>  static int ram_save_setup(QEMUFile *f, void *opaque)
>  {
>      RAMBlock *block;
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 5242ead..3776e86 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -156,6 +156,7 @@ uint64_t xbzrle_mig_pages_cache_miss(void);
>  double xbzrle_mig_cache_miss_rate(void);
>  
>  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
> +void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
>  
>  /**
>   * @migrate_add_blocker - prevent migration from proceeding

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState Dr. David Alan Gilbert (git)
@ 2015-03-10  6:19   ` David Gibson
  2015-03-10 10:12     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-10  6:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Wed, Feb 25, 2015 at 04:51:37PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> In postcopy we need the loadvm_handlers to be used in a couple
> of different instances of the loadvm loop/routine, and thus
> it can't be local any more.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |  5 +++++
>  include/migration/vmstate.h   |  2 ++
>  include/qemu/typedefs.h       |  1 +
>  migration/migration.c         |  2 ++
>  savevm.c                      | 28 ++++++++++++++++------------
>  5 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 3776e86..751caa0 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -50,10 +50,15 @@ enum mig_rpcomm_cmd {
>  
>  typedef struct MigrationState MigrationState;
>  
> +typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> +
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *file;
>  
> +    /* See savevm.c */
> +    LoadStateEntry_Head loadvm_handlers;
> +
>      QEMUFile *return_path;
>      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
>  };
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index c20f2d1..18da207 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -797,6 +797,8 @@ extern const VMStateInfo vmstate_info_bitmap;
>  
>  #define SELF_ANNOUNCE_ROUNDS 5
>  
> +void loadvm_free_handlers(MigrationIncomingState *mis);
> +
>  int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
>                         void *opaque, int version_id);
>  void vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 74dfad3..6fdcbcd 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -31,6 +31,7 @@ typedef struct I2CBus I2CBus;
>  typedef struct I2SCodec I2SCodec;
>  typedef struct ISABus ISABus;
>  typedef struct ISADevice ISADevice;
> +typedef struct LoadStateEntry LoadStateEntry;
>  typedef struct MACAddr MACAddr;
>  typedef struct MachineClass MachineClass;
>  typedef struct MachineState MachineState;
> diff --git a/migration/migration.c b/migration/migration.c
> index 34cd4fe..4592060 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
>  {
>      mis_current = g_malloc0(sizeof(MigrationIncomingState));
>      mis_current->file = f;
> +    QLIST_INIT(&mis_current->loadvm_handlers);
>      qemu_mutex_init(&mis_current->rp_mutex);
>  
>      return mis_current;
> @@ -85,6 +86,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
>  
>  void migration_incoming_state_destroy(void)
>  {
> +    loadvm_free_handlers(mis_current);

AFAICT this is the only caller of loadvm_free_handlers(), so why not
just open-code it here?

>      g_free(mis_current);
>      mis_current = NULL;
>  }
> diff --git a/savevm.c b/savevm.c
> index 7084d07..f42713d 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1019,18 +1019,26 @@ static int loadvm_process_command(QEMUFile *f)
>      return 0;
>  }
>  
> -typedef struct LoadStateEntry {
> +struct LoadStateEntry {

Why remove the typedef?

>      QLIST_ENTRY(LoadStateEntry) entry;
>      SaveStateEntry *se;
>      int section_id;
>      int version_id;
> -} LoadStateEntry;
> +};
>  
> -int qemu_loadvm_state(QEMUFile *f)
> +void loadvm_free_handlers(MigrationIncomingState *mis)
>  {
> -    QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
> -        QLIST_HEAD_INITIALIZER(loadvm_handlers);
>      LoadStateEntry *le, *new_le;
> +
> +    QLIST_FOREACH_SAFE(le, &mis->loadvm_handlers, entry, new_le) {
> +        QLIST_REMOVE(le, entry);
> +        g_free(le);
> +    }
> +}
> +
> +int qemu_loadvm_state(QEMUFile *f)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>      Error *local_err = NULL;
>      uint8_t section_type;
>      unsigned int v;
> @@ -1061,6 +1069,7 @@ int qemu_loadvm_state(QEMUFile *f)
>      while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
>          uint32_t instance_id, version_id, section_id;
>          SaveStateEntry *se;
> +        LoadStateEntry *le;
>          char idstr[256];
>  
>          trace_qemu_loadvm_state_section(section_type);
> @@ -1102,7 +1111,7 @@ int qemu_loadvm_state(QEMUFile *f)
>              le->se = se;
>              le->section_id = section_id;
>              le->version_id = version_id;
> -            QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
> +            QLIST_INSERT_HEAD(&mis->loadvm_handlers, le, entry);
>  
>              ret = vmstate_load(f, le->se, le->version_id);
>              if (ret < 0) {
> @@ -1116,7 +1125,7 @@ int qemu_loadvm_state(QEMUFile *f)
>              section_id = qemu_get_be32(f);
>  
>              trace_qemu_loadvm_state_section_partend(section_id);
> -            QLIST_FOREACH(le, &loadvm_handlers, entry) {
> +            QLIST_FOREACH(le, &mis->loadvm_handlers, entry) {
>                  if (le->section_id == section_id) {
>                      break;
>                  }
> @@ -1152,11 +1161,6 @@ int qemu_loadvm_state(QEMUFile *f)
>      ret = 0;
>  
>  out:
> -    QLIST_FOREACH_SAFE(le, &loadvm_handlers, entry, new_le) {
> -        QLIST_REMOVE(le, entry);
> -        g_free(le);
> -    }
> -
>      if (ret == 0) {
>          ret = qemu_file_get_error(f);
>      }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState
  2015-03-10  6:19   ` David Gibson
@ 2015-03-10 10:12     ` Dr. David Alan Gilbert
  2015-03-10 11:03       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-10 10:12 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:37PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > In postcopy we need the loadvm_handlers to be used in a couple
> > of different instances of the loadvm loop/routine, and thus
> > it can't be local any more.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |  5 +++++
> >  include/migration/vmstate.h   |  2 ++
> >  include/qemu/typedefs.h       |  1 +
> >  migration/migration.c         |  2 ++
> >  savevm.c                      | 28 ++++++++++++++++------------
> >  5 files changed, 26 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 3776e86..751caa0 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -50,10 +50,15 @@ enum mig_rpcomm_cmd {
> >  
> >  typedef struct MigrationState MigrationState;
> >  
> > +typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> > +
> >  /* State for the incoming migration */
> >  struct MigrationIncomingState {
> >      QEMUFile *file;
> >  
> > +    /* See savevm.c */
> > +    LoadStateEntry_Head loadvm_handlers;
> > +
> >      QEMUFile *return_path;
> >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> >  };
> > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > index c20f2d1..18da207 100644
> > --- a/include/migration/vmstate.h
> > +++ b/include/migration/vmstate.h
> > @@ -797,6 +797,8 @@ extern const VMStateInfo vmstate_info_bitmap;
> >  
> >  #define SELF_ANNOUNCE_ROUNDS 5
> >  
> > +void loadvm_free_handlers(MigrationIncomingState *mis);
> > +
> >  int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
> >                         void *opaque, int version_id);
> >  void vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
> > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > index 74dfad3..6fdcbcd 100644
> > --- a/include/qemu/typedefs.h
> > +++ b/include/qemu/typedefs.h
> > @@ -31,6 +31,7 @@ typedef struct I2CBus I2CBus;
> >  typedef struct I2SCodec I2SCodec;
> >  typedef struct ISABus ISABus;
> >  typedef struct ISADevice ISADevice;
> > +typedef struct LoadStateEntry LoadStateEntry;
> >  typedef struct MACAddr MACAddr;
> >  typedef struct MachineClass MachineClass;
> >  typedef struct MachineState MachineState;
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 34cd4fe..4592060 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> >  {
> >      mis_current = g_malloc0(sizeof(MigrationIncomingState));
> >      mis_current->file = f;
> > +    QLIST_INIT(&mis_current->loadvm_handlers);
> >      qemu_mutex_init(&mis_current->rp_mutex);
> >  
> >      return mis_current;
> > @@ -85,6 +86,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> >  
> >  void migration_incoming_state_destroy(void)
> >  {
> > +    loadvm_free_handlers(mis_current);
> 
> AFAICT this is the only caller of loadvm_free_handlers(), so why not
> just open-code it here?

I was keeping this handler list as owned by savevm.c; all its allocation
and its use are done in savevm.c, so it would seem odd to open-code
freeing the list in a separate file.

> >      g_free(mis_current);
> >      mis_current = NULL;
> >  }
> > diff --git a/savevm.c b/savevm.c
> > index 7084d07..f42713d 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -1019,18 +1019,26 @@ static int loadvm_process_command(QEMUFile *f)
> >      return 0;
> >  }
> >  
> > -typedef struct LoadStateEntry {
> > +struct LoadStateEntry {
> 
> Why remove the typedef?

Because it is now typedef'd in typedefs.h, and older gcc (RHEL6)
objects to two typedefs even if they're of the same thing.

Dave

> >      QLIST_ENTRY(LoadStateEntry) entry;
> >      SaveStateEntry *se;
> >      int section_id;
> >      int version_id;
> > -} LoadStateEntry;
> > +};
> >  
> > -int qemu_loadvm_state(QEMUFile *f)
> > +void loadvm_free_handlers(MigrationIncomingState *mis)
> >  {
> > -    QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
> > -        QLIST_HEAD_INITIALIZER(loadvm_handlers);
> >      LoadStateEntry *le, *new_le;
> > +
> > +    QLIST_FOREACH_SAFE(le, &mis->loadvm_handlers, entry, new_le) {
> > +        QLIST_REMOVE(le, entry);
> > +        g_free(le);
> > +    }
> > +}
> > +
> > +int qemu_loadvm_state(QEMUFile *f)
> > +{
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> >      Error *local_err = NULL;
> >      uint8_t section_type;
> >      unsigned int v;
> > @@ -1061,6 +1069,7 @@ int qemu_loadvm_state(QEMUFile *f)
> >      while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
> >          uint32_t instance_id, version_id, section_id;
> >          SaveStateEntry *se;
> > +        LoadStateEntry *le;
> >          char idstr[256];
> >  
> >          trace_qemu_loadvm_state_section(section_type);
> > @@ -1102,7 +1111,7 @@ int qemu_loadvm_state(QEMUFile *f)
> >              le->se = se;
> >              le->section_id = section_id;
> >              le->version_id = version_id;
> > -            QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
> > +            QLIST_INSERT_HEAD(&mis->loadvm_handlers, le, entry);
> >  
> >              ret = vmstate_load(f, le->se, le->version_id);
> >              if (ret < 0) {
> > @@ -1116,7 +1125,7 @@ int qemu_loadvm_state(QEMUFile *f)
> >              section_id = qemu_get_be32(f);
> >  
> >              trace_qemu_loadvm_state_section_partend(section_id);
> > -            QLIST_FOREACH(le, &loadvm_handlers, entry) {
> > +            QLIST_FOREACH(le, &mis->loadvm_handlers, entry) {
> >                  if (le->section_id == section_id) {
> >                      break;
> >                  }
> > @@ -1152,11 +1161,6 @@ int qemu_loadvm_state(QEMUFile *f)
> >      ret = 0;
> >  
> >  out:
> > -    QLIST_FOREACH_SAFE(le, &loadvm_handlers, entry, new_le) {
> > -        QLIST_REMOVE(le, entry);
> > -        g_free(le);
> > -    }
> > -
> >      if (ret == 0) {
> >          ret = qemu_file_get_error(f);
> >      }
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState
  2015-03-10 10:12     ` Dr. David Alan Gilbert
@ 2015-03-10 11:03       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10 11:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Tue, Mar 10, 2015 at 10:12:14AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:37PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > In postcopy we need the loadvm_handlers to be used in a couple
> > > of different instances of the loadvm loop/routine, and thus
> > > it can't be local any more.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/migration.h |  5 +++++
> > >  include/migration/vmstate.h   |  2 ++
> > >  include/qemu/typedefs.h       |  1 +
> > >  migration/migration.c         |  2 ++
> > >  savevm.c                      | 28 ++++++++++++++++------------
> > >  5 files changed, 26 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index 3776e86..751caa0 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -50,10 +50,15 @@ enum mig_rpcomm_cmd {
> > >  
> > >  typedef struct MigrationState MigrationState;
> > >  
> > > +typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> > > +
> > >  /* State for the incoming migration */
> > >  struct MigrationIncomingState {
> > >      QEMUFile *file;
> > >  
> > > +    /* See savevm.c */
> > > +    LoadStateEntry_Head loadvm_handlers;
> > > +
> > >      QEMUFile *return_path;
> > >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> > >  };
> > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > > index c20f2d1..18da207 100644
> > > --- a/include/migration/vmstate.h
> > > +++ b/include/migration/vmstate.h
> > > @@ -797,6 +797,8 @@ extern const VMStateInfo vmstate_info_bitmap;
> > >  
> > >  #define SELF_ANNOUNCE_ROUNDS 5
> > >  
> > > +void loadvm_free_handlers(MigrationIncomingState *mis);
> > > +
> > >  int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
> > >                         void *opaque, int version_id);
> > >  void vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
> > > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > > index 74dfad3..6fdcbcd 100644
> > > --- a/include/qemu/typedefs.h
> > > +++ b/include/qemu/typedefs.h
> > > @@ -31,6 +31,7 @@ typedef struct I2CBus I2CBus;
> > >  typedef struct I2SCodec I2SCodec;
> > >  typedef struct ISABus ISABus;
> > >  typedef struct ISADevice ISADevice;
> > > +typedef struct LoadStateEntry LoadStateEntry;
> > >  typedef struct MACAddr MACAddr;
> > >  typedef struct MachineClass MachineClass;
> > >  typedef struct MachineState MachineState;
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 34cd4fe..4592060 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> > >  {
> > >      mis_current = g_malloc0(sizeof(MigrationIncomingState));
> > >      mis_current->file = f;
> > > +    QLIST_INIT(&mis_current->loadvm_handlers);
> > >      qemu_mutex_init(&mis_current->rp_mutex);
> > >  
> > >      return mis_current;
> > > @@ -85,6 +86,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> > >  
> > >  void migration_incoming_state_destroy(void)
> > >  {
> > > +    loadvm_free_handlers(mis_current);
> > 
> > AFAICT this is the only caller of loadvm_free_handlers(), so why not
> > just open-code it here?
> 
> I was keeping this handler list as owned by savevm.c; all its allocation
> and its use are done in savevm.c, so it would seem odd to open-code
> freeing the list in a separate file.
> 
> > >      g_free(mis_current);
> > >      mis_current = NULL;
> > >  }
> > > diff --git a/savevm.c b/savevm.c
> > > index 7084d07..f42713d 100644
> > > --- a/savevm.c
> > > +++ b/savevm.c
> > > @@ -1019,18 +1019,26 @@ static int loadvm_process_command(QEMUFile *f)
> > >      return 0;
> > >  }
> > >  
> > > -typedef struct LoadStateEntry {
> > > +struct LoadStateEntry {
> > 
> > Why remove the typedef?
> 
> Because it is now typedef'd in typedefs.h, and older gcc (RHEL6)
> objects to two typedefs even if they're of the same thing.

Ok, makes sense.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 09/45] Migration commands
  2015-03-10  4:58   ` David Gibson
@ 2015-03-10 11:04     ` Dr. David Alan Gilbert
  2015-03-10 11:06       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-10 11:04 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:32PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Create QEMU_VM_COMMAND section type for sending commands from
> > source to destination.  These commands are not intended to convey
> > guest state but to control the migration process.
> > 
> > For use in postcopy.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> [snip]
> > +/* Send a 'QEMU_VM_COMMAND' type element with the command
> > + * and associated data.
> > + */
> > +void qemu_savevm_command_send(QEMUFile *f,
> > +                              enum qemu_vm_cmd command,
> > +                              uint16_t len,
> > +                              uint8_t *data)
> > +{
> > +    uint32_t tmp = (uint16_t)command;
> 
> Erm.. cast to u16, assign to u32, then send as u16?  What's up with
> that?

Hmm yes, that is insane;  now changed to:

+void qemu_savevm_command_send(QEMUFile *f,
+                              enum qemu_vm_cmd command,
+                              uint16_t len,
+                              uint8_t *data)
+{
+    qemu_put_byte(f, QEMU_VM_COMMAND);
+    qemu_put_be16(f, (uint16_t)command);
+    qemu_put_be16(f, len);
+    if (len) {
+        qemu_put_buffer(f, data, len);
+    }
+    qemu_fflush(f);
+}
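
So the wire format for a command element is: one byte (QEMU_VM_COMMAND),
a be16 command number, a be16 length, then 'len' bytes of data.  The
destination side (loadvm_process_command in this series) pulls it apart
roughly like this - sketch only, with 'buf' sized by whichever command
was sent and the QEMU_VM_COMMAND byte already consumed by the section
loop:

    uint16_t cmd = qemu_get_be16(f);
    uint16_t len = qemu_get_be16(f);
    if (len) {
        qemu_get_buffer(f, buf, len);
    }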

Thanks,

Dave

> 
> > +    qemu_put_byte(f, QEMU_VM_COMMAND);
> > +    qemu_put_be16(f, tmp);
> > +    qemu_put_be16(f, len);
> > +    if (len) {
> > +        qemu_put_buffer(f, data, len);
> > +    }
> > +    qemu_fflush(f);
> > +}
> > +
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 09/45] Migration commands
  2015-03-10 11:04     ` Dr. David Alan Gilbert
@ 2015-03-10 11:06       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-10 11:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Tue, Mar 10, 2015 at 11:04:14AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:32PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Create QEMU_VM_COMMAND section type for sending commands from
> > > source to destination.  These commands are not intended to convey
> > > guest state but to control the migration process.
> > > 
> > > For use in postcopy.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > [snip]
> > > +/* Send a 'QEMU_VM_COMMAND' type element with the command
> > > + * and associated data.
> > > + */
> > > +void qemu_savevm_command_send(QEMUFile *f,
> > > +                              enum qemu_vm_cmd command,
> > > +                              uint16_t len,
> > > +                              uint8_t *data)
> > > +{
> > > +    uint32_t tmp = (uint16_t)command;
> > 
> > Erm.. cast to u16, assign to u32, then send as u16?  What's up with
> > that?
> 
> Hmm yes, that is insane;  now changed to:
> 
> +void qemu_savevm_command_send(QEMUFile *f,
> +                              enum qemu_vm_cmd command,
> +                              uint16_t len,
> +                              uint8_t *data)
> +{
> +    qemu_put_byte(f, QEMU_VM_COMMAND);
> +    qemu_put_be16(f, (uint16_t)command);
> +    qemu_put_be16(f, len);
> +    if (len) {
> +        qemu_put_buffer(f, data, len);
> +    }
> +    qemu_fflush(f);
> +}

That looks better.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-10  2:56   ` David Gibson
@ 2015-03-10 13:35     ` Dr. David Alan Gilbert
  2015-03-11  1:51       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-10 13:35 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:31PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The return path uses a non-blocking fd so as not to block waiting
> > for the (possibly broken) destination to finish returning a message,
> > however we still want outbound data to behave in the same way and block.
> 
> It's not clear to me from this description exactly where the situation
> is that you need to write to the non-blocking socket.  Is it on the
> source or the destination?  If the source, why are you writing to the
> return path?  If the destination, why are you marking the outgoing
> return path as non-blocking?

My understanding is that the semantics of set_nonblock() are to
set non-blocking on all operations on the transport associated with
the fd - and that it's true even if you dup() the fd; and so if you
set non-blocking in one direction you get it in the other direction as well.
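
For reference, a rough standalone sketch of that behaviour (not QEMU
code): O_NONBLOCK lives in the open file description, which dup()'d
descriptors share, so flipping it through either fd affects reads and
writes on both:

    #include <assert.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void nonblock_is_shared(int sockfd)
    {
        int dupfd = dup(sockfd);

        fcntl(sockfd, F_SETFL, fcntl(sockfd, F_GETFL) | O_NONBLOCK);

        /* the flag was set via sockfd but is visible through dupfd too */
        assert(fcntl(dupfd, F_GETFL) & O_NONBLOCK);
        close(dupfd);
    }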

The (existing) destination side sets non-block (see process_incoming_migration
in migration.c), and so it gets non-blocking on the incoming data stream, but
that has the side effect that it's also going to be non-blocking on
the destination's writes to the reverse-path; thus we need to be able
to safely do writes from the destination reverse-path.

The text is out of date; back in v2 the source used non-blocking
for the return path, but we managed to kill that off by using a thread
for the return path in the source.

How about changing the text to:
--------
The destination sets the fd to non-blocking on incoming migrations;
this also affects the return path from the destination, and thus we
need to make sure we can safely write to the return path.

Dave

> 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  migration/qemu-file-unix.c | 41 ++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 36 insertions(+), 5 deletions(-)
> > 
> > diff --git a/migration/qemu-file-unix.c b/migration/qemu-file-unix.c
> > index 50291cf..218dbd0 100644
> > --- a/migration/qemu-file-unix.c
> > +++ b/migration/qemu-file-unix.c
> > @@ -39,12 +39,43 @@ static ssize_t socket_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
> >      QEMUFileSocket *s = opaque;
> >      ssize_t len;
> >      ssize_t size = iov_size(iov, iovcnt);
> > +    ssize_t offset = 0;
> > +    int     err;
> >  
> > -    len = iov_send(s->fd, iov, iovcnt, 0, size);
> > -    if (len < size) {
> > -        len = -socket_error();
> > -    }
> > -    return len;
> > +    while (size > 0) {
> > +        len = iov_send(s->fd, iov, iovcnt, offset, size);
> > +
> > +        if (len > 0) {
> > +            size -= len;
> > +            offset += len;
> > +        }
> > +
> > +        if (size > 0) {
> > +            err = socket_error();
> > +
> > +            if (err != EAGAIN) {
> > +                error_report("socket_writev_buffer: Got err=%d for (%zd/%zd)",
> > +                             err, size, len);
> > +                /*
> > +                 * If I've already sent some but only just got the error, I
> > +                 * could return the amount validly sent so far and wait for the
> > +                 * next call to report the error, but I'd rather flag the error
> > +                 * immediately.
> > +                 */
> > +                return -err;
> > +            }
> > +
> > +            /* Emulate blocking */
> > +            GPollFD pfd;
> > +
> > +            pfd.fd = s->fd;
> > +            pfd.events = G_IO_OUT | G_IO_ERR;
> > +            pfd.revents = 0;
> > +            g_poll(&pfd, 1 /* 1 fd */, -1 /* no timeout */);
> > +        }
> > +     }
> > +
> > +    return offset;
> >  }
> >  
> >  static int socket_get_fd(void *opaque)
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-03-10  5:47   ` David Gibson
@ 2015-03-10 14:34     ` Dr. David Alan Gilbert
  2015-03-11  1:54       ` David Gibson
  2015-03-28 15:34     ` Paolo Bonzini
  1 sibling, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-10 14:34 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:34PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add migrate_send_rp_message to send a message from destination to source along the return path.
> >   (It uses a mutex to let it be called from multiple threads)
> > Add migrate_send_rp_shut to send a 'shut' message to indicate
> >   the destination is finished with the RP.
> > Add migrate_send_rp_ack to send a 'PONG' message in response to a PING
> >   Use it in the CMD_PING handler
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h | 17 ++++++++++++++++
> >  migration/migration.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
> >  savevm.c                      |  2 +-
> >  trace-events                  |  1 +
> >  4 files changed, 64 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index c514dd4..6775747 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -41,6 +41,13 @@ struct MigrationParams {
> >      bool shared;
> >  };
> >  
> > +/* Commands sent on the return path from destination to source*/
> > +enum mig_rpcomm_cmd {
> 
> "command" doesn't seem like quite the right description for these rp
> messages.

Would you prefer 'message' ?

> > +    MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
> > +    MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
> > +    MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> > +};
> > +
> >  typedef struct MigrationState MigrationState;
> >  
> >  /* State for the incoming migration */
> > @@ -48,6 +55,7 @@ struct MigrationIncomingState {
> >      QEMUFile *file;
> >  
> >      QEMUFile *return_path;
> > +    QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> >  };
> >  
> >  MigrationIncomingState *migration_incoming_get_current(void);
> > @@ -169,6 +177,15 @@ int64_t migrate_xbzrle_cache_size(void);
> >  
> >  int64_t xbzrle_cache_resize(int64_t new_size);
> >  
> > +/* Sending on the return path - generic and then for each message type */
> > +void migrate_send_rp_message(MigrationIncomingState *mis,
> > +                             enum mig_rpcomm_cmd cmd,
> > +                             uint16_t len, uint8_t *data);
> > +void migrate_send_rp_shut(MigrationIncomingState *mis,
> > +                          uint32_t value);
> > +void migrate_send_rp_pong(MigrationIncomingState *mis,
> > +                          uint32_t value);
> > +
> >  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
> >  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> >  void ram_control_load_hook(QEMUFile *f, uint64_t flags);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index a36ea65..80d234c 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> >  {
> >      mis_current = g_malloc0(sizeof(MigrationIncomingState));
> >      mis_current->file = f;
> > +    qemu_mutex_init(&mis_current->rp_mutex);
> >  
> >      return mis_current;
> >  }
> > @@ -88,6 +89,50 @@ void migration_incoming_state_destroy(void)
> >      mis_current = NULL;
> >  }
> >  
> > +/*
> > + * Send a message on the return channel back to the source
> > + * of the migration.
> > + */
> > +void migrate_send_rp_message(MigrationIncomingState *mis,
> > +                             enum mig_rpcomm_cmd cmd,
> > +                             uint16_t len, uint8_t *data)
> 
> Using (void *) for data would avoid casts in a bunch of the callers.

Fixed; thanks.

> > +{
> > +    trace_migrate_send_rp_message((int)cmd, len);
> > +    qemu_mutex_lock(&mis->rp_mutex);
> > +    qemu_put_be16(mis->return_path, (unsigned int)cmd);
> > +    qemu_put_be16(mis->return_path, len);
> > +    qemu_put_buffer(mis->return_path, data, len);
> > +    qemu_fflush(mis->return_path);
> > +    qemu_mutex_unlock(&mis->rp_mutex);
> > +}
> > +
> > +/*
> > + * Send a 'SHUT' message on the return channel with the given value
> > + * to indicate that we've finished with the RP.  None-0 value indicates
> > + * error.
> > + */
> > +void migrate_send_rp_shut(MigrationIncomingState *mis,
> > +                          uint32_t value)
> > +{
> > +    uint32_t buf;
> > +
> > +    buf = cpu_to_be32(value);
> > +    migrate_send_rp_message(mis, MIG_RP_CMD_SHUT, 4, (uint8_t *)&buf);
> 
>                                                      ^ sizeof(buf)
> 						     would be safer

Done.

> > +}
> > +
> > +/*
> > + * Send a 'PONG' message on the return channel with the given value
> > + * (normally in response to a 'PING')
> > + */
> > +void migrate_send_rp_pong(MigrationIncomingState *mis,
> > +                          uint32_t value)
> > +{
> > +    uint32_t buf;
> > +
> > +    buf = cpu_to_be32(value);
> > +    migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
> 
> It occurs to me that you could define PONG as returning the whole
> buffer that PING sends, instead of just 4-bytes.  Might allow for some
> more testing of variable sized messages.

Yes; although it would complicate things a lot if I made it fully generic
because I'd have to worry about allocating a buffer etc and I'm not
making vast use of the 4 bytes I've already got.

Dave

> 
> > +}
> > +
> >  void qemu_start_incoming_migration(const char *uri, Error **errp)
> >  {
> >      const char *p;
> > diff --git a/savevm.c b/savevm.c
> > index d082738..7084d07 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -1008,7 +1008,7 @@ static int loadvm_process_command(QEMUFile *f)
> >                           tmp32);
> >              return -1;
> >          }
> > -        /* migrate_send_rp_pong(mis, tmp32); TODO: gets added later */
> > +        migrate_send_rp_pong(mis, tmp32);
> >          break;
> >  
> >      default:
> > diff --git a/trace-events b/trace-events
> > index 99e00b5..4f3eff8 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1379,6 +1379,7 @@ migrate_fd_cleanup(void) ""
> >  migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> >  migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
> > +migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> >  migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
> >  
> >  # migration/rdma.c
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name Dr. David Alan Gilbert (git)
@ 2015-03-10 15:30   ` Eric Blake
  2015-03-10 16:21     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Eric Blake @ 2015-03-10 15:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, pbonzini, yanghy, david

On 02/25/2015 09:51 AM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> check the return value of the function it calls and error if it's non-0
> Fixup qemu_rdma_init_one_block that is the only current caller,
>   and __qemu_rdma_add_block the only function it calls using it.

Should we also be changing the name of __qemu_rdma_add_block to get rid
of the wrong use of the reserved leading __ namespace?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name
  2015-03-10 15:30   ` Eric Blake
@ 2015-03-10 16:21     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-10 16:21 UTC (permalink / raw)
  To: Eric Blake
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini,
	yanghy, david

* Eric Blake (eblake@redhat.com) wrote:
> On 02/25/2015 09:51 AM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > check the return value of the function it calls and error if it's non-0
> > Fixup qemu_rdma_init_one_block that is the only current caller,
> >   and __qemu_rdma_add_block the only function it calls using it.
> 
> Should we also be changing the name of __qemu_rdma_add_block to get rid
> of the wrong use of the reserved leading __ namespace?

I already did that in ba795761857082a9bb8c4be1f28e7cdd82f039b0, which
got merged after I cut this patch.

Dave

> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-10 13:35     ` Dr. David Alan Gilbert
@ 2015-03-11  1:51       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-11  1:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Tue, Mar 10, 2015 at 01:35:58PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:31PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The return path uses a non-blocking fd so as not to block waiting
> > > for the (possibly broken) destination to finish returning a message,
> > > however we still want outbound data to behave in the same way and block.
> > 
> > It's not clear to me from this description exactly where the situation
> > is that you need to write to the non-blocking socket.  Is it on the
> > source or the destination?  If the source, why are you writing to the
> > return path?  If the destination, why are you marking the outgoing
> > return path as non-blocking?
> 
> My understanding is that the semantics of set_nonblock() are to
> set non-blocking on all operations on the transport associated with
> the fd - and that it's true even if you dup() the fd; and so if you
> set non-blocking in one direction you get it in the other direction as well.

Ah.. yes, I think you're right.

> The (existing) destination side sets non-block (see process_incoming_migration
> in migration.c), and so it gets non-blocking on the incoming data stream, but
> that has the side effect that it's also going to be non-blocking on
> the destination's writes to the reverse-path; thus we need to be able
> to safely do writes from the destination reverse-path.
> 
> The text is out of date; back in v2 the source used non-blocking
> for the return path, but we managed to kill that off by using a thread
> for the return path in the source.
> 
> How about changing the text to:
> --------
> The destination sets the fd to non-blocking on incoming migrations;
> this also affects the return path from the destination, and thus we
> need to make sure we can safely write to the return path.

Yes, I think that makes it clearer.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-03-10 14:34     ` Dr. David Alan Gilbert
@ 2015-03-11  1:54       ` David Gibson
  2015-03-25 18:47         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-11  1:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Tue, Mar 10, 2015 at 02:34:03PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:34PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Add migrate_send_rp_message to send a message from destination to source along the return path.
> > >   (It uses a mutex to let it be called from multiple threads)
> > > Add migrate_send_rp_shut to send a 'shut' message to indicate
> > >   the destination is finished with the RP.
> > > Add migrate_send_rp_ack to send a 'PONG' message in response to a PING
> > >   Use it in the CMD_PING handler
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/migration.h | 17 ++++++++++++++++
> > >  migration/migration.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
> > >  savevm.c                      |  2 +-
> > >  trace-events                  |  1 +
> > >  4 files changed, 64 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index c514dd4..6775747 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -41,6 +41,13 @@ struct MigrationParams {
> > >      bool shared;
> > >  };
> > >  
> > > +/* Commands sent on the return path from destination to source*/
> > > +enum mig_rpcomm_cmd {
> > 
> > "command" doesn't seem like quite the right description for these rp
> > messages.
> 
> Would you prefer 'message' ?

Perhaps "message type" to distinguish from the the blob including both
tag and data.
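
Something along the lines of (just a naming sketch, not a patch):

    /* Messages sent on the return path from destination to source */
    enum mig_rp_message_type {
        MIG_RP_MSG_INVALID = 0,  /* Must be 0 */
        MIG_RP_MSG_SHUT,         /* sibling will not send any more RP messages */
        MIG_RP_MSG_PONG,         /* response to a PING; data (seq: be32) */
    };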

> > > +    MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
> > > +    MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
> > > +    MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> > > +};
> > > +
> > >  typedef struct MigrationState MigrationState;
> > >  
> > >  /* State for the incoming migration */
> > > @@ -48,6 +55,7 @@ struct MigrationIncomingState {
> > >      QEMUFile *file;
> > >  
> > >      QEMUFile *return_path;
> > > +    QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> > >  };
> > >  
> > >  MigrationIncomingState *migration_incoming_get_current(void);
> > > @@ -169,6 +177,15 @@ int64_t migrate_xbzrle_cache_size(void);
> > >  
> > >  int64_t xbzrle_cache_resize(int64_t new_size);
> > >  
> > > +/* Sending on the return path - generic and then for each message type */
> > > +void migrate_send_rp_message(MigrationIncomingState *mis,
> > > +                             enum mig_rpcomm_cmd cmd,
> > > +                             uint16_t len, uint8_t *data);
> > > +void migrate_send_rp_shut(MigrationIncomingState *mis,
> > > +                          uint32_t value);
> > > +void migrate_send_rp_pong(MigrationIncomingState *mis,
> > > +                          uint32_t value);
> > > +
> > >  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
> > >  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> > >  void ram_control_load_hook(QEMUFile *f, uint64_t flags);
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index a36ea65..80d234c 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -78,6 +78,7 @@ MigrationIncomingState *migration_incoming_state_new(QEMUFile* f)
> > >  {
> > >      mis_current = g_malloc0(sizeof(MigrationIncomingState));
> > >      mis_current->file = f;
> > > +    qemu_mutex_init(&mis_current->rp_mutex);
> > >  
> > >      return mis_current;
> > >  }
> > > @@ -88,6 +89,50 @@ void migration_incoming_state_destroy(void)
> > >      mis_current = NULL;
> > >  }
> > >  
> > > +/*
> > > + * Send a message on the return channel back to the source
> > > + * of the migration.
> > > + */
> > > +void migrate_send_rp_message(MigrationIncomingState *mis,
> > > +                             enum mig_rpcomm_cmd cmd,
> > > +                             uint16_t len, uint8_t *data)
> > 
> > Using (void *) for data would avoid casts in a bunch of the callers.
> 
> Fixed; thanks.
> 
> > > +{
> > > +    trace_migrate_send_rp_message((int)cmd, len);
> > > +    qemu_mutex_lock(&mis->rp_mutex);
> > > +    qemu_put_be16(mis->return_path, (unsigned int)cmd);
> > > +    qemu_put_be16(mis->return_path, len);
> > > +    qemu_put_buffer(mis->return_path, data, len);
> > > +    qemu_fflush(mis->return_path);
> > > +    qemu_mutex_unlock(&mis->rp_mutex);
> > > +}
> > > +
> > > +/*
> > > + * Send a 'SHUT' message on the return channel with the given value
> > > + * to indicate that we've finished with the RP.  None-0 value indicates
> > > + * error.
> > > + */
> > > +void migrate_send_rp_shut(MigrationIncomingState *mis,
> > > +                          uint32_t value)
> > > +{
> > > +    uint32_t buf;
> > > +
> > > +    buf = cpu_to_be32(value);
> > > +    migrate_send_rp_message(mis, MIG_RP_CMD_SHUT, 4, (uint8_t *)&buf);
> > 
> >                                                      ^ sizeof(buf)
> > 						     would be safer
> 
> Done.
> 
> > > +}
> > > +
> > > +/*
> > > + * Send a 'PONG' message on the return channel with the given value
> > > + * (normally in response to a 'PING')
> > > + */
> > > +void migrate_send_rp_pong(MigrationIncomingState *mis,
> > > +                          uint32_t value)
> > > +{
> > > +    uint32_t buf;
> > > +
> > > +    buf = cpu_to_be32(value);
> > > +    migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
> > 
> > It occurs to me that you could define PONG as returning the whole
> > buffer that PING sends, instead of just 4-bytes.  Might allow for some
> > more testing of variable sized messages.
> 
> Yes; although it would complicate things a lot if I made it fully generic
> because I'd have to worry about allocating a buffer etc and I'm not
> making vast use of the 4 bytes I've already got.

Couldn't migrate_send_rp_pong just take a buf pointer and length, then
you can point that directly at the buffer in the ping message you've
received?
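
Something like this untested sketch - the PING handler already has the
payload in hand, so there is no allocation to worry about:

    void migrate_send_rp_pong(MigrationIncomingState *mis,
                              const uint8_t *buf, uint16_t len)
    {
        /* echo the PING payload straight back */
        migrate_send_rp_message(mis, MIG_RP_CMD_PONG, len, (uint8_t *)buf);
    }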

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops Dr. David Alan Gilbert (git)
@ 2015-03-12  6:11   ` David Gibson
  2015-04-14 12:04     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-12  6:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

On Wed, Feb 25, 2015 at 04:51:38PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Postcopy needs to have two migration streams loading concurrently;
> one from memory (with the device state) and the other from the fd
> with the memory transactions.
> 
> Split the core of qemu_loadvm_state out so we can use it for both.
> 
> Allow the inner loadvm loop to quit and signal whether the parent
> should.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  savevm.c     | 106 ++++++++++++++++++++++++++++++++++++-----------------------
>  trace-events |   4 +++
>  2 files changed, 69 insertions(+), 41 deletions(-)
> 
> diff --git a/savevm.c b/savevm.c
> index f42713d..4b619da 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -951,6 +951,16 @@ static SaveStateEntry *find_se(const char *idstr, int instance_id)
>      return NULL;
>  }
>  
> +/* ORable flags that control the (potentially nested) loadvm_state loops */
> +enum LoadVMExitCodes {
> +    /* Quit the loop level that received this command */
> +    LOADVM_QUIT_LOOP     =  1,
> +    /* Quit this loop and our parent */
> +    LOADVM_QUIT_PARENT   =  2,
> +};

The semantics of all the exit code stuff are doing my head in; I'm not
sure how to make it more comprehensible.

> +static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> +
>  static int loadvm_process_command_simple_lencheck(const char *name,
>                                                    unsigned int actual,
>                                                    unsigned int expected)
> @@ -967,6 +977,8 @@ static int loadvm_process_command_simple_lencheck(const char *name,
>  /*
>   * Process an incoming 'QEMU_VM_COMMAND'
>   * negative return on error (will issue error message)
> + * 0   just a normal return
> + * 1   All good, but exit the loop

This should probably also mention the possibility of negative returns
for errors.

Am I correct in thinking that at this point the function never returns
1?  I'm assuming later patches in the series change that.

Maybe I'm missing something in my mental model here, but tying the
duration of the containing loop to execution of specific commands
seems problematic.  What's the circumstance in which it makes sense
for a command to indicate that the rest of the packaged data should be
essentially ignored?

>   */
>  static int loadvm_process_command(QEMUFile *f)
>  {
> @@ -1036,36 +1048,13 @@ void loadvm_free_handlers(MigrationIncomingState *mis)
>      }
>  }
>  
> -int qemu_loadvm_state(QEMUFile *f)
> +static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
>  {
> -    MigrationIncomingState *mis = migration_incoming_get_current();
> -    Error *local_err = NULL;
>      uint8_t section_type;
> -    unsigned int v;
>      int ret;
> +    int exitcode = 0;
>  
> -    if (qemu_savevm_state_blocked(&local_err)) {
> -        error_report("%s", error_get_pretty(local_err));
> -        error_free(local_err);
> -        return -EINVAL;
> -    }
> -
> -    v = qemu_get_be32(f);
> -    if (v != QEMU_VM_FILE_MAGIC) {
> -        error_report("Not a migration stream");
> -        return -EINVAL;
> -    }
> -
> -    v = qemu_get_be32(f);
> -    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
> -        error_report("SaveVM v2 format is obsolete and don't work anymore");
> -        return -ENOTSUP;
> -    }
> -    if (v != QEMU_VM_FILE_VERSION) {
> -        error_report("Unsupported migration stream version");
> -        return -ENOTSUP;
> -    }
> -
> +    trace_qemu_loadvm_state_main();
>      while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
>          uint32_t instance_id, version_id, section_id;
>          SaveStateEntry *se;
> @@ -1093,16 +1082,14 @@ int qemu_loadvm_state(QEMUFile *f)
>              if (se == NULL) {
>                  error_report("Unknown savevm section or instance '%s' %d",
>                               idstr, instance_id);
> -                ret = -EINVAL;
> -                goto out;
> +                return -EINVAL;
>              }
>  
>              /* Validate version */
>              if (version_id > se->version_id) {
>                  error_report("savevm: unsupported version %d for '%s' v%d",
>                               version_id, idstr, se->version_id);
> -                ret = -EINVAL;
> -                goto out;
> +                return -EINVAL;
>              }
>  
>              /* Add entry */
> @@ -1117,7 +1104,7 @@ int qemu_loadvm_state(QEMUFile *f)
>              if (ret < 0) {
>                  error_report("error while loading state for instance 0x%x of"
>                               " device '%s'", instance_id, idstr);
> -                goto out;
> +                return ret;
>              }
>              break;
>          case QEMU_VM_SECTION_PART:
> @@ -1132,36 +1119,73 @@ int qemu_loadvm_state(QEMUFile *f)
>              }
>              if (le == NULL) {
>                  error_report("Unknown savevm section %d", section_id);
> -                ret = -EINVAL;
> -                goto out;
> +                return -EINVAL;
>              }
>  
>              ret = vmstate_load(f, le->se, le->version_id);
>              if (ret < 0) {
>                  error_report("error while loading state section id %d(%s)",
>                               section_id, le->se->idstr);
> -                goto out;
> +                return ret;
>              }
>              break;
>          case QEMU_VM_COMMAND:
>              ret = loadvm_process_command(f);
> -            if (ret < 0) {
> -                goto out;
> +            trace_qemu_loadvm_state_section_command(ret);
> +            if ((ret < 0) || (ret & LOADVM_QUIT_LOOP)) {
> +                return ret;
>              }
> +            exitcode |= ret; /* Lets us pass flags up to the parent */
>              break;
>          default:
>              error_report("Unknown savevm section type %d", section_type);
> -            ret = -EINVAL;
> -            goto out;
> +            return -EINVAL;
>          }
>      }
>  
> -    cpu_synchronize_all_post_init();
> +    if (exitcode & LOADVM_QUIT_PARENT) {
> +        trace_qemu_loadvm_state_main_quit_parent();
> +        exitcode &= ~LOADVM_QUIT_PARENT;
> +        exitcode |= LOADVM_QUIT_LOOP;
> +    }

So, if I'm following properly, putting a QUIT_PARENT will cause this
loop to exit, also returning QUIT_LOOP, so the next loop out also
quits.  If there were a third loop beyond that, it wouldn't quit.

But are those really the semantics you want, or do you want the
options to be "quit one level" and "quit all levels"?  That seems a
little bit simpler.  In the current plans you only have the two levels,
so they're equivalent.
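
A rough illustration of what I mean, with invented flag names (untested,
just to show the shape of the post-loop fixup):

    if (exitcode & LOADVM_QUIT_ALL) {
        /* Leave the flag set so every enclosing loop exits as well */
        return exitcode;
    }
    /* LOADVM_QUIT_ONE only terminates this loop; don't propagate it */
    return exitcode & ~LOADVM_QUIT_ONE;
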
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram.
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram Dr. David Alan Gilbert (git)
@ 2015-03-12  6:14   ` David Gibson
  2015-03-13 12:58     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-12  6:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:39PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

This absolutely needs a commit message.  I shouldn't have to look at
the code to find out what the presence of this capability asserts, and
from where to where it's communicating that information.

> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  include/migration/migration.h | 1 +
>  migration/migration.c         | 9 +++++++++
>  qapi-schema.json              | 7 ++++++-
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 751caa0..f94af5b 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -177,6 +177,7 @@ void migrate_add_blocker(Error *reason);
>   */
>  void migrate_del_blocker(Error *reason);
>  
> +bool migrate_postcopy_ram(void);
>  bool migrate_rdma_pin_all(void);
>  bool migrate_zero_blocks(void);
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 4592060..434864a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -663,6 +663,15 @@ bool migrate_rdma_pin_all(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
>  }
>  
> +bool migrate_postcopy_ram(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_X_POSTCOPY_RAM];

As an aside, I'm assuming you'll get rid of these "x-" prefixes
before you post a series intended for final inclusion?

> +}
> +
>  bool migrate_auto_converge(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index e16f8eb..a8af1cb 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -494,10 +494,15 @@
>  # @auto-converge: If enabled, QEMU will automatically throttle down the guest
>  #          to speed up convergence of RAM migration. (since 1.6)
>  #
> +# @x-postcopy-ram: Start executing on the migration target before all of RAM has
> +#          been migrated, pulling the remaining pages along as needed. NOTE: If
> +#          the migration fails during postcopy the VM will fail.  (since 2.3)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks'] }
> +  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> +           'x-postcopy-ram'] }
>  
>  ##
>  # @MigrationCapabilityStatus

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages Dr. David Alan Gilbert (git)
@ 2015-03-12  9:30   ` David Gibson
  2015-03-26 16:33     ` Dr. David Alan Gilbert
  2015-03-28 15:43   ` Paolo Bonzini
  1 sibling, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-12  9:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:40PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The state of the postcopy process is managed via a series of messages;
>    * Add wrappers and handlers for sending/receiving these messages
>    * Add state variable that track the current state of postcopy
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |  15 ++
>  include/sysemu/sysemu.h       |  23 +++
>  migration/migration.c         |  13 ++
>  savevm.c                      | 325 ++++++++++++++++++++++++++++++++++++++++++
>  trace-events                  |  11 ++
>  5 files changed, 387 insertions(+)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index f94af5b..81cd1f2 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -52,6 +52,14 @@ typedef struct MigrationState MigrationState;
>  
>  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
>  
> +typedef enum {
> +    POSTCOPY_INCOMING_NONE = 0,  /* Initial state - no postcopy */
> +    POSTCOPY_INCOMING_ADVISE,
> +    POSTCOPY_INCOMING_LISTENING,
> +    POSTCOPY_INCOMING_RUNNING,
> +    POSTCOPY_INCOMING_END
> +} PostcopyState;
> +
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *file;
> @@ -59,6 +67,8 @@ struct MigrationIncomingState {
>      /* See savevm.c */
>      LoadStateEntry_Head loadvm_handlers;
>  
> +    PostcopyState postcopy_state;
> +
>      QEMUFile *return_path;
>      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
>  };
> @@ -219,4 +229,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>                               ram_addr_t offset, size_t size,
>                               int *bytes_sent);
>  
> +PostcopyState postcopy_state_get(MigrationIncomingState *mis);
> +
> +/* Set the state and return the old state */
> +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> +                                 PostcopyState new_state);
>  #endif
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 8da879f..d6a6d51 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -87,6 +87,18 @@ enum qemu_vm_cmd {
>      MIG_CMD_INVALID = 0,       /* Must be 0 */
>      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
>      MIG_CMD_PING,              /* Request a PONG on the RP */
> +
> +    MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
> +                                      warn we might want to do PC */
> +    MIG_CMD_POSTCOPY_LISTEN,       /* Start listening for incoming
> +                                      pages as it's running. */
> +    MIG_CMD_POSTCOPY_RUN,          /* Start execution */
> +    MIG_CMD_POSTCOPY_END,          /* Postcopy is finished. */
> +
> +    MIG_CMD_POSTCOPY_RAM_DISCARD,  /* A list of pages to discard that
> +                                      were previously sent during
> +                                      precopy but are dirty. */
> +
>  };
>  
>  bool qemu_savevm_state_blocked(Error **errp);
> @@ -101,6 +113,17 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
>                                uint16_t len, uint8_t *data);
>  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
>  void qemu_savevm_send_open_return_path(QEMUFile *f);
> +void qemu_savevm_send_postcopy_advise(QEMUFile *f);
> +void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> +void qemu_savevm_send_postcopy_run(QEMUFile *f);
> +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status);
> +
> +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> +                                           uint16_t len, uint8_t offset,
> +                                           uint64_t *addrlist,
> +                                           uint32_t *masklist);
> +
> +
>  int qemu_loadvm_state(QEMUFile *f);
>  
>  /* SLIRP */
> diff --git a/migration/migration.c b/migration/migration.c
> index 434864a..957115a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -971,3 +971,16 @@ void migrate_fd_connect(MigrationState *s)
>      qemu_thread_create(&s->thread, "migration", migration_thread, s,
>                         QEMU_THREAD_JOINABLE);
>  }
> +
> +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> +{
> +    return atomic_fetch_add(&mis->postcopy_state, 0);
> +}
> +
> +/* Set the state and return the old state */
> +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> +                                 PostcopyState new_state)
> +{
> +    return atomic_xchg(&mis->postcopy_state, new_state);

Is there anything explaining what the overall atomicity requirements
are for this state variable?  It's a bit hard to tell if an atomic
xchg is necessary or sufficient without a description of what the
overall concurrency scheme is with regards to this variable.

> +}
> +
> diff --git a/savevm.c b/savevm.c
> index 4b619da..e31ccb0 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -39,6 +39,7 @@
>  #include "exec/memory.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "qemu/bitops.h"
>  #include "qemu/iov.h"
>  #include "block/snapshot.h"
>  #include "block/qapi.h"
> @@ -635,6 +636,90 @@ void qemu_savevm_send_open_return_path(QEMUFile *f)
>      qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
>  }
>  
> +/* Send prior to any postcopy transfer */
> +void qemu_savevm_send_postcopy_advise(QEMUFile *f)
> +{
> +    uint64_t tmp[2];
> +    tmp[0] = cpu_to_be64(getpagesize());
> +    tmp[1] = cpu_to_be64(1ul << qemu_target_page_bits());

I wonder if using a structure for the tmp buffer might be an idea, as
a form of documentation of the data expected in the ADVISE command.
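
Something like this, say (an untested sketch; the struct and field names
are just made up for illustration):

    struct MigCmdPostcopyAdvise {
        uint64_t host_page_size;    /* be64 on the wire */
        uint64_t target_page_size;  /* be64 on the wire */
    } tmp;

    tmp.host_page_size = cpu_to_be64(getpagesize());
    tmp.target_page_size = cpu_to_be64(1ul << qemu_target_page_bits());
    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_ADVISE, sizeof(tmp),
                             (uint8_t *)&tmp);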

> +    trace_qemu_savevm_send_postcopy_advise();
> +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_ADVISE, 16, (uint8_t *)tmp);
> +}
> +
> +/* Prior to running, to cause pages that have been dirtied after precopy
> + * started to be discarded on the destination.
> + * CMD_POSTCOPY_RAM_DISCARD consist of:
> + *  3 byte header (filled in by qemu_savevm_send_postcopy_ram_discard)
> + *      byte   version (0)
> + *      byte   offset to be subtracted from each page address to deal with
> + *             RAMBlocks that don't start on a mask word boundary.

I think this needs more explanation.  Why can't this be folded into
the page addresses sent over the wire?

> + *      byte   Length of name field
> + *  n x byte   RAM block name (NOT 0 terminated)

I think \0 terminating would probably be safer, even if it's
technically redundant.

> + *  n x
> + *      be64   Page addresses for start of an invalidation range
> + *      be32   mask of 32 pages, '1' to discard'

Is the extra compactness from this semi-sparse bitmap encoding
actually worth it?  A simple list of page addresses, or address ranges
to discard would be substantially simpler to get one's head around,
and also seems like it might be more robust against future
implementation changes as a wire format.
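
For instance, the sender side of a flat range encoding could be as simple
as this (sketch only, not an existing API):

    /* One (start, length) pair per contiguous run of pages to discard */
    static void send_discard_range(QEMUFile *f, uint64_t start, uint64_t len)
    {
        qemu_put_be64(f, start);
        qemu_put_be64(f, len);
    }

and the receiver would just loop reading pairs until the command length is
used up, with no mask or offset handling at all.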

> + *  Hopefully this is pretty sparse so we don't get too many entries,
> + *  and using the mask should deal with most pagesize differences
> + *  just ending up as a single full mask
> + *
> + * The mask is always 32bits irrespective of the long size
> + *
> + *  name:  RAMBlock name that these entries are part of
> + *  len: Number of page entries
> + *  addrlist: 'len' addresses
> + *  masklist: 'len' masks (corresponding to the addresses)
> + */
> +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> +                                           uint16_t len, uint8_t offset,
> +                                           uint64_t *addrlist,
> +                                           uint32_t *masklist)
> +{
> +    uint8_t *buf;
> +    uint16_t tmplen;
> +    uint16_t t;
> +
> +    trace_qemu_savevm_send_postcopy_ram_discard();
> +    buf = g_malloc0(len*12 + strlen(name) + 3);
> +    buf[0] = 0; /* Version */
> +    buf[1] = offset;
> +    assert(strlen(name) < 256);
> +    buf[2] = strlen(name);
> +    memcpy(buf+3, name, strlen(name));
> +    tmplen = 3+strlen(name);

Repeated calls to strlen() always seem icky to me, although I guess
it's all gcc builtins here, so they are probably optimized out by
CSE.
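
e.g. (untested):

    size_t namelen = strlen(name);

    assert(namelen < 256);
    buf = g_malloc0(len * 12 + namelen + 3);
    buf[0] = 0; /* Version */
    buf[1] = offset;
    buf[2] = namelen;
    memcpy(buf + 3, name, namelen);
    tmplen = 3 + namelen;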

> +    for (t = 0; t < len; t++) {
> +        cpu_to_be64w((uint64_t *)(buf + tmplen), addrlist[t]);
> +        tmplen += 8;
> +        cpu_to_be32w((uint32_t *)(buf + tmplen), masklist[t]);
> +        tmplen += 4;
> +    }
> +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RAM_DISCARD, tmplen, buf);
> +    g_free(buf);
> +}
> +
> +/* Get the destination into a state where it can receive postcopy data. */
> +void qemu_savevm_send_postcopy_listen(QEMUFile *f)
> +{
> +    trace_savevm_send_postcopy_listen();
> +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_LISTEN, 0, NULL);
> +}
> +
> +/* Kick the destination into running */
> +void qemu_savevm_send_postcopy_run(QEMUFile *f)
> +{
> +    trace_savevm_send_postcopy_run();
> +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RUN, 0, NULL);
> +}

DISCARD will typically immediately precede LISTEN, won't it?  Is there
a reason not to put the discard data into the LISTEN command?

> +
> +/* End of postcopy - with a status byte; 0 is good, anything else is a fail */
> +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status)
> +{
> +    trace_savevm_send_postcopy_end();
> +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_END, 1, &status);
> +}

What's the distinction between the postcopy END command and the normal
end of the migration stream?  Is there already a way to detect the end
of stream normally?

>  bool qemu_savevm_state_blocked(Error **errp)
>  {
>      SaveStateEntry *se;
> @@ -961,6 +1046,212 @@ enum LoadVMExitCodes {
>  
>  static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>  
> +/* ------ incoming postcopy messages ------ */
> +/* 'advise' arrives before any transfers just to tell us that a postcopy
> + * *might* happen - it might be skipped if precopy transferred everything
> + * quickly.
> + */
> +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> +                                         uint64_t remote_hps,
> +                                         uint64_t remote_tps)
> +{
> +    PostcopyState ps = postcopy_state_get(mis);
> +    trace_loadvm_postcopy_handle_advise();
> +    if (ps != POSTCOPY_INCOMING_NONE) {
> +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
> +        return -1;
> +    }
> +
> +    if (remote_hps != getpagesize())  {
> +        /*
> +         * Some combinations of mismatch are probably possible but it gets
> +         * a bit more complicated.  In particular we need to place whole
> +         * host pages on the dest at once, and we need to ensure that we
> +         * handle dirtying to make sure we never end up sending part of
> +         * a hostpage on it's own.
> +         */
> +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
> +                     (int)remote_hps, getpagesize());
> +        return -1;
> +    }
> +
> +    if (remote_tps != (1ul << qemu_target_page_bits())) {
> +        /*
> +         * Again, some differences could be dealt with, but for now keep it
> +         * simple.
> +         */
> +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
> +                     (int)remote_tps, 1 << qemu_target_page_bits());
> +        return -1;
> +    }
> +
> +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);

Should you be checking the return value here to make sure it's still
POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if you still have
a race between the fetch at the top and the set here.

Or, in fact, should you just do an atomic exchange-then-check at
the top, rather than checking at the top and then changing at the bottom?
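
i.e. something along these lines (untested; the state has already been
advanced if the check then fails, but the later LISTEN/RUN handlers behave
that way already):

    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);

    if (ps != POSTCOPY_INCOMING_NONE) {
        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
        return -1;
    }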

> +    return 0;
> +}
> +
> +/* After postcopy we will be told to throw some pages away since they're
> + * dirty and will have to be demand fetched.  Must happen before CPU is
> + * started.
> + * There can be 0..many of these messages, each encoding multiple pages.
> + * Bits set in the message represent a page in the source VMs bitmap, but
> + * since the guest/target page sizes can be different on s/d then we have
> + * to convert.

Uh.. I thought the checks in the ADVISE processing eliminated that possibility.

> + */
> +static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
> +                                              uint16_t len)
> +{
> +    int tmp;
> +    unsigned int first_bit_offset;
> +    char ramid[256];
> +    PostcopyState ps = postcopy_state_get(mis);
> +
> +    trace_loadvm_postcopy_ram_handle_discard();
> +
> +    if (ps != POSTCOPY_INCOMING_ADVISE) {

Could you theoretically also get these in LISTEN state?  I realise the
current implementation doesn't do that.
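
If so, something like this would cover both (sketch):

    if (ps != POSTCOPY_INCOMING_ADVISE && ps != POSTCOPY_INCOMING_LISTENING) {
        error_report("CMD_POSTCOPY_RAM_DISCARD in wrong postcopy state (%d)",
                     ps);
        return -1;
    }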

> +        error_report("CMD_POSTCOPY_RAM_DISCARD in wrong postcopy state (%d)",
> +                     ps);
> +        return -1;
> +    }
> +    /* We're expecting a
> +     *    3 byte header,
> +     *    a RAM ID string
> +     *    then at least 1 12 byte chunks
> +    */
> +    if (len < 16) {
> +        error_report("CMD_POSTCOPY_RAM_DISCARD invalid length (%d)", len);
> +        return -1;
> +    }
> +
> +    tmp = qemu_get_byte(mis->file);
> +    if (tmp != 0) {
> +        error_report("CMD_POSTCOPY_RAM_DISCARD invalid version (%d)", tmp);
> +        return -1;
> +    }
> +    first_bit_offset = qemu_get_byte(mis->file);
> +
> +    if (qemu_get_counted_string(mis->file, ramid)) {
> +        error_report("CMD_POSTCOPY_RAM_DISCARD Failed to read RAMBlock ID");
> +        return -1;
> +    }
> +
> +    len -= 3+strlen(ramid);
> +    if (len % 12) {
> +        error_report("CMD_POSTCOPY_RAM_DISCARD invalid length (%d)", len);
> +        return -1;
> +    }
> +    while (len) {
> +        uint64_t startaddr;
> +        uint32_t mask;
> +        /*
> +         * We now have pairs of address, mask
> +         *   Each word of mask is 32 bits, where each bit corresponds to one
> +         *   target page.
> +         *   RAMBlocks don't necessarily start on word boundaries,
> +         *   and the offset in the header indicates the offset into the 1st
> +         *   mask word that corresponds to the 1st page of the RAMBlock.
> +         */
> +        startaddr = qemu_get_be64(mis->file);
> +        mask = qemu_get_be32(mis->file);
> +
> +        len -= 12;
> +
> +        while (mask) {
> +            /* mask= .....?10...0 */
> +            /*             ^fs    */
> +            int firstset = ctz32(mask);
> +
> +            /* tmp32=.....?11...1 */
> +            /*             ^fs    */
> +            uint32_t tmp32 = mask | ((((uint32_t)1)<<firstset)-1);
> +
> +            /* mask= .?01..10...0 */
> +            /*         ^fz ^fs    */
> +            int firstzero = cto32(tmp32);
> +
> +            if ((startaddr == 0) && (firstset < first_bit_offset)) {
> +                error_report("CMD_POSTCOPY_RAM_DISCARD bad data; bit set"
> +                               " prior to block; block=%s offset=%d"
> +                               " firstset=%d\n", ramid, first_bit_offset,
> +                               firstzero);
> +                return -1;
> +            }
> +
> +            /*
> +             * we know there must be at least 1 bit set due to the loop entry
> +             * If there is no 0 firstzero will be 32
> +             */
> +            /* TODO - ram_discard_range gets added in a later patch
> +            int ret = ram_discard_range(mis, ramid,
> +                                startaddr + firstset - first_bit_offset,
> +                                startaddr + (firstzero - 1) - first_bit_offset);
> +            ret = -1;
> +            if (ret) {
> +                return ret;
> +            }
> +            */
> +
> +            /* mask= .?0000000000 */
> +            /*         ^fz ^fs    */
> +            if (firstzero != 32) {
> +                mask &= (((uint32_t)-1) << firstzero);
> +            } else {
> +                mask = 0;
> +            }
> +        }

Ugh.  Again I ask, are you really sure the semi-sparse wire
representation is worth all this hassle?

> +    }
> +    trace_loadvm_postcopy_ram_handle_discard_end();
> +
> +    return 0;
> +}
> +
> +/* After this message we must be able to immediately receive postcopy data */

This doesn't quite make sense to me.  AFAICT, this is executed on the
destination side, so this isn't so much a command to start listening
as an assertion that the source side is already listening and ready to
receive page requests on the return path.

> +static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> +{
> +    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_LISTENING);
> +    trace_loadvm_postcopy_handle_listen();
> +    if (ps != POSTCOPY_INCOMING_ADVISE) {
> +        error_report("CMD_POSTCOPY_LISTEN in wrong postcopy state (%d)", ps);
> +        return -1;
> +    }
> +
> +    /* TODO start up the postcopy listening thread */
> +    return 0;
> +}
> +
> +/* After all discards we can start running and asking for pages */
> +static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
> +{
> +    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_RUNNING);
> +    trace_loadvm_postcopy_handle_run();
> +    if (ps != POSTCOPY_INCOMING_LISTENING) {
> +        error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps);
> +        return -1;
> +    }
> +
> +    if (autostart) {
> +        /* Hold onto your hats, starting the CPU */
> +        vm_start();
> +    } else {
> +        /* leave it paused and let management decide when to start the CPU */
> +        runstate_set(RUN_STATE_PAUSED);
> +    }
> +
> +    return 0;
> +}
> +
> +/* The end - with a byte from the source which can tell us to fail. */
> +static int loadvm_postcopy_handle_end(MigrationIncomingState *mis)
> +{
> +    PostcopyState ps = postcopy_state_get(mis);
> +    trace_loadvm_postcopy_handle_end();
> +    if (ps == POSTCOPY_INCOMING_NONE) {
> +        error_report("CMD_POSTCOPY_END in wrong postcopy state (%d)", ps);
> +        return -1;
> +    }
> +    return -1; /* TODO - expecting 1 byte good/fail */
> +}
> +
>  static int loadvm_process_command_simple_lencheck(const char *name,
>                                                    unsigned int actual,
>                                                    unsigned int expected)
> @@ -986,6 +1277,7 @@ static int loadvm_process_command(QEMUFile *f)
>      uint16_t com;
>      uint16_t len;
>      uint32_t tmp32;
> +    uint64_t tmp64a, tmp64b;
>  
>      com = qemu_get_be16(f);
>      len = qemu_get_be16(f);
> @@ -1023,6 +1315,39 @@ static int loadvm_process_command(QEMUFile *f)
>          migrate_send_rp_pong(mis, tmp32);
>          break;
>  
> +    case MIG_CMD_POSTCOPY_ADVISE:
> +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_ADVISE",
> +                                                   len, 16)) {
> +            return -1;
> +        }
> +        tmp64a = qemu_get_be64(f); /* hps */
> +        tmp64b = qemu_get_be64(f); /* tps */
> +        return loadvm_postcopy_handle_advise(mis, tmp64a, tmp64b);
> +
> +    case MIG_CMD_POSTCOPY_LISTEN:
> +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_LISTEN",
> +                                                   len, 0)) {
> +            return -1;
> +        }
> +        return loadvm_postcopy_handle_listen(mis);
> +
> +    case MIG_CMD_POSTCOPY_RUN:
> +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_RUN",
> +                                                   len, 0)) {
> +            return -1;
> +        }
> +        return loadvm_postcopy_handle_run(mis);
> +
> +    case MIG_CMD_POSTCOPY_END:
> +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_END",
> +                                                   len, 1)) {
> +            return -1;
> +        }
> +        return loadvm_postcopy_handle_end(mis);
> +
> +    case MIG_CMD_POSTCOPY_RAM_DISCARD:
> +        return loadvm_postcopy_ram_handle_discard(mis, len);
> +
>      default:
>          error_report("VM_COMMAND 0x%x unknown (len 0x%x)", com, len);
>          return -1;
> diff --git a/trace-events b/trace-events
> index 4ff55fe..050f553 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1171,11 +1171,22 @@ qemu_loadvm_state_main(void) ""
>  qemu_loadvm_state_main_quit_parent(void) ""
>  qemu_loadvm_state_post_main(int ret) "%d"
>  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> +loadvm_postcopy_handle_advise(void) ""
> +loadvm_postcopy_handle_end(void) ""
> +loadvm_postcopy_handle_listen(void) ""
> +loadvm_postcopy_handle_run(void) ""
> +loadvm_postcopy_ram_handle_discard(void) ""
> +loadvm_postcopy_ram_handle_discard_end(void) ""
>  loadvm_process_command(uint16_t com, uint16_t len) "com=0x%x len=%d"
>  loadvm_process_command_ping(uint32_t val) "%x"
> +qemu_savevm_send_postcopy_advise(void) ""
> +qemu_savevm_send_postcopy_ram_discard(void) ""
>  savevm_section_start(const char *id, unsigned int section_id) "%s, section_id %u"
>  savevm_section_end(const char *id, unsigned int section_id, int ret) "%s, section_id %u -> %d"
>  savevm_send_ping(uint32_t val) "%x"
> +savevm_send_postcopy_end(void) ""
> +savevm_send_postcopy_listen(void) ""
> +savevm_send_postcopy_run(void) ""
>  savevm_state_begin(void) ""
>  savevm_state_header(void) ""
>  savevm_state_iterate(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream Dr. David Alan Gilbert (git)
@ 2015-03-13  0:55   ` David Gibson
  2015-03-13 11:51     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  0:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:41PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> MIG_CMD_PACKAGED is a migration command that allows a chunk
> of migration stream to be sent in one go, and be received by
> a separate instance of the loadvm loop while not interacting
> with the migration stream.

Hrm.  I'd be more comfortable if the semantics of CMD_PACKAGED were
defined in terms of visible effects on the other end, rather than in
terms of how it's implemented internally.

> This is used by postcopy to load device state (from the package)
> while loading memory pages from the main stream.

Which makes the above paragraph a bit misleading - the whole point
here is that loading the package data *does* interact with the
migration stream - just that it's the migration stream after the end
of the package.

> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/sysemu/sysemu.h |  4 +++
>  savevm.c                | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>  trace-events            |  4 +++
>  3 files changed, 90 insertions(+)
> 
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index d6a6d51..e83bf80 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -87,6 +87,7 @@ enum qemu_vm_cmd {
>      MIG_CMD_INVALID = 0,       /* Must be 0 */
>      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
>      MIG_CMD_PING,              /* Request a PONG on the RP */
> +    MIG_CMD_PACKAGED,          /* Send a wrapped stream within this stream */
>  
>      MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
>                                        warn we might want to do PC */
> @@ -101,6 +102,8 @@ enum qemu_vm_cmd {
>  
>  };
>  
> +#define MAX_VM_CMD_PACKAGED_SIZE (1ul << 24)
> +
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_state_begin(QEMUFile *f,
>                               const MigrationParams *params);
> @@ -113,6 +116,7 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
>                                uint16_t len, uint8_t *data);
>  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
>  void qemu_savevm_send_open_return_path(QEMUFile *f);
> +void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb);
>  void qemu_savevm_send_postcopy_advise(QEMUFile *f);
>  void qemu_savevm_send_postcopy_listen(QEMUFile *f);
>  void qemu_savevm_send_postcopy_run(QEMUFile *f);
> diff --git a/savevm.c b/savevm.c
> index e31ccb0..f65bff3 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -636,6 +636,38 @@ void qemu_savevm_send_open_return_path(QEMUFile *f)
>      qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
>  }
>  
> +/* We have a buffer of data to send; we don't want that all to be loaded
> + * by the command itself, so the command contains just the length of the
> + * extra buffer that we then send straight after it.
> + * TODO: Must be a better way to organise that
> + */
> +void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb)
> +{
> +    size_t cur_iov;
> +    size_t len = qsb_get_length(qsb);
> +    uint32_t tmp;
> +
> +    tmp = cpu_to_be32(len);
> +
> +    trace_qemu_savevm_send_packaged();
> +    qemu_savevm_command_send(f, MIG_CMD_PACKAGED, 4, (uint8_t *)&tmp);
> +
> +    /* all the data follows (concatinating the iov's) */
> +    for (cur_iov = 0; cur_iov < qsb->n_iov; cur_iov++) {
> +        /* The iov entries are partially filled */
> +        size_t towrite = (qsb->iov[cur_iov].iov_len > len) ?
> +                              len :
> +                              qsb->iov[cur_iov].iov_len;
> +        len -= towrite;
> +
> +        if (!towrite) {
> +            break;
> +        }
> +
> +        qemu_put_buffer(f, qsb->iov[cur_iov].iov_base, towrite);
> +    }
> +}
> +
>  /* Send prior to any postcopy transfer */
>  void qemu_savevm_send_postcopy_advise(QEMUFile *f)
>  {
> @@ -1265,6 +1297,48 @@ static int loadvm_process_command_simple_lencheck(const char *name,
>      return 0;
>  }
>  
> +/* Immediately following this command is a blob of data containing an embedded
> + * chunk of migration stream; read it and load it.
> + */
> +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> +                                      uint32_t length)
> +{
> +    int ret;
> +    uint8_t *buffer;
> +    QEMUSizedBuffer *qsb;
> +
> +    trace_loadvm_handle_cmd_packaged(length);
> +
> +    if (length > MAX_VM_CMD_PACKAGED_SIZE) {
> +        error_report("Unreasonably large packaged state: %u", length);
> +        return -1;

It would be a good idea to check this on the send side as well as
receive, wouldn't it?
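
Something like this in qemu_savevm_send_packaged, reusing its existing len
variable (untested sketch):

    if (len > MAX_VM_CMD_PACKAGED_SIZE) {
        error_report("%s: Unreasonably large packaged state: %zu",
                     __func__, len);
        return;    /* or better, propagate an error to the caller */
    }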

> +    }
> +    buffer = g_malloc0(length);
> +    ret = qemu_get_buffer(mis->file, buffer, (int)length);
> +    if (ret != length) {
> +        g_free(buffer);
> +        error_report("CMD_PACKAGED: Buffer receive fail ret=%d length=%d\n",
> +                ret, length);
> +        return (ret < 0) ? ret : -EAGAIN;
> +    }
> +    trace_loadvm_handle_cmd_packaged_received(ret);
> +
> +    /* Setup a dummy QEMUFile that actually reads from the buffer */
> +    qsb = qsb_create(buffer, length);
> +    g_free(buffer); /* Because qsb_create copies */
> +    if (!qsb) {
> +        error_report("Unable to create qsb");
> +    }
> +    QEMUFile *packf = qemu_bufopen("r", qsb);
> +
> +    ret = qemu_loadvm_state_main(packf, mis);
> +    trace_loadvm_handle_cmd_packaged_main(ret);
> +    qemu_fclose(packf);
> +    qsb_free(qsb);
> +
> +    return ret;
> +}
> +
>  /*
>   * Process an incoming 'QEMU_VM_COMMAND'
>   * negative return on error (will issue error message)
> @@ -1315,6 +1389,14 @@ static int loadvm_process_command(QEMUFile *f)
>          migrate_send_rp_pong(mis, tmp32);
>          break;
>  
> +    case MIG_CMD_PACKAGED:
> +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_PACKAGED",
> +            len, 4)) {
> +            return -1;
> +         }
> +        tmp32 = qemu_get_be32(f);
> +        return loadvm_handle_cmd_packaged(mis, tmp32);
> +
>      case MIG_CMD_POSTCOPY_ADVISE:
>          if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_ADVISE",
>                                                     len, 16)) {
> diff --git a/trace-events b/trace-events
> index 050f553..cbf995c 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1171,6 +1171,10 @@ qemu_loadvm_state_main(void) ""
>  qemu_loadvm_state_main_quit_parent(void) ""
>  qemu_loadvm_state_post_main(int ret) "%d"
>  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> +qemu_savevm_send_packaged(void) ""
> +loadvm_handle_cmd_packaged(unsigned int length) "%u"
> +loadvm_handle_cmd_packaged_main(int ret) "%d"
> +loadvm_handle_cmd_packaged_received(int ret) "%d"
>  loadvm_postcopy_handle_advise(void) ""
>  loadvm_postcopy_handle_end(void) ""
>  loadvm_postcopy_handle_listen(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy Dr. David Alan Gilbert (git)
@ 2015-03-13  1:00   ` David Gibson
  2015-03-13 10:19     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  1:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Modify save_live_pending to return separate postcopiable and
> non-postcopiable counts.
> 
> Add 'can_postcopy' to allow a device to state if it can postcopy

What's the purpose of the can_postcopy callback?  There are no callers
in this patch - is it still necessary with the change to
save_live_pending?

> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c                 | 15 +++++++++++++--
>  include/migration/vmstate.h | 10 ++++++++--
>  include/sysemu/sysemu.h     |  4 +++-
>  migration/block.c           |  7 +++++--
>  migration/migration.c       |  9 +++++++--
>  savevm.c                    | 21 +++++++++++++++++----
>  trace-events                |  2 +-
>  7 files changed, 54 insertions(+), 14 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index fe0df0d..7bc5fa6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -997,7 +997,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> -static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> +static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> +                             uint64_t *non_postcopiable_pending,
> +                             uint64_t *postcopiable_pending)
>  {
>      uint64_t remaining_size;
>  
> @@ -1009,7 +1011,9 @@ static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
>          qemu_mutex_unlock_iothread();
>          remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
>      }
> -    return remaining_size;
> +
> +    *non_postcopiable_pending = 0;
> +    *postcopiable_pending = remaining_size;
>  }
>  
>  static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
> @@ -1204,6 +1208,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      return ret;
>  }
>  
> +/* RAM's always up for postcopying */
> +static bool ram_can_postcopy(void *opaque)
> +{
> +    return true;
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_live_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> @@ -1211,6 +1221,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>      .save_live_pending = ram_save_pending,
>      .load_state = ram_load,
>      .cancel = ram_migration_cancel,
> +    .can_postcopy = ram_can_postcopy,
>  };
>  
>  void ram_mig_init(void)
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 18da207..c9ec74a 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -54,8 +54,14 @@ typedef struct SaveVMHandlers {
>  
>      /* This runs outside the iothread lock!  */
>      int (*save_live_setup)(QEMUFile *f, void *opaque);
> -    uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
> -
> +    /*
> +     * postcopiable_pending must return 0 unless the can_postcopy
> +     * handler returns true.
> +     */
> +    void (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size,
> +                              uint64_t *non_postcopiable_pending,
> +                              uint64_t *postcopiable_pending);
> +    bool (*can_postcopy)(void *opaque);
>      LoadStateHandler *load_state;
>  } SaveVMHandlers;
>  
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index e83bf80..5f518b3 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -111,7 +111,9 @@ void qemu_savevm_state_header(QEMUFile *f);
>  int qemu_savevm_state_iterate(QEMUFile *f);
>  void qemu_savevm_state_complete(QEMUFile *f);
>  void qemu_savevm_state_cancel(void);
> -uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
> +void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
> +                               uint64_t *res_non_postcopiable,
> +                               uint64_t *res_postcopiable);
>  void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
>                                uint16_t len, uint8_t *data);
>  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> diff --git a/migration/block.c b/migration/block.c
> index 0c76106..0f6f209 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -754,7 +754,9 @@ static int block_save_complete(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> -static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> +static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> +                               uint64_t *non_postcopiable_pending,
> +                               uint64_t *postcopiable_pending)
>  {
>      /* Estimate pending number of bytes to send */
>      uint64_t pending;
> @@ -773,7 +775,8 @@ static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
>      qemu_mutex_unlock_iothread();
>  
>      DPRINTF("Enter save live pending  %" PRIu64 "\n", pending);
> -    return pending;
> +    *non_postcopiable_pending = pending;
> +    *postcopiable_pending = 0;
>  }
>  
>  static int block_load(QEMUFile *f, void *opaque, int version_id)
> diff --git a/migration/migration.c b/migration/migration.c
> index 2e6adca..a4fc7d7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -868,8 +868,13 @@ static void *migration_thread(void *opaque)
>          uint64_t pending_size;
>  
>          if (!qemu_file_rate_limit(s->file)) {
> -            pending_size = qemu_savevm_state_pending(s->file, max_size);
> -            trace_migrate_pending(pending_size, max_size);
> +            uint64_t pend_post, pend_nonpost;
> +
> +            qemu_savevm_state_pending(s->file, max_size, &pend_nonpost,
> +                                      &pend_post);
> +            pending_size = pend_nonpost + pend_post;
> +            trace_migrate_pending(pending_size, max_size,
> +                                  pend_post, pend_nonpost);
>              if (pending_size && pending_size >= max_size) {
>                  qemu_savevm_state_iterate(s->file);
>              } else {
> diff --git a/savevm.c b/savevm.c
> index df48ba8..e301a0a 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -944,10 +944,20 @@ void qemu_savevm_state_complete(QEMUFile *f)
>      qemu_fflush(f);
>  }
>  
> -uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
> +/* Give an estimate of the amount left to be transferred,
> + * the result is split into the amount for units that can and
> + * for units that can't do postcopy.
> + */
> +void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
> +                               uint64_t *res_non_postcopiable,
> +                               uint64_t *res_postcopiable)
>  {
>      SaveStateEntry *se;
> -    uint64_t ret = 0;
> +    uint64_t tmp_non_postcopiable, tmp_postcopiable;
> +
> +    *res_non_postcopiable = 0;
> +    *res_postcopiable = 0;
> +
>  
>      QTAILQ_FOREACH(se, &savevm_handlers, entry) {
>          if (!se->ops || !se->ops->save_live_pending) {
> @@ -958,9 +968,12 @@ uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
>                  continue;
>              }
>          }
> -        ret += se->ops->save_live_pending(f, se->opaque, max_size);
> +        se->ops->save_live_pending(f, se->opaque, max_size,
> +                                   &tmp_non_postcopiable, &tmp_postcopiable);
> +
> +        *res_postcopiable += tmp_postcopiable;
> +        *res_non_postcopiable += tmp_non_postcopiable;
>      }
> -    return ret;
>  }
>  
>  void qemu_savevm_state_cancel(void)
> diff --git a/trace-events b/trace-events
> index cbf995c..83312b6 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1400,7 +1400,7 @@ migrate_fd_cleanup(void) ""
>  migrate_fd_cleanup_src_rp(void) ""
>  migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
> -migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
> +migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
>  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
>  open_outgoing_return_path(void) ""
>  open_outgoing_return_path_continue(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test Dr. David Alan Gilbert (git)
@ 2015-03-13  1:23   ` David Gibson
  2015-03-13 10:41     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  1:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:45PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Provide a check to see if the OS we're running on has all the bits
> needed for postcopy.
> 
> Creates postcopy-ram.c which will get most of the other helpers we need.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/postcopy-ram.h |  19 +++++
>  migration/Makefile.objs          |   2 +-
>  migration/postcopy-ram.c         | 161 +++++++++++++++++++++++++++++++++++++++
>  savevm.c                         |   5 ++
>  4 files changed, 186 insertions(+), 1 deletion(-)
>  create mode 100644 include/migration/postcopy-ram.h
>  create mode 100644 migration/postcopy-ram.c
> 
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> new file mode 100644
> index 0000000..d81934f
> --- /dev/null
> +++ b/include/migration/postcopy-ram.h
> @@ -0,0 +1,19 @@
> +/*
> + * Postcopy migration for RAM
> + *
> + * Copyright 2013 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *  Dave Gilbert  <dgilbert@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +#ifndef QEMU_POSTCOPY_RAM_H
> +#define QEMU_POSTCOPY_RAM_H
> +
> +/* Return true if the host supports everything we need to do postcopy-ram */
> +bool postcopy_ram_supported_by_host(void);
> +
> +#endif
> diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> index d929e96..0cac6d7 100644
> --- a/migration/Makefile.objs
> +++ b/migration/Makefile.objs
> @@ -1,7 +1,7 @@
>  common-obj-y += migration.o tcp.o
>  common-obj-y += vmstate.o
>  common-obj-y += qemu-file.o qemu-file-buf.o qemu-file-unix.o qemu-file-stdio.o
> -common-obj-y += xbzrle.o
> +common-obj-y += xbzrle.o postcopy-ram.o
>  
>  common-obj-$(CONFIG_RDMA) += rdma.o
>  common-obj-$(CONFIG_POSIX) += exec.o unix.o fd.o
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> new file mode 100644
> index 0000000..a0e20b2
> --- /dev/null
> +++ b/migration/postcopy-ram.c
> @@ -0,0 +1,161 @@
> +/*
> + * Postcopy migration for RAM
> + *
> + * Copyright 2013-2014 Red Hat, Inc. and/or its affiliates
> + *
> + * Authors:
> + *  Dave Gilbert  <dgilbert@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +/*
> + * Postcopy is a migration technique where the execution flips from the
> + * source to the destination before all the data has been copied.
> + */
> +
> +#include <glib.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include "qemu-common.h"
> +#include "migration/migration.h"
> +#include "migration/postcopy-ram.h"
> +#include "sysemu/sysemu.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +/* Postcopy needs to detect accesses to pages that haven't yet been copied
> + * across, and efficiently map new pages in, the techniques for doing this
> + * are target OS specific.
> + */
> +#if defined(__linux__)
> +
> +#include <sys/mman.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <asm/types.h> /* for __u64 */
> +#include <linux/userfaultfd.h>
> +
> +#ifdef HOST_X86_64
> +#ifndef __NR_userfaultfd
> +#define __NR_userfaultfd 323

Shouldn't this come from the kernel headers imported in the previous
patch?  Rather than having an arch-specific hack.

> +#endif
> +#endif
> +
> +#endif
> +
> +#if defined(__linux__) && defined(__NR_userfaultfd)
> +
> +static bool ufd_version_check(int ufd)
> +{
> +    struct uffdio_api api_struct;
> +    uint64_t feature_mask;
> +
> +    api_struct.api = UFFD_API;
> +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> +        perror("postcopy_ram_supported_by_host: UFFDIO_API failed");

This should be error_report(), not perror(), to match qemu
conventions, shouldn't it?
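
i.e. something like:

    error_report("%s: UFFDIO_API failed: %s", __func__, strerror(errno));

(with <errno.h>/<string.h> pulled in if they aren't already).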

> +        return false;
> +    }
> +
> +    feature_mask = (__u64)1 << _UFFDIO_REGISTER |
> +                   (__u64)1 << _UFFDIO_UNREGISTER;
> +    if ((api_struct.ioctls & feature_mask) != feature_mask) {
> +        error_report("Missing userfault features: %" PRIu64,
> +                     (uint64_t)(~api_struct.ioctls & feature_mask));
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +bool postcopy_ram_supported_by_host(void)
> +{
> +    long pagesize = getpagesize();
> +    int ufd = -1;
> +    bool ret = false; /* Error unless we change it */
> +    void *testarea = NULL;
> +    struct uffdio_register reg_struct;
> +    struct uffdio_range range_struct;
> +    uint64_t feature_mask;
> +
> +    if ((1ul << qemu_target_page_bits()) > pagesize) {
> +        /* The PMI code doesn't yet deal with TPS>HPS */
> +        error_report("Target page size bigger than host page size");
> +        goto out;
> +    }
> +
> +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> +    if (ufd == -1) {
> +        perror("postcopy_ram_supported_by_host: userfaultfd not available");

And here as well?  And several places below.

> +        goto out;
> +    }
> +
> +    /* Version and features check */
> +    if (!ufd_version_check(ufd)) {
> +        goto out;
> +    }
> +
> +    /*
> +     *  We need to check that the ops we need are supported on anon memory
> +     *  To do that we need to register a chunk and see the flags that
> +     *  are returned.
> +     */
> +    testarea = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> +                                    MAP_ANONYMOUS, -1, 0);
> +    if (!testarea) {

This should be (testarea == MAP_FAILED).  Otherwise mmap() failures
will always trip the assert below.
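
i.e.:

    if (testarea == MAP_FAILED) {
        error_report("%s: Failed to map test area: %s",
                     __func__, strerror(errno));
        goto out;
    }

(and the cleanup under "out:" then wants a matching != MAP_FAILED test
before the munmap).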

> +        perror("postcopy_ram_supported_by_host: Failed to map test area");
> +        goto out;
> +    }
> +    g_assert(((size_t)testarea & (pagesize-1)) == 0);
> +
> +    reg_struct.range.start = (uint64_t)(uintptr_t)testarea;
> +    reg_struct.range.len = (uint64_t)pagesize;
> +    reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
> +
> +    if (ioctl(ufd, UFFDIO_REGISTER, &reg_struct)) {
> +        perror("postcopy_ram_supported_by_host userfault register");
> +        goto out;
> +    }
> +
> +    range_struct.start = (uint64_t)(uintptr_t)testarea;
> +    range_struct.len = (uint64_t)pagesize;

I don't think you need the (uint64_t) casts (though you do need the
uintptr_t cast).  I think the assignment will do an implicit
conversion without problems.
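
i.e. just:

    range_struct.start = (uintptr_t)testarea;
    range_struct.len = pagesize;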

> +    if (ioctl(ufd, UFFDIO_UNREGISTER, &range_struct)) {
> +        perror("postcopy_ram_supported_by_host userfault unregister");
> +        goto out;
> +    }
> +
> +    feature_mask = (__u64)1 << _UFFDIO_WAKE |
> +                   (__u64)1 << _UFFDIO_COPY |
> +                   (__u64)1 << _UFFDIO_ZEROPAGE;
> +    if ((reg_struct.ioctls & feature_mask) != feature_mask) {
> +        error_report("Missing userfault map features: %" PRIu64,

I'm guessing you want PRIx64, in order to make the feature mask at
least semi-readable.

> +                     (uint64_t)(~reg_struct.ioctls & feature_mask));
> +        goto out;
> +    }
> +
> +    /* Success! */
> +    ret = true;
> +out:
> +    if (testarea) {
> +        munmap(testarea, pagesize);
> +    }
> +    if (ufd != -1) {
> +        close(ufd);
> +    }
> +    return ret;
> +}
> +
> +#else
> +/* No target OS support, stubs just fail */
> +
> +bool postcopy_ram_supported_by_host(void)
> +{
> +    error_report("%s: No OS support", __func__);
> +    return false;
> +}
> +
> +#endif
> +
> diff --git a/savevm.c b/savevm.c
> index e301a0a..2ea4c76 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -33,6 +33,7 @@
>  #include "qemu/timer.h"
>  #include "audio/audio.h"
>  #include "migration/migration.h"
> +#include "migration/postcopy-ram.h"
>  #include "qemu/sockets.h"
>  #include "qemu/queue.h"
>  #include "sysemu/cpus.h"
> @@ -1109,6 +1110,10 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
>          return -1;
>      }
>  
> +    if (!postcopy_ram_supported_by_host()) {
> +        return -1;
> +    }
> +
>      if (remote_hps != getpagesize())  {
>          /*
>           * Some combinations of mismatch are probably possible but it gets

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy Dr. David Alan Gilbert (git)
@ 2015-03-13  1:26   ` David Gibson
  2015-03-13 11:19     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  1:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Once postcopy is enabled (with migrate_set_capability), the migration
> will still start on precopy mode.  To cause a transition into postcopy
> the:
> 
>   migrate_start_postcopy
> 
> command must be issued.  Postcopy will start sometime after this
> (when it's next checked in the migration loop).
> 
> Issuing the command before migration has started will error,
> and issuing after it has finished is ignored.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  hmp-commands.hx               | 15 +++++++++++++++
>  hmp.c                         |  7 +++++++
>  hmp.h                         |  1 +
>  include/migration/migration.h |  3 +++
>  migration/migration.c         | 22 ++++++++++++++++++++++
>  qapi-schema.json              |  8 ++++++++
>  qmp-commands.hx               | 19 +++++++++++++++++++
>  7 files changed, 75 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index e37bc8b..03b8b78 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
>  ETEXI
>  
>      {
> +        .name       = "migrate_start_postcopy",
> +        .args_type  = "",
> +        .params     = "",
> +        .help       = "Switch migration to postcopy mode",
> +        .mhandler.cmd = hmp_migrate_start_postcopy,
> +    },
> +
> +STEXI
> +@item migrate_start_postcopy
> +@findex migrate_start_postcopy
> +Switch in-progress migration to postcopy mode. Ignored after the end of
> +migration (or once already in postcopy).
> +ETEXI
> +
> +    {
>          .name       = "client_migrate_info",
>          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>          .params     = "protocol hostname port tls-port cert-subject",
> diff --git a/hmp.c b/hmp.c
> index b47f331..df9736c 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
>      }
>  }
>  
> +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> +{
> +    Error *err = NULL;
> +    qmp_migrate_start_postcopy(&err);
> +    hmp_handle_error(mon, &err);
> +}
> +
>  void hmp_set_password(Monitor *mon, const QDict *qdict)
>  {
>      const char *protocol  = qdict_get_str(qdict, "protocol");
> diff --git a/hmp.h b/hmp.h
> index 4bb5dca..da1334f 100644
> --- a/hmp.h
> +++ b/hmp.h
> @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
>  void hmp_set_password(Monitor *mon, const QDict *qdict);
>  void hmp_expire_password(Monitor *mon, const QDict *qdict);
>  void hmp_eject(Monitor *mon, const QDict *qdict);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e6a814a..293c83e 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -104,6 +104,9 @@ struct MigrationState
>      int64_t xbzrle_cache_size;
>      int64_t setup_time;
>      int64_t dirty_sync_count;
> +
> +    /* Flag set once the migration has been asked to enter postcopy */
> +    bool start_postcopy;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> diff --git a/migration/migration.c b/migration/migration.c
> index a4fc7d7..43ca656 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>      }
>  }
>  
> +void qmp_migrate_start_postcopy(Error **errp)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    if (!migrate_postcopy_ram()) {
> +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> +                         " the start of migration");
> +        return;
> +    }
> +
> +    if (s->state == MIG_STATE_NONE) {
> +        error_setg(errp, "Postcopy must be started after migration has been"
> +                         " started");
> +        return;
> +    }
> +    /*
> +     * we don't error if migration has finished since that would be racy
> +     * with issuing this command.
> +     */
> +    atomic_set(&s->start_postcopy, true);

Why atomic_set?

> +}
> +
>  /* shared migration helpers */
>  
>  static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> diff --git a/qapi-schema.json b/qapi-schema.json
> index a8af1cb..7ff61e9 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -542,6 +542,14 @@
>  { 'command': 'query-migrate-capabilities', 'returns':   ['MigrationCapabilityStatus']}
>  
>  ##
> +# @migrate-start-postcopy
> +#
> +# Switch migration to postcopy mode
> +#
> +# Since: 2.3
> +{ 'command': 'migrate-start-postcopy' }
> +
> +##
>  # @MouseInfo:
>  #
>  # Information about a mouse device.
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index a85d847..25d2208 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -685,6 +685,25 @@ Example:
>  
>  EQMP
>      {
> +        .name       = "migrate-start-postcopy",
> +        .args_type  = "",
> +        .mhandler.cmd_new = qmp_marshal_input_migrate_start_postcopy,
> +    },
> +
> +SQMP
> +migrate-start-postcopy
> +----------------------
> +
> +Switch an in-progress migration to postcopy mode. Ignored after the end of
> +migration (or once already in postcopy).
> +
> +Example:
> +-> { "execute": "migrate-start-postcopy" }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>          .name       = "query-migrate-cache-size",
>          .args_type  = "",
>          .mhandler.cmd_new = qmp_marshal_input_query_migrate_cache_size,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state Dr. David Alan Gilbert (git)
@ 2015-03-13  4:45   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-13  4:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 632 bytes --]

On Wed, Feb 25, 2015 at 04:51:47PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> 'MIG_STATE_POSTCOPY_ACTIVE' is entered after migrate_start_postcopy
> 
> 'migration_postcopy_phase' is provided for other sections to know if
> they're in postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes Dr. David Alan Gilbert (git)
@ 2015-03-13  4:58   ` David Gibson
  2015-03-13 12:25     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  4:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Wed, Feb 25, 2015 at 04:51:48PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When postcopy calls qemu_savevm_state_complete it's not really
> the end of migration, so skip:

Given that, maybe the name should change..

>    a) Finishing postcopiable iterative devices - they'll carry on
>    b) The termination byte on the end of the stream.
> 
> We then also add:
>   qemu_savevm_state_postcopy_complete
> which is called at the end of a postcopy migration to call the
> complete methods on devices skipped in the _complete call.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Otherwise,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure Dr. David Alan Gilbert (git)
@ 2015-03-13  5:19   ` David Gibson
  2015-03-13 13:47     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-13  5:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 14490 bytes --]

On Wed, Feb 25, 2015 at 04:51:49PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The PMI holds the state of each page on the incoming side,
> so that we can tell if the page is missing, already received
> or there is a request outstanding for it.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h    |  18 ++++
>  include/migration/postcopy-ram.h |  12 +++
>  include/qemu/typedefs.h          |   1 +
>  migration/postcopy-ram.c         | 223 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 254 insertions(+)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index b44b9b2..86200b9 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -48,6 +48,23 @@ enum mig_rpcomm_cmd {
>      MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
>  };
>  
> +/* Postcopy page-map-incoming - data about each page on the inbound side */
> +typedef enum {
> +   POSTCOPY_PMI_MISSING    = 0, /* page hasn't yet been received */

This appears to be a 3 space indent instead of the usual 4.

> +   POSTCOPY_PMI_REQUESTED  = 1, /* Kernel asked for a page, not yet got it */
> +   POSTCOPY_PMI_RECEIVED   = 2, /* We've got the page */
> +} PostcopyPMIState;

TBH, I'm not sure this enum actually helps anything.  I wonder if
things might actually be clearer if you simply treat the received and
requested bitmaps separately.
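
Roughly (sketch only, field names illustrative):

    struct PostcopyPMI {
        QemuMutex      mutex;
        unsigned long *received_map;   /* 1 bit/target page: page has arrived */
        unsigned long *requested_map;  /* 1 bit/target page: kernel asked for it */
        unsigned long  host_mask;      /* as in the patch below */
        unsigned long  host_bits;      /* as in the patch below */
    };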

> +struct PostcopyPMI {
> +    QemuMutex      mutex;
> +    unsigned long *state0;        /* Together with state1 form a */
> +    unsigned long *state1;        /* PostcopyPMIState */

The comments on the lines above don't appear to shed any light on anything.

> +    unsigned long  host_mask;     /* A mask with enough bits set to cover one
> +                                     host page in the PMI */
> +    unsigned long  host_bits;     /* The number of bits in the map representing
> +                                     one host page */

I find the host_bits name fairly confusing.  Maybe "tp_per_hp"?

> +};
> +
>  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
>  
>  typedef enum {
> @@ -69,6 +86,7 @@ struct MigrationIncomingState {
>  
>      QEMUFile *return_path;
>      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> +    PostcopyPMI    postcopy_pmi;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> index d81934f..e93ee8a 100644
> --- a/include/migration/postcopy-ram.h
> +++ b/include/migration/postcopy-ram.h
> @@ -13,7 +13,19 @@
>  #ifndef QEMU_POSTCOPY_RAM_H
>  #define QEMU_POSTCOPY_RAM_H
>  
> +#include "migration/migration.h"
> +
>  /* Return true if the host supports everything we need to do postcopy-ram */
>  bool postcopy_ram_supported_by_host(void);
>  
> +/*
> + * In 'advise' mode record that a page has been received.
> + */
> +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> +                                 size_t bitmap_index);
> +
> +void postcopy_pmi_destroy(MigrationIncomingState *mis);
> +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> +                                size_t start, size_t npages);
> +void postcopy_pmi_dump(MigrationIncomingState *mis);
>  #endif
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 611db46..924eeb6 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
>  typedef struct PCIHostState PCIHostState;
>  typedef struct PCMCIACardState PCMCIACardState;
>  typedef struct PixelFormat PixelFormat;
> +typedef struct PostcopyPMI PostcopyPMI;
>  typedef struct PropertyInfo PropertyInfo;
>  typedef struct Property Property;
>  typedef struct QEMUBH QEMUBH;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a0e20b2..4f29055 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -24,6 +24,7 @@
>  #include "migration/migration.h"
>  #include "migration/postcopy-ram.h"
>  #include "sysemu/sysemu.h"
> +#include "qemu/bitmap.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  
> @@ -49,6 +50,220 @@
>  
>  #if defined(__linux__) && defined(__NR_userfaultfd)
>  
> +/* ---------------------------------------------------------------------- */
> +/* Postcopy pagemap-inbound (pmi) - data structures that record the       */
> +/* state of each page used by the inbound postcopy                        */
> +/* It's a pair of bitmaps (of the same structure as the migration bitmaps)*/
> +/* holding one bit per target-page, although most operations work on host */
> +/* pages, the exception being a hook that receives incoming pages off the */
> +/* migration stream which come in a TP at a time, although the source     */
> +/* _should_ guarantee it sends a sequence of TPs representing HPs during  */
> +/* the postcopy phase, there is no such guarantee during precopy.  We     */
> +/* could boil this down to only holding one bit per-host page, but we lose*/
> +/* sanity checking that we really do get whole host-pages from the source.*/
> +__attribute__ (( unused )) /* Until later in patch series */
> +static void postcopy_pmi_init(MigrationIncomingState *mis, size_t ram_pages)
> +{
> +    unsigned int tpb = qemu_target_page_bits();
> +    unsigned long host_bits;
> +
> +    qemu_mutex_init(&mis->postcopy_pmi.mutex);
> +    mis->postcopy_pmi.state0 = bitmap_new(ram_pages);
> +    mis->postcopy_pmi.state1 = bitmap_new(ram_pages);
> +    bitmap_clear(mis->postcopy_pmi.state0, 0, ram_pages);
> +    bitmap_clear(mis->postcopy_pmi.state1, 0, ram_pages);
> +    /*
> +     * Each bit in the map represents one 'target page' which is no bigger
> +     * than a host page but can be smaller.  It's useful to have some
> +     * convenience masks for later
> +     */
> +
> +    /*
> +     * The number of bits one host page takes up in the bitmap
> +     * e.g. on a 64k host page, 4k Target page, host_bits=64/4=16
> +     */
> +    host_bits = getpagesize() / (1ul << tpb);

That's equivalent to getpagesize() >> tpb, isn't it?

> +    assert(is_power_of_2(host_bits));
> +
> +    mis->postcopy_pmi.host_bits = host_bits;
> +
> +    if (host_bits < BITS_PER_LONG) {
> +        /* A mask starting at bit 0 containing host_bits continuous set bits */
> +        mis->postcopy_pmi.host_mask =  (1ul << host_bits) - 1;
> +    } else {
> +        /*
> +         * This is a host where the ratio between host and target pages is
> +         * bigger than the size of our longs, so we can't make a mask
> +         * but we are only losing sanity checking if we just check one long's
> +         * worth of bits.
> +         */
> +        mis->postcopy_pmi.host_mask = ~0l;
> +    }
> +
> +
> +    assert((ram_pages % host_bits) == 0);
> +}
> +
> +void postcopy_pmi_destroy(MigrationIncomingState *mis)
> +{
> +    g_free(mis->postcopy_pmi.state0);
> +    mis->postcopy_pmi.state0 = NULL;
> +    g_free(mis->postcopy_pmi.state1);
> +    mis->postcopy_pmi.state1 = NULL;
> +    qemu_mutex_destroy(&mis->postcopy_pmi.mutex);
> +}
> +
> +/*
> + * Mark a set of pages in the PMI as being clear; this is used by the discard
> + * at the start of postcopy, and before the postcopy stream starts.
> + */
> +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> +                                size_t start, size_t npages)
> +{
> +    /* Clear to state 0 = missing */
> +    bitmap_clear(mis->postcopy_pmi.state0, start, npages);
> +    bitmap_clear(mis->postcopy_pmi.state1, start, npages);
> +}
> +
> +/*
> + * Test a host-page worth of bits in the map starting at bitmap_index
> + * The bits should all be consistent
> + */
> +static bool test_hpbits(MigrationIncomingState *mis,
> +                        size_t bitmap_index, unsigned long *map)
> +{
> +    long masked;
> +
> +    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);
> +
> +    masked = (map[BIT_WORD(bitmap_index)] >>
> +               (bitmap_index % BITS_PER_LONG)) &
> +             mis->postcopy_pmi.host_mask;
> +
> +    assert((masked == 0) || (masked == mis->postcopy_pmi.host_mask));
> +    return !!masked;
> +}
> +
> +/*
> + * Set host-page worth of bits in the map starting at bitmap_index
> + * to the given state
> + */
> +static void set_hp(MigrationIncomingState *mis,
> +                   size_t bitmap_index, PostcopyPMIState state)
> +{
> +    long shifted_mask = mis->postcopy_pmi.host_mask <<
> +                        (bitmap_index % BITS_PER_LONG);
> +
> +    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);

assert(state != 0)?

> +
> +    if (state & 1) {

Using the symbolic constants for PostcopyPMIState values here might be
better.

> +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
> +    } else {
> +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> +    }
> +    if (state & 2) {
> +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
> +    } else {
> +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> +    }
> +}
> +
> +/*
> + * Retrieve the state of the given page
> + * Note: This version for use by callers already holding the lock
> + */
> +static PostcopyPMIState postcopy_pmi_get_state_nolock(
> +                            MigrationIncomingState *mis,
> +                            size_t bitmap_index)
> +{
> +    bool b0, b1;
> +
> +    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
> +    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
> +
> +    return (b0 ? 1 : 0) + (b1 ? 2 : 0);

Ugh.. this is a hidden dependency on the PostcopyPMIState enum
elements never changing value.  Safer to code it as:
      if (!b0 && !b1) {
          return POSTCOPY_PMI_MISSING;
      } else if (...)
           ...

and let gcc sort it out.
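
i.e. something like this (sketch only, helper name made up, enum values
as defined earlier in the patch):

    static PostcopyPMIState pmi_state_from_bits(bool b0, bool b1)
    {
        if (!b0 && !b1) {
            return POSTCOPY_PMI_MISSING;
        } else if (b0 && !b1) {
            return POSTCOPY_PMI_REQUESTED;
        } else if (!b0 && b1) {
            return POSTCOPY_PMI_RECEIVED;
        }
        /* Both bits set has no meaning in the three-state enum */
        abort();
    }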

> +}
> +
> +/* Retrieve the state of the given page */
> +__attribute__ (( unused )) /* Until later in patch series */
> +static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
> +                                               size_t bitmap_index)
> +{
> +    PostcopyPMIState ret;
> +    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
> +    ret = postcopy_pmi_get_state_nolock(mis, bitmap_index);
> +    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
> +
> +    return ret;
> +}
> +
> +/*
> + * Set the page state to the given state if the previous state was as expected
> + * Return the actual previous state.
> + */
> +__attribute__ (( unused )) /* Until later in patch series */
> +static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
> +                                           size_t bitmap_index,
> +                                           PostcopyPMIState expected_state,
> +                                           PostcopyPMIState new_state)
> +{
> +    PostcopyPMIState old_state;
> +
> +    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
> +    old_state = postcopy_pmi_get_state_nolock(mis, bitmap_index);
> +
> +    if (old_state == expected_state) {
> +        switch (new_state) {
> +        case POSTCOPY_PMI_MISSING:
> +            assert(0); /* This shouldn't happen - use discard_range */
> +            break;
> +
> +        case POSTCOPY_PMI_REQUESTED:
> +            assert(old_state == POSTCOPY_PMI_MISSING);
> +            /* missing -> requested */
> +            set_hp(mis, bitmap_index, POSTCOPY_PMI_REQUESTED);
> +            break;
> +
> +        case POSTCOPY_PMI_RECEIVED:
> +            assert(old_state == POSTCOPY_PMI_MISSING ||
> +                   old_state == POSTCOPY_PMI_REQUESTED);
> +            /* -> received */
> +            set_hp(mis, bitmap_index, POSTCOPY_PMI_RECEIVED);
> +            break;
> +        }
> +    }
> +
> +    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
> +    return old_state;
> +}
> +
> +/*
> + * Useful when debugging postcopy, although if it failed early the
> + * received map can be quite sparse and thus big when dumped.
> + */
> +void postcopy_pmi_dump(MigrationIncomingState *mis)
> +{
> +    fprintf(stderr, "postcopy_pmi_dump: bit 0\n");
> +    ram_debug_dump_bitmap(mis->postcopy_pmi.state0, false);
> +    fprintf(stderr, "postcopy_pmi_dump: bit 1\n");
> +    ram_debug_dump_bitmap(mis->postcopy_pmi.state1, true);
> +    fprintf(stderr, "postcopy_pmi_dump: end\n");
> +}
> +
> +/* Called by ram_load prior to mapping the page */
> +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> +                                 size_t bitmap_index)
> +{
> +    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
> +        /*
> +         * If we're in precopy-advise mode we need to track received pages even
> +         * though we don't need to place pages atomically yet.
> +         * In advise mode there's only a single thread, so don't need locks
> +         */
> +        set_bit(bitmap_index, mis->postcopy_pmi.state1); /* 2=received */

Yeah.. so this bypasses postcopy_pmi_{get,change}_state, which again
makes me wonder whether the enum serves any purpose.

> +    }
> +}
> +
>  static bool ufd_version_check(int ufd)
>  {
>      struct uffdio_api api_struct;
> @@ -71,6 +286,7 @@ static bool ufd_version_check(int ufd)
>      return true;
>  }
>  
> +

Extraneous whitespace change.

>  bool postcopy_ram_supported_by_host(void)
>  {
>      long pagesize = getpagesize();
> @@ -157,5 +373,12 @@ bool postcopy_ram_supported_by_host(void)
>      return false;
>  }
>  
> +/* Called by ram_load prior to mapping the page */
> +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> +                                 size_t bitmap_index)
> +{
> +    /* We don't support postcopy so don't care */
> +}
> +
>  #endif
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-13  1:00   ` David Gibson
@ 2015-03-13 10:19     ` Dr. David Alan Gilbert
  2015-03-16  6:18       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 10:19 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Modify save_live_pending to return separate postcopiable and
> > non-postcopiable counts.
> > 
> > Add 'can_postcopy' to allow a device to state if it can postcopy
> 
> What's the purpose of the can_postcopy callback?  There are no callers
> in this patch - is it still necessary with the change to
> save_live_pending?

The patch 'qemu_savevm_state_complete: Postcopy changes' uses
it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
to decide which devices must be completed at that point.
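
Roughly the shape is (illustrative sketch, not the actual patch):

    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
        if (se->ops && se->ops->can_postcopy &&
            se->ops->can_postcopy(se->opaque)) {
            /* Device stays live into postcopy; it gets completed later
             * by qemu_savevm_state_postcopy_complete() */
            continue;
        }
        /* ... otherwise call this device's completion hook here ... */
    }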

Dave

> 
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  arch_init.c                 | 15 +++++++++++++--
> >  include/migration/vmstate.h | 10 ++++++++--
> >  include/sysemu/sysemu.h     |  4 +++-
> >  migration/block.c           |  7 +++++--
> >  migration/migration.c       |  9 +++++++--
> >  savevm.c                    | 21 +++++++++++++++++----
> >  trace-events                |  2 +-
> >  7 files changed, 54 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch_init.c b/arch_init.c
> > index fe0df0d..7bc5fa6 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -997,7 +997,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> >      return 0;
> >  }
> >  
> > -static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> > +static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> > +                             uint64_t *non_postcopiable_pending,
> > +                             uint64_t *postcopiable_pending)
> >  {
> >      uint64_t remaining_size;
> >  
> > @@ -1009,7 +1011,9 @@ static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> >          qemu_mutex_unlock_iothread();
> >          remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
> >      }
> > -    return remaining_size;
> > +
> > +    *non_postcopiable_pending = 0;
> > +    *postcopiable_pending = remaining_size;
> >  }
> >  
> >  static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
> > @@ -1204,6 +1208,12 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      return ret;
> >  }
> >  
> > +/* RAM's always up for postcopying */
> > +static bool ram_can_postcopy(void *opaque)
> > +{
> > +    return true;
> > +}
> > +
> >  static SaveVMHandlers savevm_ram_handlers = {
> >      .save_live_setup = ram_save_setup,
> >      .save_live_iterate = ram_save_iterate,
> > @@ -1211,6 +1221,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> >      .save_live_pending = ram_save_pending,
> >      .load_state = ram_load,
> >      .cancel = ram_migration_cancel,
> > +    .can_postcopy = ram_can_postcopy,
> >  };
> >  
> >  void ram_mig_init(void)
> > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> > index 18da207..c9ec74a 100644
> > --- a/include/migration/vmstate.h
> > +++ b/include/migration/vmstate.h
> > @@ -54,8 +54,14 @@ typedef struct SaveVMHandlers {
> >  
> >      /* This runs outside the iothread lock!  */
> >      int (*save_live_setup)(QEMUFile *f, void *opaque);
> > -    uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
> > -
> > +    /*
> > +     * postcopiable_pending must return 0 unless the can_postcopy
> > +     * handler returns true.
> > +     */
> > +    void (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size,
> > +                              uint64_t *non_postcopiable_pending,
> > +                              uint64_t *postcopiable_pending);
> > +    bool (*can_postcopy)(void *opaque);
> >      LoadStateHandler *load_state;
> >  } SaveVMHandlers;
> >  
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index e83bf80..5f518b3 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -111,7 +111,9 @@ void qemu_savevm_state_header(QEMUFile *f);
> >  int qemu_savevm_state_iterate(QEMUFile *f);
> >  void qemu_savevm_state_complete(QEMUFile *f);
> >  void qemu_savevm_state_cancel(void);
> > -uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
> > +void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
> > +                               uint64_t *res_non_postcopiable,
> > +                               uint64_t *res_postcopiable);
> >  void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
> >                                uint16_t len, uint8_t *data);
> >  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> > diff --git a/migration/block.c b/migration/block.c
> > index 0c76106..0f6f209 100644
> > --- a/migration/block.c
> > +++ b/migration/block.c
> > @@ -754,7 +754,9 @@ static int block_save_complete(QEMUFile *f, void *opaque)
> >      return 0;
> >  }
> >  
> > -static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> > +static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> > +                               uint64_t *non_postcopiable_pending,
> > +                               uint64_t *postcopiable_pending)
> >  {
> >      /* Estimate pending number of bytes to send */
> >      uint64_t pending;
> > @@ -773,7 +775,8 @@ static uint64_t block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
> >      qemu_mutex_unlock_iothread();
> >  
> >      DPRINTF("Enter save live pending  %" PRIu64 "\n", pending);
> > -    return pending;
> > +    *non_postcopiable_pending = pending;
> > +    *postcopiable_pending = 0;
> >  }
> >  
> >  static int block_load(QEMUFile *f, void *opaque, int version_id)
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 2e6adca..a4fc7d7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -868,8 +868,13 @@ static void *migration_thread(void *opaque)
> >          uint64_t pending_size;
> >  
> >          if (!qemu_file_rate_limit(s->file)) {
> > -            pending_size = qemu_savevm_state_pending(s->file, max_size);
> > -            trace_migrate_pending(pending_size, max_size);
> > +            uint64_t pend_post, pend_nonpost;
> > +
> > +            qemu_savevm_state_pending(s->file, max_size, &pend_nonpost,
> > +                                      &pend_post);
> > +            pending_size = pend_nonpost + pend_post;
> > +            trace_migrate_pending(pending_size, max_size,
> > +                                  pend_post, pend_nonpost);
> >              if (pending_size && pending_size >= max_size) {
> >                  qemu_savevm_state_iterate(s->file);
> >              } else {
> > diff --git a/savevm.c b/savevm.c
> > index df48ba8..e301a0a 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -944,10 +944,20 @@ void qemu_savevm_state_complete(QEMUFile *f)
> >      qemu_fflush(f);
> >  }
> >  
> > -uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
> > +/* Give an estimate of the amount left to be transferred,
> > + * the result is split into the amount for units that can and
> > + * for units that can't do postcopy.
> > + */
> > +void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
> > +                               uint64_t *res_non_postcopiable,
> > +                               uint64_t *res_postcopiable)
> >  {
> >      SaveStateEntry *se;
> > -    uint64_t ret = 0;
> > +    uint64_t tmp_non_postcopiable, tmp_postcopiable;
> > +
> > +    *res_non_postcopiable = 0;
> > +    *res_postcopiable = 0;
> > +
> >  
> >      QTAILQ_FOREACH(se, &savevm_handlers, entry) {
> >          if (!se->ops || !se->ops->save_live_pending) {
> > @@ -958,9 +968,12 @@ uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size)
> >                  continue;
> >              }
> >          }
> > -        ret += se->ops->save_live_pending(f, se->opaque, max_size);
> > +        se->ops->save_live_pending(f, se->opaque, max_size,
> > +                                   &tmp_non_postcopiable, &tmp_postcopiable);
> > +
> > +        *res_postcopiable += tmp_postcopiable;
> > +        *res_non_postcopiable += tmp_non_postcopiable;
> >      }
> > -    return ret;
> >  }
> >  
> >  void qemu_savevm_state_cancel(void)
> > diff --git a/trace-events b/trace-events
> > index cbf995c..83312b6 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1400,7 +1400,7 @@ migrate_fd_cleanup(void) ""
> >  migrate_fd_cleanup_src_rp(void) ""
> >  migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> > -migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
> > +migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> >  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> >  open_outgoing_return_path(void) ""
> >  open_outgoing_return_path_continue(void) ""
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-13  1:23   ` David Gibson
@ 2015-03-13 10:41     ` Dr. David Alan Gilbert
  2015-03-16  6:22       ` David Gibson
  2015-03-30  8:14       ` Paolo Bonzini
  0 siblings, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 10:41 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:45PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Provide a check to see if the OS we're running on has all the bits
> > needed for postcopy.
> > 
> > Creates postcopy-ram.c which will get most of the other helpers we need.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/postcopy-ram.h |  19 +++++
> >  migration/Makefile.objs          |   2 +-
> >  migration/postcopy-ram.c         | 161 +++++++++++++++++++++++++++++++++++++++
> >  savevm.c                         |   5 ++
> >  4 files changed, 186 insertions(+), 1 deletion(-)
> >  create mode 100644 include/migration/postcopy-ram.h
> >  create mode 100644 migration/postcopy-ram.c
> > 
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > new file mode 100644
> > index 0000000..d81934f
> > --- /dev/null
> > +++ b/include/migration/postcopy-ram.h
> > @@ -0,0 +1,19 @@
> > +/*
> > + * Postcopy migration for RAM
> > + *
> > + * Copyright 2013 Red Hat, Inc. and/or its affiliates
> > + *
> > + * Authors:
> > + *  Dave Gilbert  <dgilbert@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +#ifndef QEMU_POSTCOPY_RAM_H
> > +#define QEMU_POSTCOPY_RAM_H
> > +
> > +/* Return true if the host supports everything we need to do postcopy-ram */
> > +bool postcopy_ram_supported_by_host(void);
> > +
> > +#endif
> > diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> > index d929e96..0cac6d7 100644
> > --- a/migration/Makefile.objs
> > +++ b/migration/Makefile.objs
> > @@ -1,7 +1,7 @@
> >  common-obj-y += migration.o tcp.o
> >  common-obj-y += vmstate.o
> >  common-obj-y += qemu-file.o qemu-file-buf.o qemu-file-unix.o qemu-file-stdio.o
> > -common-obj-y += xbzrle.o
> > +common-obj-y += xbzrle.o postcopy-ram.o
> >  
> >  common-obj-$(CONFIG_RDMA) += rdma.o
> >  common-obj-$(CONFIG_POSIX) += exec.o unix.o fd.o
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > new file mode 100644
> > index 0000000..a0e20b2
> > --- /dev/null
> > +++ b/migration/postcopy-ram.c
> > @@ -0,0 +1,161 @@
> > +/*
> > + * Postcopy migration for RAM
> > + *
> > + * Copyright 2013-2014 Red Hat, Inc. and/or its affiliates
> > + *
> > + * Authors:
> > + *  Dave Gilbert  <dgilbert@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +/*
> > + * Postcopy is a migration technique where the execution flips from the
> > + * source to the destination before all the data has been copied.
> > + */
> > +
> > +#include <glib.h>
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +
> > +#include "qemu-common.h"
> > +#include "migration/migration.h"
> > +#include "migration/postcopy-ram.h"
> > +#include "sysemu/sysemu.h"
> > +#include "qemu/error-report.h"
> > +#include "trace.h"
> > +
> > +/* Postcopy needs to detect accesses to pages that haven't yet been copied
> > + * across, and efficiently map new pages in, the techniques for doing this
> > + * are target OS specific.
> > + */
> > +#if defined(__linux__)
> > +
> > +#include <sys/mman.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/types.h>
> > +#include <asm/types.h> /* for __u64 */
> > +#include <linux/userfaultfd.h>
> > +
> > +#ifdef HOST_X86_64
> > +#ifndef __NR_userfaultfd
> > +#define __NR_userfaultfd 323
> 
> Sholdn't this come from the kernel headers imported in the previous
> patch?  Rather than having an arch-specific hack.

The header, like the rest of the kernel headers, just provides
the constant and structure definitions for the call; the syscall numbers
come from arch-specific headers.  I guess in the final version I wouldn't
need this at all since it'll come from the system headers; but what's
the right way to put this in for new syscalls?

> > +#endif
> > +#endif
> > +
> > +#endif
> > +
> > +#if defined(__linux__) && defined(__NR_userfaultfd)
> > +
> > +static bool ufd_version_check(int ufd)
> > +{
> > +    struct uffdio_api api_struct;
> > +    uint64_t feature_mask;
> > +
> > +    api_struct.api = UFFD_API;
> > +    if (ioctl(ufd, UFFDIO_API, &api_struct)) {
> > +        perror("postcopy_ram_supported_by_host: UFFDIO_API failed");
> 
> This should be error_report() not, perror(), to match qemu
> conventions, shouldn't it?

Thanks; I've done all of the perrors.

> > +        return false;
> > +    }
> > +
> > +    feature_mask = (__u64)1 << _UFFDIO_REGISTER |
> > +                   (__u64)1 << _UFFDIO_UNREGISTER;
> > +    if ((api_struct.ioctls & feature_mask) != feature_mask) {
> > +        error_report("Missing userfault features: %" PRIu64,
> > +                     (uint64_t)(~api_struct.ioctls & feature_mask));
> > +        return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +bool postcopy_ram_supported_by_host(void)
> > +{
> > +    long pagesize = getpagesize();
> > +    int ufd = -1;
> > +    bool ret = false; /* Error unless we change it */
> > +    void *testarea = NULL;
> > +    struct uffdio_register reg_struct;
> > +    struct uffdio_range range_struct;
> > +    uint64_t feature_mask;
> > +
> > +    if ((1ul << qemu_target_page_bits()) > pagesize) {
> > +        /* The PMI code doesn't yet deal with TPS>HPS */
> > +        error_report("Target page size bigger than host page size");
> > +        goto out;
> > +    }
> > +
> > +    ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
> > +    if (ufd == -1) {
> > +        perror("postcopy_ram_supported_by_host: userfaultfd not available");
> 
> And here as well?  And several places below.
> 
> > +        goto out;
> > +    }
> > +
> > +    /* Version and features check */
> > +    if (!ufd_version_check(ufd)) {
> > +        goto out;
> > +    }
> > +
> > +    /*
> > +     *  We need to check that the ops we need are supported on anon memory
> > +     *  To do that we need to register a chunk and see the flags that
> > +     *  are returned.
> > +     */
> > +    testarea = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> > +                                    MAP_ANONYMOUS, -1, 0);
> > +    if (!testarea) {
> 
> This should be (testarea == MAP_FAILED).  Otherwise mmap() failures
> will always trip the assert below.

Thanks; fixed.

> > +        perror("postcopy_ram_supported_by_host: Failed to map test area");
> > +        goto out;
> > +    }
> > +    g_assert(((size_t)testarea & (pagesize-1)) == 0);
> > +
> > +    reg_struct.range.start = (uint64_t)(uintptr_t)testarea;
> > +    reg_struct.range.len = (uint64_t)pagesize;
> > +    reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
> > +
> > +    if (ioctl(ufd, UFFDIO_REGISTER, &reg_struct)) {
> > +        perror("postcopy_ram_supported_by_host userfault register");
> > +        goto out;
> > +    }
> > +
> > +    range_struct.start = (uint64_t)(uintptr_t)testarea;
> > +    range_struct.len = (uint64_t)pagesize;
> 
> I don't think you need the (uint64_t) casts (though you do need the
> uintptr_t cast).  I think the assignment will do an implicit
> conversion without probvlems.

Yes, you're right - they're gone.

> > +    if (ioctl(ufd, UFFDIO_UNREGISTER, &range_struct)) {
> > +        perror("postcopy_ram_supported_by_host userfault unregister");
> > +        goto out;
> > +    }
> > +
> > +    feature_mask = (__u64)1 << _UFFDIO_WAKE |
> > +                   (__u64)1 << _UFFDIO_COPY |
> > +                   (__u64)1 << _UFFDIO_ZEROPAGE;
> > +    if ((reg_struct.ioctls & feature_mask) != feature_mask) {
> > +        error_report("Missing userfault map features: %" PRIu64,
> 
> I'm guessing you want PRIx64, in order to make the feature mask at
> least semi-readable.

Yes, thanks - done.

> > +                     (uint64_t)(~reg_struct.ioctls & feature_mask));
> > +        goto out;
> > +    }
> > +
> > +    /* Success! */
> > +    ret = true;
> > +out:
> > +    if (testarea) {
> > +        munmap(testarea, pagesize);
> > +    }
> > +    if (ufd != -1) {
> > +        close(ufd);
> > +    }
> > +    return ret;
> > +}
> > +
> > +#else
> > +/* No target OS support, stubs just fail */
> > +
> > +bool postcopy_ram_supported_by_host(void)
> > +{
> > +    error_report("%s: No OS support", __func__);
> > +    return false;
> > +}
> > +
> > +#endif
> > +
> > diff --git a/savevm.c b/savevm.c
> > index e301a0a..2ea4c76 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -33,6 +33,7 @@
> >  #include "qemu/timer.h"
> >  #include "audio/audio.h"
> >  #include "migration/migration.h"
> > +#include "migration/postcopy-ram.h"
> >  #include "qemu/sockets.h"
> >  #include "qemu/queue.h"
> >  #include "sysemu/cpus.h"
> > @@ -1109,6 +1110,10 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> >          return -1;
> >      }
> >  
> > +    if (!postcopy_ram_supported_by_host()) {
> > +        return -1;
> > +    }
> > +
> >      if (remote_hps != getpagesize())  {
> >          /*
> >           * Some combinations of mismatch are probably possible but it gets
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-13  1:26   ` David Gibson
@ 2015-03-13 11:19     ` Dr. David Alan Gilbert
  2015-03-16  6:23       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 11:19 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Once postcopy is enabled (with migrate_set_capability), the migration
> > will still start on precopy mode.  To cause a transition into postcopy
> > the:
> > 
> >   migrate_start_postcopy
> > 
> > command must be issued.  Postcopy will start sometime after this
> > (when it's next checked in the migration loop).
> > 
> > Issuing the command before migration has started will error,
> > and issuing after it has finished is ignored.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Eric Blake <eblake@redhat.com>
> > ---
> >  hmp-commands.hx               | 15 +++++++++++++++
> >  hmp.c                         |  7 +++++++
> >  hmp.h                         |  1 +
> >  include/migration/migration.h |  3 +++
> >  migration/migration.c         | 22 ++++++++++++++++++++++
> >  qapi-schema.json              |  8 ++++++++
> >  qmp-commands.hx               | 19 +++++++++++++++++++
> >  7 files changed, 75 insertions(+)
> > 
> > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > index e37bc8b..03b8b78 100644
> > --- a/hmp-commands.hx
> > +++ b/hmp-commands.hx
> > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> >  ETEXI
> >  
> >      {
> > +        .name       = "migrate_start_postcopy",
> > +        .args_type  = "",
> > +        .params     = "",
> > +        .help       = "Switch migration to postcopy mode",
> > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > +    },
> > +
> > +STEXI
> > +@item migrate_start_postcopy
> > +@findex migrate_start_postcopy
> > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > +migration (or once already in postcopy).
> > +ETEXI
> > +
> > +    {
> >          .name       = "client_migrate_info",
> >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> >          .params     = "protocol hostname port tls-port cert-subject",
> > diff --git a/hmp.c b/hmp.c
> > index b47f331..df9736c 100644
> > --- a/hmp.c
> > +++ b/hmp.c
> > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> >      }
> >  }
> >  
> > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > +{
> > +    Error *err = NULL;
> > +    qmp_migrate_start_postcopy(&err);
> > +    hmp_handle_error(mon, &err);
> > +}
> > +
> >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> >  {
> >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > diff --git a/hmp.h b/hmp.h
> > index 4bb5dca..da1334f 100644
> > --- a/hmp.h
> > +++ b/hmp.h
> > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index e6a814a..293c83e 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -104,6 +104,9 @@ struct MigrationState
> >      int64_t xbzrle_cache_size;
> >      int64_t setup_time;
> >      int64_t dirty_sync_count;
> > +
> > +    /* Flag set once the migration has been asked to enter postcopy */
> > +    bool start_postcopy;
> >  };
> >  
> >  void process_incoming_migration(QEMUFile *f);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index a4fc7d7..43ca656 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> >      }
> >  }
> >  
> > +void qmp_migrate_start_postcopy(Error **errp)
> > +{
> > +    MigrationState *s = migrate_get_current();
> > +
> > +    if (!migrate_postcopy_ram()) {
> > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > +                         " the start of migration");
> > +        return;
> > +    }
> > +
> > +    if (s->state == MIG_STATE_NONE) {
> > +        error_setg(errp, "Postcopy must be started after migration has been"
> > +                         " started");
> > +        return;
> > +    }
> > +    /*
> > +     * we don't error if migration has finished since that would be racy
> > +     * with issuing this command.
> > +     */
> > +    atomic_set(&s->start_postcopy, true);
> 
> Why atomic_set?

It's being read by the migration thread; this write is happening in the main thread.

There's no strict ordering requirement or anything.
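
The pattern is just (illustrative sketch):

    /* Main/monitor thread, in qmp_migrate_start_postcopy(): */
    atomic_set(&s->start_postcopy, true);

    /* Migration thread, in its iteration loop: */
    if (migrate_postcopy_ram() && atomic_read(&s->start_postcopy)) {
        /* start the transition into postcopy */
    }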

Dave

> 
> > +}
> > +
> >  /* shared migration helpers */
> >  
> >  static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> > diff --git a/qapi-schema.json b/qapi-schema.json
> > index a8af1cb..7ff61e9 100644
> > --- a/qapi-schema.json
> > +++ b/qapi-schema.json
> > @@ -542,6 +542,14 @@
> >  { 'command': 'query-migrate-capabilities', 'returns':   ['MigrationCapabilityStatus']}
> >  
> >  ##
> > +# @migrate-start-postcopy
> > +#
> > +# Switch migration to postcopy mode
> > +#
> > +# Since: 2.3
> > +{ 'command': 'migrate-start-postcopy' }
> > +
> > +##
> >  # @MouseInfo:
> >  #
> >  # Information about a mouse device.
> > diff --git a/qmp-commands.hx b/qmp-commands.hx
> > index a85d847..25d2208 100644
> > --- a/qmp-commands.hx
> > +++ b/qmp-commands.hx
> > @@ -685,6 +685,25 @@ Example:
> >  
> >  EQMP
> >      {
> > +        .name       = "migrate-start-postcopy",
> > +        .args_type  = "",
> > +        .mhandler.cmd_new = qmp_marshal_input_migrate_start_postcopy,
> > +    },
> > +
> > +SQMP
> > +migrate-start-postcopy
> > +----------------------
> > +
> > +Switch an in-progress migration to postcopy mode. Ignored after the end of
> > +migration (or once already in postcopy).
> > +
> > +Example:
> > +-> { "execute": "migrate-start-postcopy" }
> > +<- { "return": {} }
> > +
> > +EQMP
> > +
> > +    {
> >          .name       = "query-migrate-cache-size",
> >          .args_type  = "",
> >          .mhandler.cmd_new = qmp_marshal_input_query_migrate_cache_size,
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream
  2015-03-13  0:55   ` David Gibson
@ 2015-03-13 11:51     ` Dr. David Alan Gilbert
  2015-03-16  6:16       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 11:51 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:41PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > MIG_CMD_PACKAGED is a migration command that allows a chunk
> > of migration stream to be sent in one go, and be received by
> > a separate instance of the loadvm loop while not interacting
> > with the migration stream.
> 
> Hrm.  I'd be more comfortable if the semantics of CMD_PACKAGED were
> defined in terms of visible effects on the other end, rather than in
> terms of how it's implemented internally.
> 
> > This is used by postcopy to load device state (from the package)
> > while loading memory pages from the main stream.
> 
> Which makes the above paragraph a bit misleading - the whole point
> here is that loading the package data *does* interact with the
> migration stream - just that it's the migration stream after the end
> of the package.

Hmm, how about:


MIG_CMD_PACKAGED is a migration command that wraps a chunk of migration
stream inside a package whose length can be determined purely by reading
its header.  The destination guarantees that the whole MIG_CMD_PACKAGED is
read off the stream prior to parsing the contents.

This is used by postcopy to load device state (from the package)
while leaving the main stream free to receive memory pages.


> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/sysemu/sysemu.h |  4 +++
> >  savevm.c                | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  trace-events            |  4 +++
> >  3 files changed, 90 insertions(+)
> > 
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index d6a6d51..e83bf80 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -87,6 +87,7 @@ enum qemu_vm_cmd {
> >      MIG_CMD_INVALID = 0,       /* Must be 0 */
> >      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
> >      MIG_CMD_PING,              /* Request a PONG on the RP */
> > +    MIG_CMD_PACKAGED,          /* Send a wrapped stream within this stream */
> >  
> >      MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
> >                                        warn we might want to do PC */
> > @@ -101,6 +102,8 @@ enum qemu_vm_cmd {
> >  
> >  };
> >  
> > +#define MAX_VM_CMD_PACKAGED_SIZE (1ul << 24)
> > +
> >  bool qemu_savevm_state_blocked(Error **errp);
> >  void qemu_savevm_state_begin(QEMUFile *f,
> >                               const MigrationParams *params);
> > @@ -113,6 +116,7 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
> >                                uint16_t len, uint8_t *data);
> >  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> >  void qemu_savevm_send_open_return_path(QEMUFile *f);
> > +void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb);
> >  void qemu_savevm_send_postcopy_advise(QEMUFile *f);
> >  void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> >  void qemu_savevm_send_postcopy_run(QEMUFile *f);
> > diff --git a/savevm.c b/savevm.c
> > index e31ccb0..f65bff3 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -636,6 +636,38 @@ void qemu_savevm_send_open_return_path(QEMUFile *f)
> >      qemu_savevm_command_send(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
> >  }
> >  
> > +/* We have a buffer of data to send; we don't want that all to be loaded
> > + * by the command itself, so the command contains just the length of the
> > + * extra buffer that we then send straight after it.
> > + * TODO: Must be a better way to organise that
> > + */
> > +void qemu_savevm_send_packaged(QEMUFile *f, const QEMUSizedBuffer *qsb)
> > +{
> > +    size_t cur_iov;
> > +    size_t len = qsb_get_length(qsb);
> > +    uint32_t tmp;
> > +
> > +    tmp = cpu_to_be32(len);
> > +
> > +    trace_qemu_savevm_send_packaged();
> > +    qemu_savevm_command_send(f, MIG_CMD_PACKAGED, 4, (uint8_t *)&tmp);
> > +
> > +    /* all the data follows (concatenating the iov's) */
> > +    for (cur_iov = 0; cur_iov < qsb->n_iov; cur_iov++) {
> > +        /* The iov entries are partially filled */
> > +        size_t towrite = (qsb->iov[cur_iov].iov_len > len) ?
> > +                              len :
> > +                              qsb->iov[cur_iov].iov_len;
> > +        len -= towrite;
> > +
> > +        if (!towrite) {
> > +            break;
> > +        }
> > +
> > +        qemu_put_buffer(f, qsb->iov[cur_iov].iov_base, towrite);
> > +    }
> > +}
> > +
> >  /* Send prior to any postcopy transfer */
> >  void qemu_savevm_send_postcopy_advise(QEMUFile *f)
> >  {
> > @@ -1265,6 +1297,48 @@ static int loadvm_process_command_simple_lencheck(const char *name,
> >      return 0;
> >  }
> >  
> > +/* Immediately following this command is a blob of data containing an embedded
> > + * chunk of migration stream; read it and load it.
> > + */
> > +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> > +                                      uint32_t length)
> > +{
> > +    int ret;
> > +    uint8_t *buffer;
> > +    QEMUSizedBuffer *qsb;
> > +
> > +    trace_loadvm_handle_cmd_packaged(length);
> > +
> > +    if (length > MAX_VM_CMD_PACKAGED_SIZE) {
> > +        error_report("Unreasonably large packaged state: %u", length);
> > +        return -1;
> 
> It would be a good idea to check this on the send side as well as
> receive, wouldn't it?

Yes, I had been doing that in the postcopy code that called the
send code; but I've now moved it down into savevm_send_packaged.
Thanks.

Dave

> > +    }
> > +    buffer = g_malloc0(length);
> > +    ret = qemu_get_buffer(mis->file, buffer, (int)length);
> > +    if (ret != length) {
> > +        g_free(buffer);
> > +        error_report("CMD_PACKAGED: Buffer receive fail ret=%d length=%d\n",
> > +                ret, length);
> > +        return (ret < 0) ? ret : -EAGAIN;
> > +    }
> > +    trace_loadvm_handle_cmd_packaged_received(ret);
> > +
> > +    /* Setup a dummy QEMUFile that actually reads from the buffer */
> > +    qsb = qsb_create(buffer, length);
> > +    g_free(buffer); /* Because qsb_create copies */
> > +    if (!qsb) {
> > +        error_report("Unable to create qsb");
> > +    }
> > +    QEMUFile *packf = qemu_bufopen("r", qsb);
> > +
> > +    ret = qemu_loadvm_state_main(packf, mis);
> > +    trace_loadvm_handle_cmd_packaged_main(ret);
> > +    qemu_fclose(packf);
> > +    qsb_free(qsb);
> > +
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Process an incoming 'QEMU_VM_COMMAND'
> >   * negative return on error (will issue error message)
> > @@ -1315,6 +1389,14 @@ static int loadvm_process_command(QEMUFile *f)
> >          migrate_send_rp_pong(mis, tmp32);
> >          break;
> >  
> > +    case MIG_CMD_PACKAGED:
> > +        if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_PACKAGED",
> > +            len, 4)) {
> > +            return -1;
> > +         }
> > +        tmp32 = qemu_get_be32(f);
> > +        return loadvm_handle_cmd_packaged(mis, tmp32);
> > +
> >      case MIG_CMD_POSTCOPY_ADVISE:
> >          if (loadvm_process_command_simple_lencheck("CMD_POSTCOPY_ADVISE",
> >                                                     len, 16)) {
> > diff --git a/trace-events b/trace-events
> > index 050f553..cbf995c 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1171,6 +1171,10 @@ qemu_loadvm_state_main(void) ""
> >  qemu_loadvm_state_main_quit_parent(void) ""
> >  qemu_loadvm_state_post_main(int ret) "%d"
> >  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> > +qemu_savevm_send_packaged(void) ""
> > +loadvm_handle_cmd_packaged(unsigned int length) "%u"
> > +loadvm_handle_cmd_packaged_main(int ret) "%d"
> > +loadvm_handle_cmd_packaged_received(int ret) "%d"
> >  loadvm_postcopy_handle_advise(void) ""
> >  loadvm_postcopy_handle_end(void) ""
> >  loadvm_postcopy_handle_listen(void) ""
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes
  2015-03-13  4:58   ` David Gibson
@ 2015-03-13 12:25     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 12:25 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:48PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > When postcopy calls qemu_savevm_state_complete it's not really
> > the end of migration, so skip:
> 
> Given that, maybe the name should change..

The name reflects that it calls the save_live_complete method on each
device, so if we wanted to remove the 'complete' from the name we'd
probably want to change the method name everywhere; and anyway
it does complete most devices; the only exceptions are devices
that are postcopiable.
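
As a sketch of the split I mean (illustrative only, not the patch itself;
the handler-list name and the can_postcopy() signature are assumptions taken
from the rest of the series):

    /* Sketch only: complete everything now except devices that will carry on
     * in postcopy; those get their save_live_complete call later, from
     * qemu_savevm_state_postcopy_complete(). */
    static void complete_non_postcopiable(QEMUFile *f, bool in_postcopy)
    {
        SaveStateEntry *se;

        QTAILQ_FOREACH(se, &savevm_handlers, entry) {
            if (!se->ops || !se->ops->save_live_complete) {
                continue;
            }
            if (in_postcopy && se->ops->can_postcopy &&
                se->ops->can_postcopy(se->opaque)) {
                continue;   /* still live; finished by _postcopy_complete */
            }
            se->ops->save_live_complete(f, se->opaque);
        }
    }
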

> >    a) Finishing postcopiable iterative devices - they'll carry on
> >    b) The termination byte on the end of the stream.
> > 
> > We then also add:
> >   qemu_savevm_state_postcopy_complete
> > which is called at the end of a postcopy migration to call the
> > complete methods on devices skipped in the _complete call.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Otherwise,
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Thanks.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram.
  2015-03-12  6:14   ` David Gibson
@ 2015-03-13 12:58     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 12:58 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:39PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> This absolutely needs a commit message.  I shouldn't have to look at
> the code to find out what the presence of this capability asserts, and
> from where to where it's communicating that information.

OK, how about:

The 'postcopy ram' flag allows postcopy migration of RAM;
note that the migration starts off in precopy mode until
postcopy mode is triggered (see the migrate_start_postcopy
patch later in the series).


> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Eric Blake <eblake@redhat.com>
> > ---
> >  include/migration/migration.h | 1 +
> >  migration/migration.c         | 9 +++++++++
> >  qapi-schema.json              | 7 ++++++-
> >  3 files changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 751caa0..f94af5b 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -177,6 +177,7 @@ void migrate_add_blocker(Error *reason);
> >   */
> >  void migrate_del_blocker(Error *reason);
> >  
> > +bool migrate_postcopy_ram(void);
> >  bool migrate_rdma_pin_all(void);
> >  bool migrate_zero_blocks(void);
> >  
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4592060..434864a 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -663,6 +663,15 @@ bool migrate_rdma_pin_all(void)
> >      return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
> >  }
> >  
> > +bool migrate_postcopy_ram(void)
> > +{
> > +    MigrationState *s;
> > +
> > +    s = migrate_get_current();
> > +
> > +    return s->enabled_capabilities[MIGRATION_CAPABILITY_X_POSTCOPY_RAM];
> 
> As an asside, I'm assuming you'll get rid of these "x-" prefixes
> before you post a series intended for final inclusion?

I was going to do that as a final patch that removed the x-.

Dave

> 
> > +}
> > +
> >  bool migrate_auto_converge(void)
> >  {
> >      MigrationState *s;
> > diff --git a/qapi-schema.json b/qapi-schema.json
> > index e16f8eb..a8af1cb 100644
> > --- a/qapi-schema.json
> > +++ b/qapi-schema.json
> > @@ -494,10 +494,15 @@
> >  # @auto-converge: If enabled, QEMU will automatically throttle down the guest
> >  #          to speed up convergence of RAM migration. (since 1.6)
> >  #
> > +# @x-postcopy-ram: Start executing on the migration target before all of RAM has
> > +#          been migrated, pulling the remaining pages along as needed. NOTE: If
> > +#          the migration fails during postcopy the VM will fail.  (since 2.3)
> > +#
> >  # Since: 1.2
> >  ##
> >  { 'enum': 'MigrationCapability',
> > -  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks'] }
> > +  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> > +           'x-postcopy-ram'] }
> >  
> >  ##
> >  # @MigrationCapabilityStatus
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
  2015-03-10  1:04       ` David Gibson
@ 2015-03-13 13:07         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 13:07 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Thu, Mar 05, 2015 at 09:21:39AM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:24PM +0000, Dr. David Alan Gilbert
> > (git) wrote:
> [snip]
> > > > +=== Enabling postcopy ===
> > > > +
> > > > +To enable postcopy (prior to the start of migration):
> > > > +
> > > > +migrate_set_capability x-postcopy-ram on
> > > > +
> > > > +The migration will still start in precopy mode, however issuing:
> > > > +
> > > > +migrate_start_postcopy
> > > > +
> > > > +will now cause the transition from precopy to postcopy.
> > > > +It can be issued immediately after migration is started or any
> > > > +time later on.  Issuing it after the end of a migration is harmless.
> > > 
> > > It's not quite clear to me what this means.  Does
> > > "migrate_start_postcopy" mean it will immediately transfer execution
> > > and transfer any remaining pages postcopy, or does it just mean it
> > > will start postcopying once the remaining data to transfer is small
> > > enough?
> > 
> > Yes; it will flip into postcopy soon after issuing that command irrespective
> > of the amount of data remaining.
> > 
> > > What's the reason for this rather awkward two stage activation of
> > > postcopy?
> > 
> > We need to keep track of the pages that are received during the precopy phase,
> > and do some madvise and other setups on the destination RAM area before precopy
> > starts; and so we need to know we might want to do postcopy - so we need
> > to be told early.  In the earliest posted version of my patches I had a
> > time-limit setting and after the time limit expired QEMU would switch into
> > the second phase of postcopy itself, but Paolo suggested the migrate_start_postcopy:
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2014-07/msg00943.html
> > 
> > and it works out simpler anyway.
> 
> Ok, that makes sense.
> 
> > > > +=== Postcopy device transfer ===
> > > > +
> > > > +Loading of device data may cause the device emulation to access guest RAM
> > > > +that may trigger faults that have to be resolved by the source, as such
> > > > +the migration stream has to be able to respond with page data *during* the
> > > > +device load, and hence the device data has to be read from the stream completely
> > > > +before the device load begins to free the stream up.  This is achieved by
> > > > +'packaging' the device data into a blob that's read in one go.
> > > > +
> > > > +Source behaviour
> > > > +
> > > > +Until postcopy is entered the migration stream is identical to normal
> > > > +precopy, except for the addition of a 'postcopy advise' command at
> > > > +the beginning, to tell the destination that postcopy might happen.
> > > > +When postcopy starts the source sends the page discard data and then
> > > > +forms the 'package' containing:
> > > > +
> > > > +   Command: 'postcopy ram listen'
> > > > +   The device state
> > > > +      A series of sections, identical to the precopy streams device state stream
> > > > +      containing everything except postcopiable devices (i.e. RAM)
> > > > +   Command: 'postcopy ram run'
> > > > +
> > > > +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> > > > +contents are formatted in the same way as the main migration stream.
> > > 
> > > It seems to me the "ram listen", "ram run" and CMD_PACKAGED really
> > > have to be used in conjuction this way, they don't really have any use
> > > on their own.  So why not make it all CMD_POSTCOPY_TRANSITION and have
> > > the "listen" and "run" take effect implicitly at the beginning and end
> > > of the device data.
> > 
> > CMD_PACKAGED seems like something that was generally useful; it's fairly
> > complicated on its own and so it seemed best to keep it separate.
> 
> And can you actually think of another use case for it?
> 
> The thing that bothers me is that the "listen" and "run" operations
> will not work correctly anywhere other than at the beginning and end
> of the packaged blob.

It feels similar to the packaged blobs that checkpointing schemes
like COLO and microcheckpointing use; although they seem to craft their own wire
protocol rather than sticking with a migration protocol.
What they do in terms of controlling the CPU etc is certainly different
(so the RUN/listen stuff is different).

Dave

> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets
  2015-03-10  2:49   ` David Gibson
@ 2015-03-13 13:14     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 13:14 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:30PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Postcopy needs a method to send messages from the destination back to
> > the source, this is the 'return path'.
> > 
> > Wire it up for 'socket' QEMUFile's using a dup'd fd.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/qemu-file.h  |  7 +++++
> >  migration/qemu-file-internal.h |  2 ++
> >  migration/qemu-file-unix.c     | 58 +++++++++++++++++++++++++++++++++++-------
> >  migration/qemu-file.c          | 12 +++++++++
> >  4 files changed, 70 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
> > index 6ae0b03..3c38963 100644
> > --- a/include/migration/qemu-file.h
> > +++ b/include/migration/qemu-file.h
> > @@ -85,6 +85,11 @@ typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
> >                                 int *bytes_sent);
> >  
> >  /*
> > + * Return a QEMUFile for comms in the opposite direction
> > + */
> > +typedef QEMUFile *(QEMURetPathFunc)(void *opaque);
> > +
> > +/*
> >   * Stop any read or write (depending on flags) on the underlying
> >   * transport on the QEMUFile.
> >   * Existing blocking reads/writes must be woken
> > @@ -102,6 +107,7 @@ typedef struct QEMUFileOps {
> >      QEMURamHookFunc *after_ram_iterate;
> >      QEMURamHookFunc *hook_ram_load;
> >      QEMURamSaveFunc *save_page;
> > +    QEMURetPathFunc *get_return_path;
> >      QEMUFileShutdownFunc *shut_down;
> >  } QEMUFileOps;
> >  
> > @@ -188,6 +194,7 @@ int64_t qemu_file_get_rate_limit(QEMUFile *f);
> >  int qemu_file_get_error(QEMUFile *f);
> >  void qemu_file_set_error(QEMUFile *f, int ret);
> >  int qemu_file_shutdown(QEMUFile *f);
> > +QEMUFile *qemu_file_get_return_path(QEMUFile *f);
> >  void qemu_fflush(QEMUFile *f);
> >  
> >  static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
> > diff --git a/migration/qemu-file-internal.h b/migration/qemu-file-internal.h
> > index d95e853..a39b8e3 100644
> > --- a/migration/qemu-file-internal.h
> > +++ b/migration/qemu-file-internal.h
> > @@ -48,6 +48,8 @@ struct QEMUFile {
> >      unsigned int iovcnt;
> >  
> >      int last_error;
> > +
> > +    struct QEMUFile *return_path;
> 
> AFAICT, the only thing this field is used for is an assert, which
> seems a bit pointless.  I'd suggest either getting rid of it, or

Done; it's gone.

Dave

> make qemu_file_get_return_path() safely idempotent by having it only
> call the FileOps pointer if QEMUFile::return_path is non-NULL,
> otherwise just return the existing return_path.
> 
> Setting the field probably belongs better in the wrapper than in the
> socket specific callback, too, since there's nothing inherently
> related to the socket implementation about it.
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-03-13  5:19   ` David Gibson
@ 2015-03-13 13:47     ` Dr. David Alan Gilbert
  2015-03-16  6:30       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-13 13:47 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:49PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The PMI holds the state of each page on the incoming side,
> > so that we can tell if the page is missing, already received
> > or there is a request outstanding for it.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h    |  18 ++++
> >  include/migration/postcopy-ram.h |  12 +++
> >  include/qemu/typedefs.h          |   1 +
> >  migration/postcopy-ram.c         | 223 +++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 254 insertions(+)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index b44b9b2..86200b9 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -48,6 +48,23 @@ enum mig_rpcomm_cmd {
> >      MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> >  };
> >  
> > +/* Postcopy page-map-incoming - data about each page on the inbound side */
> > +typedef enum {
> > +   POSTCOPY_PMI_MISSING    = 0, /* page hasn't yet been received */
> 
> This appears to be a 3 space indent instead of the usual 4.

Thanks; (wth didn't scripts/checkpatch spot that?)

> > +   POSTCOPY_PMI_REQUESTED  = 1, /* Kernel asked for a page, not yet got it */
> > +   POSTCOPY_PMI_RECEIVED   = 2, /* We've got the page */
> > +} PostcopyPMIState;
> 
> TBH, I'm not sure this enum actually helps anything.  I wonder if
> things might actually be cleared if you simply treat the received and
> requested bitmaps separately.
> 
> > +struct PostcopyPMI {
> > +    QemuMutex      mutex;
> > +    unsigned long *state0;        /* Together with state1 form a */
> > +    unsigned long *state1;        /* PostcopyPMIState */
> 
> The comments on the lines above don't appear to shed any light on anything.

Hmm; so most of the comments here come down to how this pair is represented.
I'd previously had a 'received' and 'requested' array pair, and only used
!received, !requested
received, !requested
!received, requested

but in one of the intermediate changes (that never survived) I needed a 4th state,
and while I did have that 4th state it ended up encoded as received && requested;
that wasn't actually what my 4th state meant, and so I thought it best
to get away from each of the bits meaning something and
move towards just treating it as a state, with the encoding
to the two state bits done in as few places as possible.
Really what I want is a nice dense efficient array of PostcopyPMIState's.
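
To make that concrete, here's a tiny standalone sketch of the encoding I'm
describing (the names are illustrative, it isn't the patch's code): two
parallel bitmaps together hold one 2-bit state per page, and only the get/set
pair knows which bit lives where; everything else deals in the enum.

    #include <limits.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

    typedef enum {
        PMI_MISSING   = 0,   /* page hasn't been received  */
        PMI_REQUESTED = 1,   /* asked for, not yet got it  */
        PMI_RECEIVED  = 2,   /* we've got the page         */
    } PMIState;

    static void pmi_set(unsigned long *b0, unsigned long *b1,
                        size_t idx, PMIState s)
    {
        unsigned long mask = 1ul << (idx % BITS_PER_ULONG);
        size_t word = idx / BITS_PER_ULONG;

        if (s & 1) { b0[word] |= mask; } else { b0[word] &= ~mask; }
        if (s & 2) { b1[word] |= mask; } else { b1[word] &= ~mask; }
    }

    static PMIState pmi_get(const unsigned long *b0, const unsigned long *b1,
                            size_t idx)
    {
        unsigned long mask = 1ul << (idx % BITS_PER_ULONG);
        size_t word = idx / BITS_PER_ULONG;
        bool v0 = b0[word] & mask;
        bool v1 = b1[word] & mask;

        return (PMIState)((v0 ? 1 : 0) | (v1 ? 2 : 0));
    }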

> > +    unsigned long  host_mask;     /* A mask with enough bits set to cover one
> > +                                     host page in the PMI */
> > +    unsigned long  host_bits;     /* The number of bits in the map representing
> > +                                     one host page */
> 
> I find the host_bits name fairly confusing.  Maybe "tp_per_hp"?

Done.

> > +};
> > +
> >  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> >  
> >  typedef enum {
> > @@ -69,6 +86,7 @@ struct MigrationIncomingState {
> >  
> >      QEMUFile *return_path;
> >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> > +    PostcopyPMI    postcopy_pmi;
> >  };
> >  
> >  MigrationIncomingState *migration_incoming_get_current(void);
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > index d81934f..e93ee8a 100644
> > --- a/include/migration/postcopy-ram.h
> > +++ b/include/migration/postcopy-ram.h
> > @@ -13,7 +13,19 @@
> >  #ifndef QEMU_POSTCOPY_RAM_H
> >  #define QEMU_POSTCOPY_RAM_H
> >  
> > +#include "migration/migration.h"
> > +
> >  /* Return true if the host supports everything we need to do postcopy-ram */
> >  bool postcopy_ram_supported_by_host(void);
> >  
> > +/*
> > + * In 'advise' mode record that a page has been received.
> > + */
> > +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> > +                                 size_t bitmap_index);
> > +
> > +void postcopy_pmi_destroy(MigrationIncomingState *mis);
> > +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> > +                                size_t start, size_t npages);
> > +void postcopy_pmi_dump(MigrationIncomingState *mis);
> >  #endif
> > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > index 611db46..924eeb6 100644
> > --- a/include/qemu/typedefs.h
> > +++ b/include/qemu/typedefs.h
> > @@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
> >  typedef struct PCIHostState PCIHostState;
> >  typedef struct PCMCIACardState PCMCIACardState;
> >  typedef struct PixelFormat PixelFormat;
> > +typedef struct PostcopyPMI PostcopyPMI;
> >  typedef struct PropertyInfo PropertyInfo;
> >  typedef struct Property Property;
> >  typedef struct QEMUBH QEMUBH;
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index a0e20b2..4f29055 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -24,6 +24,7 @@
> >  #include "migration/migration.h"
> >  #include "migration/postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> > +#include "qemu/bitmap.h"
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >  
> > @@ -49,6 +50,220 @@
> >  
> >  #if defined(__linux__) && defined(__NR_userfaultfd)
> >  
> > +/* ---------------------------------------------------------------------- */
> > +/* Postcopy pagemap-inbound (pmi) - data structures that record the       */
> > +/* state of each page used by the inbound postcopy                        */
> > +/* It's a pair of bitmaps (of the same structure as the migration bitmaps)*/
> > +/* holding one bit per target-page, although most operations work on host */
> > +/* pages, the exception being a hook that receives incoming pages off the */
> > +/* migration stream which come in a TP at a time, although the source     */
> > +/* _should_ guarantee it sends a sequence of TPs representing HPs during  */
> > +/* the postcopy phase, there is no such guarantee during precopy.  We     */
> > +/* could boil this down to only holding one bit per-host page, but we lose*/
> > +/* sanity checking that we really do get whole host-pages from the source.*/
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static void postcopy_pmi_init(MigrationIncomingState *mis, size_t ram_pages)
> > +{
> > +    unsigned int tpb = qemu_target_page_bits();
> > +    unsigned long host_bits;
> > +
> > +    qemu_mutex_init(&mis->postcopy_pmi.mutex);
> > +    mis->postcopy_pmi.state0 = bitmap_new(ram_pages);
> > +    mis->postcopy_pmi.state1 = bitmap_new(ram_pages);
> > +    bitmap_clear(mis->postcopy_pmi.state0, 0, ram_pages);
> > +    bitmap_clear(mis->postcopy_pmi.state1, 0, ram_pages);
> > +    /*
> > +     * Each bit in the map represents one 'target page' which is no bigger
> > +     * than a host page but can be smaller.  It's useful to have some
> > +     * convenience masks for later
> > +     */
> > +
> > +    /*
> > +     * The number of bits one host page takes up in the bitmap
> > +     * e.g. on a 64k host page, 4k Target page, host_bits=64/4=16
> > +     */
> > +    host_bits = getpagesize() / (1ul << tpb);
> 
> That's equivalent to getpagesize() >> tpb, isn't it?

Yes, fixed.

> > +    assert(is_power_of_2(host_bits));
> > +
> > +    mis->postcopy_pmi.host_bits = host_bits;
> > +
> > +    if (host_bits < BITS_PER_LONG) {
> > +        /* A mask starting at bit 0 containing host_bits continuous set bits */
> > +        mis->postcopy_pmi.host_mask =  (1ul << host_bits) - 1;
> > +    } else {
> > +        /*
> > +         * This is a host where the ratio between host and target pages is
> > +         * bigger than the size of our longs, so we can't make a mask
> > +         * but we are only losing sanity checking if we just check one long's
> > +         * worth of bits.
> > +         */
> > +        mis->postcopy_pmi.host_mask = ~0l;
> > +    }
> > +
> > +
> > +    assert((ram_pages % host_bits) == 0);
> > +}
> > +
> > +void postcopy_pmi_destroy(MigrationIncomingState *mis)
> > +{
> > +    g_free(mis->postcopy_pmi.state0);
> > +    mis->postcopy_pmi.state0 = NULL;
> > +    g_free(mis->postcopy_pmi.state1);
> > +    mis->postcopy_pmi.state1 = NULL;
> > +    qemu_mutex_destroy(&mis->postcopy_pmi.mutex);
> > +}
> > +
> > +/*
> > + * Mark a set of pages in the PMI as being clear; this is used by the discard
> > + * at the start of postcopy, and before the postcopy stream starts.
> > + */
> > +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> > +                                size_t start, size_t npages)
> > +{
> > +    /* Clear to state 0 = missing */
> > +    bitmap_clear(mis->postcopy_pmi.state0, start, npages);
> > +    bitmap_clear(mis->postcopy_pmi.state1, start, npages);
> > +}
> > +
> > +/*
> > + * Test a host-page worth of bits in the map starting at bitmap_index
> > + * The bits should all be consistent
> > + */
> > +static bool test_hpbits(MigrationIncomingState *mis,
> > +                        size_t bitmap_index, unsigned long *map)
> > +{
> > +    long masked;
> > +
> > +    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);
> > +
> > +    masked = (map[BIT_WORD(bitmap_index)] >>
> > +               (bitmap_index % BITS_PER_LONG)) &
> > +             mis->postcopy_pmi.host_mask;
> > +
> > +    assert((masked == 0) || (masked == mis->postcopy_pmi.host_mask));
> > +    return !!masked;
> > +}
> > +
> > +/*
> > + * Set host-page worth of bits in the map starting at bitmap_index
> > + * to the given state
> > + */
> > +static void set_hp(MigrationIncomingState *mis,
> > +                   size_t bitmap_index, PostcopyPMIState state)
> > +{
> > +    long shifted_mask = mis->postcopy_pmi.host_mask <<
> > +                        (bitmap_index % BITS_PER_LONG);
> > +
> > +    assert((bitmap_index & (mis->postcopy_pmi.host_bits-1)) == 0);
> 
> assert(state != 0)?

It could do, although again I'm trying to make this just encode things.

> > +
> > +    if (state & 1) {
> 
> Using the symbolic constants for PostcopyPMIState values here might be
> better.

I was treating this as the thing that encoded/decoded the enum; it
doesn't need to know the meanings of the bits.

> 
> > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
> > +    } else {
> > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > +    }
> > +    if (state & 2) {
> > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
> > +    } else {
> > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > +    }
> > +}
> > +
> > +/*
> > + * Retrieve the state of the given page
> > + * Note: This version for use by callers already holding the lock
> > + */
> > +static PostcopyPMIState postcopy_pmi_get_state_nolock(
> > +                            MigrationIncomingState *mis,
> > +                            size_t bitmap_index)
> > +{
> > +    bool b0, b1;
> > +
> > +    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
> > +    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
> > +
> > +    return (b0 ? 1 : 0) + (b1 ? 2 : 0);
> 
> Ugh.. this is a hidden dependency on the PostcopyPMIState enum
> elements never changing value.  Safer to code it as:
>       if (!b0 && !b1) {
>           return POSTCOPY_PMI_MISSING;
>       } else if (...)
>            ...
> 
> and let gcc sort it out.

Again, I was trying to make this just the interface; so it doesn't
know or care about the enum mapping; we can change the enum mapping to
the bits without changing this function (or the callers) at all.

> > +}
> > +
> > +/* Retrieve the state of the given page */
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
> > +                                               size_t bitmap_index)
> > +{
> > +    PostcopyPMIState ret;
> > +    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
> > +    ret = postcopy_pmi_get_state_nolock(mis, bitmap_index);
> > +    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +/*
> > + * Set the page state to the given state if the previous state was as expected
> > + * Return the actual previous state.
> > + */
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
> > +                                           size_t bitmap_index,
> > +                                           PostcopyPMIState expected_state,
> > +                                           PostcopyPMIState new_state)
> > +{
> > +    PostcopyPMIState old_state;
> > +
> > +    qemu_mutex_lock(&mis->postcopy_pmi.mutex);
> > +    old_state = postcopy_pmi_get_state_nolock(mis, bitmap_index);
> > +
> > +    if (old_state == expected_state) {
> > +        switch (new_state) {
> > +        case POSTCOPY_PMI_MISSING:
> > +            assert(0); /* This shouldn't happen - use discard_range */
> > +            break;
> > +
> > +        case POSTCOPY_PMI_REQUESTED:
> > +            assert(old_state == POSTCOPY_PMI_MISSING);
> > +            /* missing -> requested */
> > +            set_hp(mis, bitmap_index, POSTCOPY_PMI_REQUESTED);
> > +            break;
> > +
> > +        case POSTCOPY_PMI_RECEIVED:
> > +            assert(old_state == POSTCOPY_PMI_MISSING ||
> > +                   old_state == POSTCOPY_PMI_REQUESTED);
> > +            /* -> received */
> > +            set_hp(mis, bitmap_index, POSTCOPY_PMI_RECEIVED);
> > +            break;
> > +        }
> > +    }
> > +
> > +    qemu_mutex_unlock(&mis->postcopy_pmi.mutex);
> > +    return old_state;
> > +}
> > +
> > +/*
> > + * Useful when debugging postcopy, although if it failed early the
> > + * received map can be quite sparse and thus big when dumped.
> > + */
> > +void postcopy_pmi_dump(MigrationIncomingState *mis)
> > +{
> > +    fprintf(stderr, "postcopy_pmi_dump: bit 0\n");
> > +    ram_debug_dump_bitmap(mis->postcopy_pmi.state0, false);
> > +    fprintf(stderr, "postcopy_pmi_dump: bit 1\n");
> > +    ram_debug_dump_bitmap(mis->postcopy_pmi.state1, true);
> > +    fprintf(stderr, "postcopy_pmi_dump: end\n");
> > +}
> > +
> > +/* Called by ram_load prior to mapping the page */
> > +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> > +                                 size_t bitmap_index)
> > +{
> > +    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
> > +        /*
> > +         * If we're in precopy-advise mode we need to track received pages even
> > +         * though we don't need to place pages atomically yet.
> > +         * In advise mode there's only a single thread, so don't need locks
> > +         */
> > +        set_bit(bitmap_index, mis->postcopy_pmi.state1); /* 2=received */
> 
> Yeah.. so this bypasses postcopy_pmi_{get,change}_state, which again
> makes me wonder whether the enum serves any purpose.

Yes, let me see how to fix that; this is the one place that deals in things
other than host-pages.

> > +    }
> > +}
> > +
> >  static bool ufd_version_check(int ufd)
> >  {
> >      struct uffdio_api api_struct;
> > @@ -71,6 +286,7 @@ static bool ufd_version_check(int ufd)
> >      return true;
> >  }
> >  
> > +
> 
> Extraneous whitespace change.

Oops; gone.

> 
> >  bool postcopy_ram_supported_by_host(void)
> >  {
> >      long pagesize = getpagesize();
> > @@ -157,5 +373,12 @@ bool postcopy_ram_supported_by_host(void)
> >      return false;
> >  }
> >  
> > +/* Called by ram_load prior to mapping the page */
> > +void postcopy_hook_early_receive(MigrationIncomingState *mis,
> > +                                 size_t bitmap_index)
> > +{
> > +    /* We don't support postcopy so don't care */
> > +}
> > +
> >  #endif
> >  

Thanks,

Dave
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream
  2015-03-13 11:51     ` Dr. David Alan Gilbert
@ 2015-03-16  6:16       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-16  6:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 2801 bytes --]

On Fri, Mar 13, 2015 at 11:51:42AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:41PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > MIG_CMD_PACKAGED is a migration command that allows a chunk
> > > of migration stream to be sent in one go, and be received by
> > > a separate instance of the loadvm loop while not interacting
> > > with the migration stream.
> > 
> > Hrm.  I'd be more comfortable if the semantics of CMD_PACKAGED were
> > defined in terms of visible effects on the other end, rather than in
> > terms of how it's implemented internally.
> > 
> > > This is used by postcopy to load device state (from the package)
> > > while loading memory pages from the main stream.
> > 
> > Which makes the above paragraph a bit misleading - the whole point
> > here is that loading the package data *does* interact with the
> > migration stream - just that it's the migration stream after the end
> > of the package.
> 
> Hmm, how about:
> 
> 
> MIG_CMD_PACKAGED is a migration command that wraps a chunk of migration
> stream inside a package whose length can be determined purely by reading
> its header.  The destination guarantees that the whole MIG_CMD_PACKAGED is
> read off the stream prior to parsing the contents.
> 
> This is used by postcopy to load device state (from the package)
> while leaving the main stream free to receive memory pages.

It's an improvement.

I'm still a bit concerned that the semantics of CMD_PACKAGED are
unhealthily bound up with how the current implementation works.

[snip]
> > > +/* Immediately following this command is a blob of data containing an embedded
> > > + * chunk of migration stream; read it and load it.
> > > + */
> > > +static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
> > > +                                      uint32_t length)
> > > +{
> > > +    int ret;
> > > +    uint8_t *buffer;
> > > +    QEMUSizedBuffer *qsb;
> > > +
> > > +    trace_loadvm_handle_cmd_packaged(length);
> > > +
> > > +    if (length > MAX_VM_CMD_PACKAGED_SIZE) {
> > > +        error_report("Unreasonably large packaged state: %u", length);
> > > +        return -1;
> > 
> > It would be a good idea to check this on the send side as well as
> > receive, wouldn't it?
> 
> Yes, I had been doing that in the postcopy code that called the
> send code; but I've now moved it down into savevm_send_packaged.

Good, better symmetry that way.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-13 10:19     ` Dr. David Alan Gilbert
@ 2015-03-16  6:18       ` David Gibson
  2015-03-20 12:37         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-16  6:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 1116 bytes --]

On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Modify save_live_pending to return separate postcopiable and
> > > non-postcopiable counts.
> > > 
> > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > 
> > What's the purpose of the can_postcopy callback?  There are no callers
> > in this patch - is it still necessary with the change to
> > save_live_pending?
> 
> The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> to decide which devices must be completed at that point.

Couldn't they check for non-zero postcopiable state from
save_live_pending instead?
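
Something like the following, say (a sketch only; the save_live_pending
signature with the split counts is assumed from the patch description, not
copied from it):

    /* Sketch: treat "has postcopiable state pending" as the signal that a
     * device should not be completed yet, instead of a can_postcopy() hook. */
    static bool se_has_postcopiable_pending(SaveStateEntry *se, QEMUFile *f,
                                            uint64_t max_size)
    {
        uint64_t res_precopy = 0, res_postcopy = 0;

        se->ops->save_live_pending(f, se->opaque, max_size,
                                   &res_precopy, &res_postcopy);
        return res_postcopy != 0;
    }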

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-13 10:41     ` Dr. David Alan Gilbert
@ 2015-03-16  6:22       ` David Gibson
  2015-03-30  8:14       ` Paolo Bonzini
  1 sibling, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-16  6:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 5022 bytes --]

On Fri, Mar 13, 2015 at 10:41:53AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:45PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Provide a check to see if the OS we're running on has all the bits
> > > needed for postcopy.
> > > 
> > > Creates postcopy-ram.c which will get most of the other helpers we need.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/postcopy-ram.h |  19 +++++
> > >  migration/Makefile.objs          |   2 +-
> > >  migration/postcopy-ram.c         | 161 +++++++++++++++++++++++++++++++++++++++
> > >  savevm.c                         |   5 ++
> > >  4 files changed, 186 insertions(+), 1 deletion(-)
> > >  create mode 100644 include/migration/postcopy-ram.h
> > >  create mode 100644 migration/postcopy-ram.c
> > > 
> > > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > > new file mode 100644
> > > index 0000000..d81934f
> > > --- /dev/null
> > > +++ b/include/migration/postcopy-ram.h
> > > @@ -0,0 +1,19 @@
> > > +/*
> > > + * Postcopy migration for RAM
> > > + *
> > > + * Copyright 2013 Red Hat, Inc. and/or its affiliates
> > > + *
> > > + * Authors:
> > > + *  Dave Gilbert  <dgilbert@redhat.com>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +#ifndef QEMU_POSTCOPY_RAM_H
> > > +#define QEMU_POSTCOPY_RAM_H
> > > +
> > > +/* Return true if the host supports everything we need to do postcopy-ram */
> > > +bool postcopy_ram_supported_by_host(void);
> > > +
> > > +#endif
> > > diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> > > index d929e96..0cac6d7 100644
> > > --- a/migration/Makefile.objs
> > > +++ b/migration/Makefile.objs
> > > @@ -1,7 +1,7 @@
> > >  common-obj-y += migration.o tcp.o
> > >  common-obj-y += vmstate.o
> > >  common-obj-y += qemu-file.o qemu-file-buf.o qemu-file-unix.o qemu-file-stdio.o
> > > -common-obj-y += xbzrle.o
> > > +common-obj-y += xbzrle.o postcopy-ram.o
> > >  
> > >  common-obj-$(CONFIG_RDMA) += rdma.o
> > >  common-obj-$(CONFIG_POSIX) += exec.o unix.o fd.o
> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > > new file mode 100644
> > > index 0000000..a0e20b2
> > > --- /dev/null
> > > +++ b/migration/postcopy-ram.c
> > > @@ -0,0 +1,161 @@
> > > +/*
> > > + * Postcopy migration for RAM
> > > + *
> > > + * Copyright 2013-2014 Red Hat, Inc. and/or its affiliates
> > > + *
> > > + * Authors:
> > > + *  Dave Gilbert  <dgilbert@redhat.com>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +/*
> > > + * Postcopy is a migration technique where the execution flips from the
> > > + * source to the destination before all the data has been copied.
> > > + */
> > > +
> > > +#include <glib.h>
> > > +#include <stdio.h>
> > > +#include <unistd.h>
> > > +
> > > +#include "qemu-common.h"
> > > +#include "migration/migration.h"
> > > +#include "migration/postcopy-ram.h"
> > > +#include "sysemu/sysemu.h"
> > > +#include "qemu/error-report.h"
> > > +#include "trace.h"
> > > +
> > > +/* Postcopy needs to detect accesses to pages that haven't yet been copied
> > > + * across, and efficiently map new pages in, the techniques for doing this
> > > + * are target OS specific.
> > > + */
> > > +#if defined(__linux__)
> > > +
> > > +#include <sys/mman.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/types.h>
> > > +#include <asm/types.h> /* for __u64 */
> > > +#include <linux/userfaultfd.h>
> > > +
> > > +#ifdef HOST_X86_64
> > > +#ifndef __NR_userfaultfd
> > > +#define __NR_userfaultfd 323
> > 
> > Sholdn't this come from the kernel headers imported in the previous
> > patch?  Rather than having an arch-specific hack.
> 
> The header, like the rest of the kernel headers, just provides
> the constant and structure definitions for the call; the syscall numbers
> come from arch specific headers.  I guess in the final world I wouldn't
> need this at all since it'll come from the system headers; but what's
> the right way to put this in for new syscalls?

Uh.. yeah.. I'm not sure actually.  I was assuming the imported kernel
headers, because part of the reason they're there is for getting
defines that aren't in the system headers (e.g. ioctl numbers and
structures).  But without asm/unistd.h it's not clear to me how a new
system call should be handled.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-13 11:19     ` Dr. David Alan Gilbert
@ 2015-03-16  6:23       ` David Gibson
  2015-03-18 17:59         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-16  6:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 5821 bytes --]

On Fri, Mar 13, 2015 at 11:19:06AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Once postcopy is enabled (with migrate_set_capability), the migration
> > > will still start on precopy mode.  To cause a transition into postcopy
> > > the:
> > > 
> > >   migrate_start_postcopy
> > > 
> > > command must be issued.  Postcopy will start sometime after this
> > > (when it's next checked in the migration loop).
> > > 
> > > Issuing the command before migration has started will error,
> > > and issuing after it has finished is ignored.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > Reviewed-by: Eric Blake <eblake@redhat.com>
> > > ---
> > >  hmp-commands.hx               | 15 +++++++++++++++
> > >  hmp.c                         |  7 +++++++
> > >  hmp.h                         |  1 +
> > >  include/migration/migration.h |  3 +++
> > >  migration/migration.c         | 22 ++++++++++++++++++++++
> > >  qapi-schema.json              |  8 ++++++++
> > >  qmp-commands.hx               | 19 +++++++++++++++++++
> > >  7 files changed, 75 insertions(+)
> > > 
> > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > index e37bc8b..03b8b78 100644
> > > --- a/hmp-commands.hx
> > > +++ b/hmp-commands.hx
> > > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> > >  ETEXI
> > >  
> > >      {
> > > +        .name       = "migrate_start_postcopy",
> > > +        .args_type  = "",
> > > +        .params     = "",
> > > +        .help       = "Switch migration to postcopy mode",
> > > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > > +    },
> > > +
> > > +STEXI
> > > +@item migrate_start_postcopy
> > > +@findex migrate_start_postcopy
> > > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > > +migration (or once already in postcopy).
> > > +ETEXI
> > > +
> > > +    {
> > >          .name       = "client_migrate_info",
> > >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> > >          .params     = "protocol hostname port tls-port cert-subject",
> > > diff --git a/hmp.c b/hmp.c
> > > index b47f331..df9736c 100644
> > > --- a/hmp.c
> > > +++ b/hmp.c
> > > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> > >      }
> > >  }
> > >  
> > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > > +{
> > > +    Error *err = NULL;
> > > +    qmp_migrate_start_postcopy(&err);
> > > +    hmp_handle_error(mon, &err);
> > > +}
> > > +
> > >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> > >  {
> > >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > > diff --git a/hmp.h b/hmp.h
> > > index 4bb5dca..da1334f 100644
> > > --- a/hmp.h
> > > +++ b/hmp.h
> > > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> > >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> > >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> > >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> > >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> > >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> > >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index e6a814a..293c83e 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -104,6 +104,9 @@ struct MigrationState
> > >      int64_t xbzrle_cache_size;
> > >      int64_t setup_time;
> > >      int64_t dirty_sync_count;
> > > +
> > > +    /* Flag set once the migration has been asked to enter postcopy */
> > > +    bool start_postcopy;
> > >  };
> > >  
> > >  void process_incoming_migration(QEMUFile *f);
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index a4fc7d7..43ca656 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > >      }
> > >  }
> > >  
> > > +void qmp_migrate_start_postcopy(Error **errp)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +
> > > +    if (!migrate_postcopy_ram()) {
> > > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > > +                         " the start of migration");
> > > +        return;
> > > +    }
> > > +
> > > +    if (s->state == MIG_STATE_NONE) {
> > > +        error_setg(errp, "Postcopy must be started after migration has been"
> > > +                         " started");
> > > +        return;
> > > +    }
> > > +    /*
> > > +     * we don't error if migration has finished since that would be racy
> > > +     * with issuing this command.
> > > +     */
> > > +    atomic_set(&s->start_postcopy, true);
> > 
> > Why atomic_set?
> 
> It's being read by the migration thread, this is happening in the main thread.
> 
> There's no strict ordering requirement or anything.

I don't think you need an atomic then.  AFAIK an atomic_set() in
isolation without some sort of atomic on the other side is pretty much
meaningless.
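
To spell out the pairing I mean (a sketch only; whether the migration loop
actually reads the flag via a helper like this is an assumption):

    /* Writer (main thread, QMP handler): atomic_set(&s->start_postcopy, true);
     * Reader (migration thread), sketched as a helper: */
    static bool postcopy_requested(MigrationState *s)
    {
        return migrate_postcopy_ram() && atomic_read(&s->start_postcopy);
    }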

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-03-13 13:47     ` Dr. David Alan Gilbert
@ 2015-03-16  6:30       ` David Gibson
  2015-03-18 17:58         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-16  6:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 2290 bytes --]

On Fri, Mar 13, 2015 at 01:47:53PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:49PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
[snip]
> > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > +    } else {
> > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > +    }
> > > +    if (state & 2) {
> > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > +    } else {
> > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > +    }
> > > +}
> > > +
> > > +/*
> > > + * Retrieve the state of the given page
> > > + * Note: This version for use by callers already holding the lock
> > > + */
> > > +static PostcopyPMIState postcopy_pmi_get_state_nolock(
> > > +                            MigrationIncomingState *mis,
> > > +                            size_t bitmap_index)
> > > +{
> > > +    bool b0, b1;
> > > +
> > > +    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
> > > +    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
> > > +
> > > +    return (b0 ? 1 : 0) + (b1 ? 2 : 0);
> > 
> > Ugh.. this is a hidden dependency on the PostcopyPMIState enum
> > elements never changing value.  Safer to code it as:
> >       if (!b0 && !b1) {
> >           return POSTCOPY_PMI_MISSING;
> >       } else if (...)
> >            ...
> > 
> > and let gcc sort it out.
> 
> Again, I was trying to make this just the interface; so it doesn't
> know or care about the enum mapping; we can change the enum mapping to
> the bits without changing this function (or the callers) at all.

So.. I'm not entirely clear what you mean by that.  I think what
you're saying is that this function basically returns an arbitrary bit
pattern derived from the state maps, and the enum provides the mapping
from those bit patterns to meaningful states?

That's.. subtle :/.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-03-16  6:30       ` David Gibson
@ 2015-03-18 17:58         ` Dr. David Alan Gilbert
  2015-03-23  2:48           ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-18 17:58 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Fri, Mar 13, 2015 at 01:47:53PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:49PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> [snip]
> > > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > > +    } else {
> > > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > > +    }
> > > > +    if (state & 2) {
> > > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > > +    } else {
> > > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > > +    }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Retrieve the state of the given page
> > > > + * Note: This version for use by callers already holding the lock
> > > > + */
> > > > +static PostcopyPMIState postcopy_pmi_get_state_nolock(
> > > > +                            MigrationIncomingState *mis,
> > > > +                            size_t bitmap_index)
> > > > +{
> > > > +    bool b0, b1;
> > > > +
> > > > +    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
> > > > +    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
> > > > +
> > > > +    return (b0 ? 1 : 0) + (b1 ? 2 : 0);
> > > 
> > > Ugh.. this is a hidden dependency on the PostcopyPMIState enum
> > > elements never changing value.  Safer to code it as:
> > >       if (!b0 && !b1) {
> > >           return POSTCOPY_PMI_MISSING;
> > >       } else if (...)
> > >            ...
> > > 
> > > and let gcc sort it out.
> > 
> > Again, I was trying to make this just the interface; so it doesn't
> > know or care about the enum mapping; we can change the enum mapping to
> > the bits without changing this function (or the callers) at all.
> 
> So.. I'm not entirely clear what you mean by that.  I think what
> you're saying is that this function basically returns an arbitrary bit
> pattern derived from the state maps, and the enum provides the mapping
> from those bit patterns to meaningful states?
> 
> That's.. subtle :/.

I'm saying that I'd like everywhere to work in terms of the enum; but
since I can't store an array of enums I need to convert somewhere.
If I can keep the conversion to a couple of functions that know
about the bit layout, and have everything else use those functions, then it
feels safe/clean.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-16  6:23       ` David Gibson
@ 2015-03-18 17:59         ` Dr. David Alan Gilbert
  2015-03-19  4:18           ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-18 17:59 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Fri, Mar 13, 2015 at 11:19:06AM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Once postcopy is enabled (with migrate_set_capability), the migration
> > > > will still start on precopy mode.  To cause a transition into postcopy
> > > > the:
> > > > 
> > > >   migrate_start_postcopy
> > > > 
> > > > command must be issued.  Postcopy will start sometime after this
> > > > (when it's next checked in the migration loop).
> > > > 
> > > > Issuing the command before migration has started will error,
> > > > and issuing after it has finished is ignored.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > Reviewed-by: Eric Blake <eblake@redhat.com>
> > > > ---
> > > >  hmp-commands.hx               | 15 +++++++++++++++
> > > >  hmp.c                         |  7 +++++++
> > > >  hmp.h                         |  1 +
> > > >  include/migration/migration.h |  3 +++
> > > >  migration/migration.c         | 22 ++++++++++++++++++++++
> > > >  qapi-schema.json              |  8 ++++++++
> > > >  qmp-commands.hx               | 19 +++++++++++++++++++
> > > >  7 files changed, 75 insertions(+)
> > > > 
> > > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > > index e37bc8b..03b8b78 100644
> > > > --- a/hmp-commands.hx
> > > > +++ b/hmp-commands.hx
> > > > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> > > >  ETEXI
> > > >  
> > > >      {
> > > > +        .name       = "migrate_start_postcopy",
> > > > +        .args_type  = "",
> > > > +        .params     = "",
> > > > +        .help       = "Switch migration to postcopy mode",
> > > > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > > > +    },
> > > > +
> > > > +STEXI
> > > > +@item migrate_start_postcopy
> > > > +@findex migrate_start_postcopy
> > > > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > > > +migration (or once already in postcopy).
> > > > +ETEXI
> > > > +
> > > > +    {
> > > >          .name       = "client_migrate_info",
> > > >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> > > >          .params     = "protocol hostname port tls-port cert-subject",
> > > > diff --git a/hmp.c b/hmp.c
> > > > index b47f331..df9736c 100644
> > > > --- a/hmp.c
> > > > +++ b/hmp.c
> > > > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> > > >      }
> > > >  }
> > > >  
> > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > > > +{
> > > > +    Error *err = NULL;
> > > > +    qmp_migrate_start_postcopy(&err);
> > > > +    hmp_handle_error(mon, &err);
> > > > +}
> > > > +
> > > >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> > > >  {
> > > >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > > > diff --git a/hmp.h b/hmp.h
> > > > index 4bb5dca..da1334f 100644
> > > > --- a/hmp.h
> > > > +++ b/hmp.h
> > > > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> > > >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> > > >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> > > >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> > > >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> > > >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> > > >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > index e6a814a..293c83e 100644
> > > > --- a/include/migration/migration.h
> > > > +++ b/include/migration/migration.h
> > > > @@ -104,6 +104,9 @@ struct MigrationState
> > > >      int64_t xbzrle_cache_size;
> > > >      int64_t setup_time;
> > > >      int64_t dirty_sync_count;
> > > > +
> > > > +    /* Flag set once the migration has been asked to enter postcopy */
> > > > +    bool start_postcopy;
> > > >  };
> > > >  
> > > >  void process_incoming_migration(QEMUFile *f);
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index a4fc7d7..43ca656 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > > >      }
> > > >  }
> > > >  
> > > > +void qmp_migrate_start_postcopy(Error **errp)
> > > > +{
> > > > +    MigrationState *s = migrate_get_current();
> > > > +
> > > > +    if (!migrate_postcopy_ram()) {
> > > > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > > > +                         " the start of migration");
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    if (s->state == MIG_STATE_NONE) {
> > > > +        error_setg(errp, "Postcopy must be started after migration has been"
> > > > +                         " started");
> > > > +        return;
> > > > +    }
> > > > +    /*
> > > > +     * we don't error if migration has finished since that would be racy
> > > > +     * with issuing this command.
> > > > +     */
> > > > +    atomic_set(&s->start_postcopy, true);
> > > 
> > > Why atomic_set?
> > 
> > It's being read by the migration thread, this is happening in the main thread.
> > 
> > There's no strict ordering requirement or anything.
> 
> I don't think you need an atomic then.  AFAIK an atomic_set() in
> isolation without some sort of atomic on the other side is pretty much
> meaningless.

The other side has an atomic_read:

                if (migrate_postcopy_ram() &&
                    s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
                    pend_nonpost <= max_size &&
                    atomic_read(&s->start_postcopy)) {

                    if (!postcopy_start(s, &old_vm_running)) {
                        current_active_type = MIGRATION_STATUS_POSTCOPY_ACTIVE;
                        entered_postcopy = true;
                    }

so it is at least symmetric.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-18 17:59         ` Dr. David Alan Gilbert
@ 2015-03-19  4:18           ` David Gibson
  2015-03-19  9:33             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-19  4:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 7366 bytes --]

On Wed, Mar 18, 2015 at 05:59:51PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Fri, Mar 13, 2015 at 11:19:06AM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Once postcopy is enabled (with migrate_set_capability), the migration
> > > > > will still start on precopy mode.  To cause a transition into postcopy
> > > > > the:
> > > > > 
> > > > >   migrate_start_postcopy
> > > > > 
> > > > > command must be issued.  Postcopy will start sometime after this
> > > > > (when it's next checked in the migration loop).
> > > > > 
> > > > > Issuing the command before migration has started will error,
> > > > > and issuing after it has finished is ignored.
> > > > > 
> > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > Reviewed-by: Eric Blake <eblake@redhat.com>
> > > > > ---
> > > > >  hmp-commands.hx               | 15 +++++++++++++++
> > > > >  hmp.c                         |  7 +++++++
> > > > >  hmp.h                         |  1 +
> > > > >  include/migration/migration.h |  3 +++
> > > > >  migration/migration.c         | 22 ++++++++++++++++++++++
> > > > >  qapi-schema.json              |  8 ++++++++
> > > > >  qmp-commands.hx               | 19 +++++++++++++++++++
> > > > >  7 files changed, 75 insertions(+)
> > > > > 
> > > > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > > > index e37bc8b..03b8b78 100644
> > > > > --- a/hmp-commands.hx
> > > > > +++ b/hmp-commands.hx
> > > > > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> > > > >  ETEXI
> > > > >  
> > > > >      {
> > > > > +        .name       = "migrate_start_postcopy",
> > > > > +        .args_type  = "",
> > > > > +        .params     = "",
> > > > > +        .help       = "Switch migration to postcopy mode",
> > > > > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > > > > +    },
> > > > > +
> > > > > +STEXI
> > > > > +@item migrate_start_postcopy
> > > > > +@findex migrate_start_postcopy
> > > > > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > > > > +migration (or once already in postcopy).
> > > > > +ETEXI
> > > > > +
> > > > > +    {
> > > > >          .name       = "client_migrate_info",
> > > > >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> > > > >          .params     = "protocol hostname port tls-port cert-subject",
> > > > > diff --git a/hmp.c b/hmp.c
> > > > > index b47f331..df9736c 100644
> > > > > --- a/hmp.c
> > > > > +++ b/hmp.c
> > > > > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> > > > >      }
> > > > >  }
> > > > >  
> > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > > > > +{
> > > > > +    Error *err = NULL;
> > > > > +    qmp_migrate_start_postcopy(&err);
> > > > > +    hmp_handle_error(mon, &err);
> > > > > +}
> > > > > +
> > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> > > > >  {
> > > > >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > > > > diff --git a/hmp.h b/hmp.h
> > > > > index 4bb5dca..da1334f 100644
> > > > > --- a/hmp.h
> > > > > +++ b/hmp.h
> > > > > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> > > > >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > > index e6a814a..293c83e 100644
> > > > > --- a/include/migration/migration.h
> > > > > +++ b/include/migration/migration.h
> > > > > @@ -104,6 +104,9 @@ struct MigrationState
> > > > >      int64_t xbzrle_cache_size;
> > > > >      int64_t setup_time;
> > > > >      int64_t dirty_sync_count;
> > > > > +
> > > > > +    /* Flag set once the migration has been asked to enter postcopy */
> > > > > +    bool start_postcopy;
> > > > >  };
> > > > >  
> > > > >  void process_incoming_migration(QEMUFile *f);
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index a4fc7d7..43ca656 100644
> > > > > --- a/migration/migration.c
> > > > > +++ b/migration/migration.c
> > > > > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > > > >      }
> > > > >  }
> > > > >  
> > > > > +void qmp_migrate_start_postcopy(Error **errp)
> > > > > +{
> > > > > +    MigrationState *s = migrate_get_current();
> > > > > +
> > > > > +    if (!migrate_postcopy_ram()) {
> > > > > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > > > > +                         " the start of migration");
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    if (s->state == MIG_STATE_NONE) {
> > > > > +        error_setg(errp, "Postcopy must be started after migration has been"
> > > > > +                         " started");
> > > > > +        return;
> > > > > +    }
> > > > > +    /*
> > > > > +     * we don't error if migration has finished since that would be racy
> > > > > +     * with issuing this command.
> > > > > +     */
> > > > > +    atomic_set(&s->start_postcopy, true);
> > > > 
> > > > Why atomic_set?
> > > 
> > > It's being read by the migration thread, this is happening in the main thread.
> > > 
> > > There's no strict ordering requirement or anything.
> > 
> > I don't think you need an atomic then.  AFAIK an atomic_set() in
> > isolation without some sort of atomic on the other side is pretty much
> > meaningless.
> 
> The other side has an atomic_read:
> 
>                 if (migrate_postcopy_ram() &&
>                     s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
>                     pend_nonpost <= max_size &&
>                     atomic_read(&s->start_postcopy)) {
> 
>                     if (!postcopy_start(s, &old_vm_running)) {
>                         current_active_type = MIGRATION_STATUS_POSTCOPY_ACTIVE;
>                         entered_postcopy = true;
>                     }
> 
> so it is at least symmetric.

But still pointless.  Atomicity isn't magic pixie dust; it only makes
sense if you're making atomic specific operations that need to be.
Simple integer loads and stores are already atomic.  Unless at least
some of the atomic operations are something more complex, there's
really no point to atomically marked operations.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-19  4:18           ` David Gibson
@ 2015-03-19  9:33             ` Dr. David Alan Gilbert
  2015-03-23  2:20               ` David Gibson
  2015-03-30  8:17               ` Paolo Bonzini
  0 siblings, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-19  9:33 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Mar 18, 2015 at 05:59:51PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Fri, Mar 13, 2015 at 11:19:06AM +0000, Dr. David Alan Gilbert wrote:
> > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > Once postcopy is enabled (with migrate_set_capability), the migration
> > > > > > will still start on precopy mode.  To cause a transition into postcopy
> > > > > > the:
> > > > > > 
> > > > > >   migrate_start_postcopy
> > > > > > 
> > > > > > command must be issued.  Postcopy will start sometime after this
> > > > > > (when it's next checked in the migration loop).
> > > > > > 
> > > > > > Issuing the command before migration has started will error,
> > > > > > and issuing after it has finished is ignored.
> > > > > > 
> > > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > > Reviewed-by: Eric Blake <eblake@redhat.com>
> > > > > > ---
> > > > > >  hmp-commands.hx               | 15 +++++++++++++++
> > > > > >  hmp.c                         |  7 +++++++
> > > > > >  hmp.h                         |  1 +
> > > > > >  include/migration/migration.h |  3 +++
> > > > > >  migration/migration.c         | 22 ++++++++++++++++++++++
> > > > > >  qapi-schema.json              |  8 ++++++++
> > > > > >  qmp-commands.hx               | 19 +++++++++++++++++++
> > > > > >  7 files changed, 75 insertions(+)
> > > > > > 
> > > > > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > > > > index e37bc8b..03b8b78 100644
> > > > > > --- a/hmp-commands.hx
> > > > > > +++ b/hmp-commands.hx
> > > > > > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> > > > > >  ETEXI
> > > > > >  
> > > > > >      {
> > > > > > +        .name       = "migrate_start_postcopy",
> > > > > > +        .args_type  = "",
> > > > > > +        .params     = "",
> > > > > > +        .help       = "Switch migration to postcopy mode",
> > > > > > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > > > > > +    },
> > > > > > +
> > > > > > +STEXI
> > > > > > +@item migrate_start_postcopy
> > > > > > +@findex migrate_start_postcopy
> > > > > > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > > > > > +migration (or once already in postcopy).
> > > > > > +ETEXI
> > > > > > +
> > > > > > +    {
> > > > > >          .name       = "client_migrate_info",
> > > > > >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> > > > > >          .params     = "protocol hostname port tls-port cert-subject",
> > > > > > diff --git a/hmp.c b/hmp.c
> > > > > > index b47f331..df9736c 100644
> > > > > > --- a/hmp.c
> > > > > > +++ b/hmp.c
> > > > > > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> > > > > >      }
> > > > > >  }
> > > > > >  
> > > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > > > > > +{
> > > > > > +    Error *err = NULL;
> > > > > > +    qmp_migrate_start_postcopy(&err);
> > > > > > +    hmp_handle_error(mon, &err);
> > > > > > +}
> > > > > > +
> > > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> > > > > >  {
> > > > > >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > > > > > diff --git a/hmp.h b/hmp.h
> > > > > > index 4bb5dca..da1334f 100644
> > > > > > --- a/hmp.h
> > > > > > +++ b/hmp.h
> > > > > > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> > > > > >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > > > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > > > index e6a814a..293c83e 100644
> > > > > > --- a/include/migration/migration.h
> > > > > > +++ b/include/migration/migration.h
> > > > > > @@ -104,6 +104,9 @@ struct MigrationState
> > > > > >      int64_t xbzrle_cache_size;
> > > > > >      int64_t setup_time;
> > > > > >      int64_t dirty_sync_count;
> > > > > > +
> > > > > > +    /* Flag set once the migration has been asked to enter postcopy */
> > > > > > +    bool start_postcopy;
> > > > > >  };
> > > > > >  
> > > > > >  void process_incoming_migration(QEMUFile *f);
> > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > index a4fc7d7..43ca656 100644
> > > > > > --- a/migration/migration.c
> > > > > > +++ b/migration/migration.c
> > > > > > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > > > > >      }
> > > > > >  }
> > > > > >  
> > > > > > +void qmp_migrate_start_postcopy(Error **errp)
> > > > > > +{
> > > > > > +    MigrationState *s = migrate_get_current();
> > > > > > +
> > > > > > +    if (!migrate_postcopy_ram()) {
> > > > > > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > > > > > +                         " the start of migration");
> > > > > > +        return;
> > > > > > +    }
> > > > > > +
> > > > > > +    if (s->state == MIG_STATE_NONE) {
> > > > > > +        error_setg(errp, "Postcopy must be started after migration has been"
> > > > > > +                         " started");
> > > > > > +        return;
> > > > > > +    }
> > > > > > +    /*
> > > > > > +     * we don't error if migration has finished since that would be racy
> > > > > > +     * with issuing this command.
> > > > > > +     */
> > > > > > +    atomic_set(&s->start_postcopy, true);
> > > > > 
> > > > > Why atomic_set?
> > > > 
> > > > It's being read by the migration thread, this is happening in the main thread.
> > > > 
> > > > There's no strict ordering requirement or anything.
> > > 
> > > I don't think you need an atomic then.  AFAIK an atomic_set() in
> > > isolation without some sort of atomic on the other side is pretty much
> > > meaningless.
> > 
> > The other side has an atomic_read:
> > 
> >                 if (migrate_postcopy_ram() &&
> >                     s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
> >                     pend_nonpost <= max_size &&
> >                     atomic_read(&s->start_postcopy)) {
> > 
> >                     if (!postcopy_start(s, &old_vm_running)) {
> >                         current_active_type = MIGRATION_STATUS_POSTCOPY_ACTIVE;
> >                         entered_postcopy = true;
> >                     }
> > 
> > so it is at least symmetric.
> 
> But still pointless.  Atomicity isn't magic pixie dust; it only makes
> sense if you're making atomic specific operations that need to be.
> Simple integer loads and stores are already atomic.  Unless at least
> some of the atomic operations are something more complex, there's
> really no point to atomically marked operations.

OK, I'll kill it off.

It'll work in practice, but I still believe that what you're saying isn't
safe C:
   1) There's no barrier after the write, so there's no guarantee the other
      thread will eventually see it (in practice we've got other pthread ops
      we take so we will get a barrier somewhere, and most CPUs eventually
      do propagate the store).

   2) The read side could legally be optimised out of the loop by the compiler
      (but in practice won't be, because compilers won't optimise that far);
      see the sketch below.
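
For what it's worth, a self-contained sketch in C11 atomics (not QEMU's
atomic.h; the names here are invented) of what addressing both points
explicitly would look like: the release store is guaranteed to become
visible, and the acquire load can't be hoisted out of the loop.

#include <stdatomic.h>
#include <stdbool.h>

/* stands in for s->start_postcopy */
static atomic_bool start_postcopy_flag;

/* main thread: the QMP/HMP command handler sets the flag */
static void request_postcopy(void)
{
    atomic_store_explicit(&start_postcopy_flag, true, memory_order_release);
}

/* migration thread: polled once per iteration of the migration loop */
static bool should_start_postcopy(void)
{
    return atomic_load_explicit(&start_postcopy_flag, memory_order_acquire);
}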

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-16  6:18       ` David Gibson
@ 2015-03-20 12:37         ` Dr. David Alan Gilbert
  2015-03-23  2:25           ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-20 12:37 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Modify save_live_pending to return separate postcopiable and
> > > > non-postcopiable counts.
> > > > 
> > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > 
> > > What's the purpose of the can_postcopy callback?  There are no callers
> > > in this patch - is it still necessary with the change to
> > > save_live_pending?
> > 
> > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > to decide which devices must be completed at that point.
> 
> Couldn't they check for non-zero postcopiable state from
> save_live_pending instead?

That would be a bit weird.

At the moment for each device we call the:
       save_live_setup method (from qemu_savevm_state_begin)

   0...multiple times we call:
       save_live_pending
       save_live_iterate

   and then we always call
       save_live_complete


To my mind we have to call save_live_complete for any device
that we've called save_live_setup on (maybe it allocated something
in _setup that it clears up in _complete).

save_live_pending could perfectly well return 0 remaining at the end of
the migrate for our device, and thus if we used that then we wouldn't
call save_live_complete.
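
Purely as an illustration (the struct and function names below are made up;
this is not the real qemu_savevm_* code), the calling pattern I'm describing
is roughly:

#include <stdint.h>
#include <stddef.h>

typedef struct SaveHandlerSketch {
    void     (*setup)(void *opaque);      /* from qemu_savevm_state_begin */
    uint64_t (*pending)(void *opaque);    /* bytes this device still has  */
    void     (*iterate)(void *opaque);
    void     (*complete)(void *opaque);   /* must pair with setup         */
    void      *opaque;
} SaveHandlerSketch;

static void run_migration_sketch(SaveHandlerSketch *h, size_t n,
                                 uint64_t max_size)
{
    size_t i;

    for (i = 0; i < n; i++) {
        h[i].setup(h[i].opaque);
    }

    for (;;) {
        uint64_t pending = 0;

        for (i = 0; i < n; i++) {
            pending += h[i].pending(h[i].opaque);
        }
        if (pending <= max_size) {
            break;
        }
        for (i = 0; i < n; i++) {
            h[i].iterate(h[i].opaque);
        }
    }

    /*
     * complete runs for every device that saw setup, even if that
     * device's own pending count has already reached zero.
     */
    for (i = 0; i < n; i++) {
        h[i].complete(h[i].opaque);
    }
}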

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-03-10  6:08   ` David Gibson
@ 2015-03-20 18:17     ` Dr. David Alan Gilbert
  2015-03-23  2:37       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-20 18:17 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:35PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Open a return path, and handle messages that are received upon it.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |   8 ++
> >  migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
> >  trace-events                  |  13 +++
> >  3 files changed, 198 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 6775747..5242ead 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -73,6 +73,14 @@ struct MigrationState
> >  
> >      int state;
> >      MigrationParams params;
> > +
> > +    /* State related to return path */
> > +    struct {
> > +        QEMUFile     *file;
> > +        QemuThread    rp_thread;
> > +        bool          error;
> > +    } rp_state;
> > +
> >      double mbps;
> >      int64_t total_time;
> >      int64_t downtime;
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 80d234c..34cd4fe 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
> >      return head;
> >  }
> >  
> > +/*
> > + * Return true if we're already in the middle of a migration
> > + * (i.e. any of the active or setup states)
> > + */
> > +static bool migration_already_active(MigrationState *ms)
> > +{
> > +    switch (ms->state) {
> > +    case MIG_STATE_ACTIVE:
> > +    case MIG_STATE_SETUP:
> > +        return true;
> > +
> > +    default:
> > +        return false;
> > +
> > +    }
> > +}
> > +
> >  static void get_xbzrle_cache_stats(MigrationInfo *info)
> >  {
> >      if (migrate_use_xbzrle()) {
> > @@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> >      }
> >  }
> >  
> > +static void migrate_fd_cleanup_src_rp(MigrationState *ms)
> > +{
> > +    QEMUFile *rp = ms->rp_state.file;
> > +
> > +    /*
> > +     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
> > +     * cleaned up from a few threads; make sure not to do it twice in parallel
> > +     */
> > +    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);
> 
> A cmpxchg seems dangerously subtle for such a basic and infrequent
> operation, but ok.

I'll take other suggestions; but I'm trying to just do
'if the qemu_file still exists close it', and it didn't seem
worth introducing another state variable to atomically update
when we've already got the file pointer itself.
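
A self-contained sketch of the pattern I mean, using C11 atomics and stdio
to stand in for QEMU's atomic_cmpxchg() and QEMUFile; whichever thread wins
the swap to NULL is the one that closes the file:

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* stands in for ms->rp_state.file */
static _Atomic(FILE *) rp_file;

static void cleanup_return_path(void)
{
    FILE *f = atomic_load(&rp_file);

    /*
     * Only the thread that successfully swaps the pointer to NULL does
     * the close; a concurrent caller either sees NULL already or loses
     * the exchange, and does nothing.
     */
    if (f && atomic_compare_exchange_strong(&rp_file, &f, NULL)) {
        fclose(f);
    }
}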

> > +    if (rp) {
> > +        trace_migrate_fd_cleanup_src_rp();
> > +        qemu_fclose(rp);
> > +    }
> > +}
> > +
> >  static void migrate_fd_cleanup(void *opaque)
> >  {
> >      MigrationState *s = opaque;
> > @@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
> >      qemu_bh_delete(s->cleanup_bh);
> >      s->cleanup_bh = NULL;
> >  
> > +    migrate_fd_cleanup_src_rp(s);
> > +
> >      if (s->file) {
> >          trace_migrate_fd_cleanup();
> >          qemu_mutex_unlock_iothread();
> > @@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
> >      QEMUFile *f = migrate_get_current()->file;
> >      trace_migrate_fd_cancel();
> >  
> > +    if (s->rp_state.file) {
> > +        /* shutdown the rp socket, so causing the rp thread to shutdown */
> > +        qemu_file_shutdown(s->rp_state.file);
> 
> I missed where qemu_file_shutdown() was implemented.  Does this
> introduce a leftover socket dependency?

No, it shouldn't.  The shutdown() causes a shutdown(2) syscall to
be issued on the socket, stopping anything blocking on it; the file then
gets closed at the end, after the rp thread has exited.
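
In other words, the standard trick for kicking a thread that's blocked in a
read on the socket; roughly (plain POSIX, nothing QEMU-specific here):

#include <sys/socket.h>
#include <unistd.h>

static void kick_blocked_reader(int sockfd)
{
    /*
     * shutdown(2) makes a read() blocked in another thread return
     * immediately (EOF), but does not release the descriptor; the
     * reader can then exit its loop and the fd is close()d afterwards.
     */
    shutdown(sockfd, SHUT_RDWR);
}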

> > +    }
> > +
> >      do {
> >          old_state = s->state;
> >          if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
> > @@ -658,8 +697,145 @@ int64_t migrate_xbzrle_cache_size(void)
> >      return s->xbzrle_cache_size;
> >  }
> >  
> > -/* migration thread support */
> > +/*
> > + * Something bad happened to the RP stream, mark an error
> > + * The caller shall print something to indicate why
> > + */
> > +static void source_return_path_bad(MigrationState *s)
> > +{
> > +    s->rp_state.error = true;
> > +    migrate_fd_cleanup_src_rp(s);
> > +}
> > +
> > +/*
> > + * Handles messages sent on the return path towards the source VM
> > + *
> > + */
> > +static void *source_return_path_thread(void *opaque)
> > +{
> > +    MigrationState *ms = opaque;
> > +    QEMUFile *rp = ms->rp_state.file;
> > +    uint16_t expected_len, header_len, header_com;
> > +    const int max_len = 512;
> > +    uint8_t buf[max_len];
> > +    uint32_t tmp32;
> > +    int res;
> > +
> > +    trace_source_return_path_thread_entry();
> > +    while (rp && !qemu_file_get_error(rp) &&
> > +        migration_already_active(ms)) {
> > +        trace_source_return_path_thread_loop_top();
> > +        header_com = qemu_get_be16(rp);
> > +        header_len = qemu_get_be16(rp);
> > +
> > +        switch (header_com) {
> > +        case MIG_RP_CMD_SHUT:
> > +        case MIG_RP_CMD_PONG:
> > +            expected_len = 4;
> 
> Could the knowledge of expected lengths be folded into the switch
> below?  Switching twice on the same thing is a bit icky.

No, because the length at this point is used to validate the
length field in the header prior to reading the body.
The other switch processes the contents of the body that
have been read.

> > +            break;
> > +
> > +        default:
> > +            error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> > +                    header_com, header_len);
> > +            source_return_path_bad(ms);
> > +            goto out;
> > +        }
> >  
> > +        if (header_len > expected_len) {
> > +            error_report("RP: Received command 0x%04x with"
> > +                    "incorrect length %d expecting %d",
> > +                    header_com, header_len,
> > +                    expected_len);
> > +            source_return_path_bad(ms);
> > +            goto out;
> > +        }
> > +
> > +        /* We know we've got a valid header by this point */
> > +        res = qemu_get_buffer(rp, buf, header_len);
> > +        if (res != header_len) {
> > +            trace_source_return_path_thread_failed_read_cmd_data();
> > +            source_return_path_bad(ms);
> > +            goto out;
> > +        }
> > +
> > +        /* OK, we have the command and the data */
> > +        switch (header_com) {
> > +        case MIG_RP_CMD_SHUT:
> > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > +            trace_source_return_path_thread_shut(tmp32);
> > +            if (tmp32) {
> > +                error_report("RP: Sibling indicated error %d", tmp32);
> > +                source_return_path_bad(ms);
> > +            }
> > +            /*
> > +             * We'll let the main thread deal with closing the RP
> > +             * we could do a shutdown(2) on it, but we're the only user
> > +             * anyway, so there's nothing gained.
> > +             */
> > +            goto out;
> > +
> > +        case MIG_RP_CMD_PONG:
> > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > +            trace_source_return_path_thread_pong(tmp32);
> > +            break;
> > +
> > +        default:
> > +            /* This shouldn't happen because we should catch this above */
> > +            trace_source_return_path_bad_header_com();
> > +        }
> > +        /* Latest command processed, now leave a gap for the next one */
> > +        header_com = MIG_RP_CMD_INVALID;
> 
> This assignment will always get overwritten.

Thanks; gone - it's a left over from an old version.

> > +    }
> > +    if (rp && qemu_file_get_error(rp)) {
> > +        trace_source_return_path_thread_bad_end();
> > +        source_return_path_bad(ms);
> > +    }
> > +
> > +    trace_source_return_path_thread_end();
> > +out:
> > +    return NULL;
> > +}
> > +
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static int open_outgoing_return_path(MigrationState *ms)
> 
> Uh.. surely this should be open_incoming_return_path(); it's designed
> to be used on the source side, AFAICT.
> 
> > +{
> > +
> > +    ms->rp_state.file = qemu_file_get_return_path(ms->file);
> > +    if (!ms->rp_state.file) {
> > +        return -1;
> > +    }
> > +
> > +    trace_open_outgoing_return_path();
> > +    qemu_thread_create(&ms->rp_state.rp_thread, "return path",
> > +                       source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
> > +
> > +    trace_open_outgoing_return_path_continue();
> > +
> > +    return 0;
> > +}
> > +
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static void await_outgoing_return_path_close(MigrationState *ms)
> 
> Likewise "incoming" here, surely.

I've changed those two to open_source_return_path(), which seems less ambiguous;
is that OK?

Dave

> 
> > +{
> > +    /*
> > +     * If this is a normal exit then the destination will send a SHUT and the
> > +     * rp_thread will exit, however if there's an error we need to cause
> > +     * it to exit, which we can do by a shutdown.
> > +     * (canceling must also shutdown to stop us getting stuck here if
> > +     * the destination died at just the wrong place)
> > +     */
> > +    if (qemu_file_get_error(ms->file) && ms->rp_state.file) {
> > +        qemu_file_shutdown(ms->rp_state.file);
> > +    }
> > +    trace_await_outgoing_return_path_joining();
> > +    qemu_thread_join(&ms->rp_state.rp_thread);
> > +    trace_await_outgoing_return_path_close();
> > +}
> > +
> > +/*
> > + * Master migration thread on the source VM.
> > + * It drives the migration and pumps the data down the outgoing channel.
> > + */
> >  static void *migration_thread(void *opaque)
> >  {
> >      MigrationState *s = opaque;
> > diff --git a/trace-events b/trace-events
> > index 4f3eff8..1951b25 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1374,12 +1374,25 @@ flic_no_device_api(int err) "flic: no Device Contral API support %d"
> >  flic_reset_failed(int err) "flic: reset failed %d"
> >  
> >  # migration.c
> > +await_outgoing_return_path_close(void) ""
> > +await_outgoing_return_path_joining(void) ""
> >  migrate_set_state(int new_state) "new state %d"
> >  migrate_fd_cleanup(void) ""
> > +migrate_fd_cleanup_src_rp(void) ""
> >  migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> >  migrate_pending(uint64_t size, uint64_t max) "pending size %" PRIu64 " max %" PRIu64
> >  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> > +open_outgoing_return_path(void) ""
> > +open_outgoing_return_path_continue(void) ""
> > +source_return_path_thread_bad_end(void) ""
> > +source_return_path_bad_header_com(void) ""
> > +source_return_path_thread_end(void) ""
> > +source_return_path_thread_entry(void) ""
> > +source_return_path_thread_failed_read_cmd_data(void) ""
> > +source_return_path_thread_loop_top(void) ""
> > +source_return_path_thread_pong(uint32_t val) "%x"
> > +source_return_path_thread_shut(uint32_t val) "%x"
> >  migrate_transferred(uint64_t tranferred, uint64_t time_spent, double bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %g max_size %" PRId64
> >  
> >  # migration/rdma.c
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text
  2015-03-10  6:11   ` David Gibson
@ 2015-03-20 18:48     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-20 18:48 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:36PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Misses out lines that are all the expected value so the output
> > can be quite compact depending on the circumstance.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  arch_init.c                   | 39 +++++++++++++++++++++++++++++++++++++++
> >  include/migration/migration.h |  1 +
> >  2 files changed, 40 insertions(+)
> > 
> > diff --git a/arch_init.c b/arch_init.c
> > index 91645cc..fe0df0d 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -776,6 +776,45 @@ static void reset_ram_globals(void)
> >  
> >  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> >  
> > +/*
> > + * 'expected' is the value you expect the bitmap mostly to be full
> > + * of and it won't bother printing lines that are all this value
> > + * if 'todump' is null the migration bitmap is dumped.
> > + */
> > +void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
> > +{
> > +    int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
> > +
> > +    int64_t cur;
> > +    int64_t linelen = 128;
> > +    char linebuf[129];
> > +
> > +    if (!todump) {
> > +        todump = migration_bitmap;
> 
> Any reason not to just have the caller pass migration_bitmap, if
> that's what they want?

migration_bitmap is a static in this file; I wanted to be
able to add calls to this debug function from elsewhere, e.g.
before aborts/crashes wherever I was testing things that
were going wrong.

> > +    }
> > +
> > +    for (cur = 0; cur < ram_pages; cur += linelen) {
> > +        int64_t curb;
> > +        bool found = false;
> > +        /*
> > +         * Last line; catch the case where the line length
> > +         * is longer than remaining ram
> > +         */
> > +        if (cur+linelen > ram_pages) {
> > +            linelen = ram_pages - cur;
> > +        }
> > +        for (curb = 0; curb < linelen; curb++) {
> > +            bool thisbit = test_bit(cur+curb, todump);
> > +            linebuf[curb] = thisbit ? '1' : '.';
> > +            found = found || (thisbit != expected);
> > +        }
> > +        if (found) {
> > +            linebuf[curb] = '\0';
> > +            fprintf(stderr,  "0x%08" PRIx64 " : %s\n", cur, linebuf);
> 
> Might be slightly more readable if it printed GPAs instead of page
> numbers.

Indeed it would, but translating a bitmap position into a GPA is non-trivial
anyway; RAM blocks aren't necessarily in order within the bitmap,
and indeed may be mixed on any one line.

Dave

> 
> > +        }
> > +    }
> > +}
> > +
> >  static int ram_save_setup(QEMUFile *f, void *opaque)
> >  {
> >      RAMBlock *block;
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 5242ead..3776e86 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -156,6 +156,7 @@ uint64_t xbzrle_mig_pages_cache_miss(void);
> >  double xbzrle_mig_cache_miss_rate(void);
> >  
> >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
> > +void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
> >  
> >  /**
> >   * @migrate_add_blocker - prevent migration from proceeding
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-19  9:33             ` Dr. David Alan Gilbert
@ 2015-03-23  2:20               ` David Gibson
  2015-03-30  8:19                 ` Paolo Bonzini
  2015-03-30  8:17               ` Paolo Bonzini
  1 sibling, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  2:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 9323 bytes --]

On Thu, Mar 19, 2015 at 09:33:31AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Mar 18, 2015 at 05:59:51PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Fri, Mar 13, 2015 at 11:19:06AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > On Wed, Feb 25, 2015 at 04:51:46PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > 
> > > > > > > Once postcopy is enabled (with migrate_set_capability), the migration
> > > > > > > will still start on precopy mode.  To cause a transition into postcopy
> > > > > > > the:
> > > > > > > 
> > > > > > >   migrate_start_postcopy
> > > > > > > 
> > > > > > > command must be issued.  Postcopy will start sometime after this
> > > > > > > (when it's next checked in the migration loop).
> > > > > > > 
> > > > > > > Issuing the command before migration has started will error,
> > > > > > > and issuing after it has finished is ignored.
> > > > > > > 
> > > > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > > > Reviewed-by: Eric Blake <eblake@redhat.com>
> > > > > > > ---
> > > > > > >  hmp-commands.hx               | 15 +++++++++++++++
> > > > > > >  hmp.c                         |  7 +++++++
> > > > > > >  hmp.h                         |  1 +
> > > > > > >  include/migration/migration.h |  3 +++
> > > > > > >  migration/migration.c         | 22 ++++++++++++++++++++++
> > > > > > >  qapi-schema.json              |  8 ++++++++
> > > > > > >  qmp-commands.hx               | 19 +++++++++++++++++++
> > > > > > >  7 files changed, 75 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > > > > > index e37bc8b..03b8b78 100644
> > > > > > > --- a/hmp-commands.hx
> > > > > > > +++ b/hmp-commands.hx
> > > > > > > @@ -985,6 +985,21 @@ Enable/Disable the usage of a capability @var{capability} for migration.
> > > > > > >  ETEXI
> > > > > > >  
> > > > > > >      {
> > > > > > > +        .name       = "migrate_start_postcopy",
> > > > > > > +        .args_type  = "",
> > > > > > > +        .params     = "",
> > > > > > > +        .help       = "Switch migration to postcopy mode",
> > > > > > > +        .mhandler.cmd = hmp_migrate_start_postcopy,
> > > > > > > +    },
> > > > > > > +
> > > > > > > +STEXI
> > > > > > > +@item migrate_start_postcopy
> > > > > > > +@findex migrate_start_postcopy
> > > > > > > +Switch in-progress migration to postcopy mode. Ignored after the end of
> > > > > > > +migration (or once already in postcopy).
> > > > > > > +ETEXI
> > > > > > > +
> > > > > > > +    {
> > > > > > >          .name       = "client_migrate_info",
> > > > > > >          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
> > > > > > >          .params     = "protocol hostname port tls-port cert-subject",
> > > > > > > diff --git a/hmp.c b/hmp.c
> > > > > > > index b47f331..df9736c 100644
> > > > > > > --- a/hmp.c
> > > > > > > +++ b/hmp.c
> > > > > > > @@ -1140,6 +1140,13 @@ void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict)
> > > > > > >      }
> > > > > > >  }
> > > > > > >  
> > > > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict)
> > > > > > > +{
> > > > > > > +    Error *err = NULL;
> > > > > > > +    qmp_migrate_start_postcopy(&err);
> > > > > > > +    hmp_handle_error(mon, &err);
> > > > > > > +}
> > > > > > > +
> > > > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict)
> > > > > > >  {
> > > > > > >      const char *protocol  = qdict_get_str(qdict, "protocol");
> > > > > > > diff --git a/hmp.h b/hmp.h
> > > > > > > index 4bb5dca..da1334f 100644
> > > > > > > --- a/hmp.h
> > > > > > > +++ b/hmp.h
> > > > > > > @@ -64,6 +64,7 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> > > > > > > +void hmp_migrate_start_postcopy(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_set_password(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_expire_password(Monitor *mon, const QDict *qdict);
> > > > > > >  void hmp_eject(Monitor *mon, const QDict *qdict);
> > > > > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > > > > index e6a814a..293c83e 100644
> > > > > > > --- a/include/migration/migration.h
> > > > > > > +++ b/include/migration/migration.h
> > > > > > > @@ -104,6 +104,9 @@ struct MigrationState
> > > > > > >      int64_t xbzrle_cache_size;
> > > > > > >      int64_t setup_time;
> > > > > > >      int64_t dirty_sync_count;
> > > > > > > +
> > > > > > > +    /* Flag set once the migration has been asked to enter postcopy */
> > > > > > > +    bool start_postcopy;
> > > > > > >  };
> > > > > > >  
> > > > > > >  void process_incoming_migration(QEMUFile *f);
> > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > index a4fc7d7..43ca656 100644
> > > > > > > --- a/migration/migration.c
> > > > > > > +++ b/migration/migration.c
> > > > > > > @@ -372,6 +372,28 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
> > > > > > >      }
> > > > > > >  }
> > > > > > >  
> > > > > > > +void qmp_migrate_start_postcopy(Error **errp)
> > > > > > > +{
> > > > > > > +    MigrationState *s = migrate_get_current();
> > > > > > > +
> > > > > > > +    if (!migrate_postcopy_ram()) {
> > > > > > > +        error_setg(errp, "Enable postcopy with migration_set_capability before"
> > > > > > > +                         " the start of migration");
> > > > > > > +        return;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    if (s->state == MIG_STATE_NONE) {
> > > > > > > +        error_setg(errp, "Postcopy must be started after migration has been"
> > > > > > > +                         " started");
> > > > > > > +        return;
> > > > > > > +    }
> > > > > > > +    /*
> > > > > > > +     * we don't error if migration has finished since that would be racy
> > > > > > > +     * with issuing this command.
> > > > > > > +     */
> > > > > > > +    atomic_set(&s->start_postcopy, true);
> > > > > > 
> > > > > > Why atomic_set?
> > > > > 
> > > > > It's being read by the migration thread, this is happening in the main thread.
> > > > > 
> > > > > There's no strict ordering requirement or anything.
> > > > 
> > > > I don't think you need an atomic then.  AFAIK an atomic_set() in
> > > > isolation without some sort of atomic on the other side is pretty much
> > > > meaningless.
> > > 
> > > The other side has an atomic_read:
> > > 
> > >                 if (migrate_postcopy_ram() &&
> > >                     s->state != MIGRATION_STATUS_POSTCOPY_ACTIVE &&
> > >                     pend_nonpost <= max_size &&
> > >                     atomic_read(&s->start_postcopy)) {
> > > 
> > >                     if (!postcopy_start(s, &old_vm_running)) {
> > >                         current_active_type = MIGRATION_STATUS_POSTCOPY_ACTIVE;
> > >                         entered_postcopy = true;
> > >                     }
> > > 
> > > so it is at least symmetric.
> > 
> > But still pointless.  Atomicity isn't magic pixie dust; it only makes
> > sense if you're making atomic specific operations that need to be.
> > Simple integer loads and stores are already atomic.  Unless at least
> > some of the atomic operations are something more complex, there's
> > really no point to atomically marked operations.
> 
> OK, I'll kill it off.
> 
> It'll work in practice, but I still believe that what you're saying isn't
> safe C:
>    1) There's no barrier after the write, so there's no guarantee the other
>       thread will eventually see it (in practice we've got other pthread ops
>       we take so we will get a barrier somewhere, and most CPUs eventually
>       do propagate the store).

Sorry, I should have been clearer.  If you need a memory barrier, by
all means include a memory barrier.  But that should be explicit:
atomic set/read operations often include barriers, but it's not
obvious which side will include what barrier.

>    2) The read side could legally be optimised out of the loop by the compiler.
>       (but in practice wont be because compilers won't optimise that far).

That one's a trickier question.  Compilers are absolutely capable of
optimizing that far, *but* the C rules about when they're allowed to
assume that in-memory values remain unchanged are pretty conservative.  I
think any function call in the loop will require the value to be reloaded,
for example.  That said, a (compiler-only) memory barrier might
be appropriate to ensure that reload.
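
Something like this, purely as a sketch of the idea (a GCC-style compiler
barrier; strictly-conforming C would still want the flag to be atomic or
volatile):

static int start_flag;    /* illustrative stand-in for s->start_postcopy */

static void wait_for_flag(void)
{
    while (!start_flag) {
        /* ... do an iteration of migration work ... */

        /*
         * Compiler-only barrier: forces start_flag to be re-read on the
         * next loop iteration instead of being cached in a register.
         */
        __asm__ __volatile__("" ::: "memory");
    }
}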

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-20 12:37         ` Dr. David Alan Gilbert
@ 2015-03-23  2:25           ` David Gibson
  2015-03-24 20:04             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  2:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 2533 bytes --]

On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Modify save_live_pending to return separate postcopiable and
> > > > > non-postcopiable counts.
> > > > > 
> > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > 
> > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > in this patch - is it still necessary with the change to
> > > > save_live_pending?
> > > 
> > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > to decide which devices must be completed at that point.
> > 
> > Couldn't they check for non-zero postcopiable state from
> > save_live_pending instead?
> 
> That would be a bit weird.
> 
> At the moment for each device we call the:
>        save_live_setup method (from qemu_savevm_state_begin)
> 
>    0...multiple times we call:
>        save_live_pending
>        save_live_iterate
> 
>    and then we always call
>        save_live_complete
> 
> 
> To my mind we have to call save_live_complete for any device
> that we've called save_live_setup on (maybe it allocated something
> in _setup that it clears up in _complete).
> 
> save_live_pending could perfectly well return 0 remaining at the end of
> the migrate for our device, and thus if we used that then we wouldn't
> call save_live_complete.

Um.. I don't follow.  I was suggesting that at the precopy->postcopy
transition point you call save_live_complete for everything that
reports 0 post-copiable state.


Then again, a different approach would be to split the
save_live_complete hook into (possibly NULL) "complete precopy" and
"complete postcopy" hooks.  The core would ensure that every chunk of
state has both completion hooks called (unless NULL).  That might also
address my concerns about the no longer entirely accurate
save_live_complete function name.
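
Roughly this shape, purely to illustrate the split I mean (the struct and
field names are invented, not the real SaveVMHandlers):

#include <stddef.h>
#include <stdbool.h>

typedef struct SaveVMHandlersSketch {
    int (*complete_precopy)(void *opaque);   /* may be NULL */
    int (*complete_postcopy)(void *opaque);  /* may be NULL */
    void *opaque;
} SaveVMHandlersSketch;

/* Call one class of completion hook for every device that provides it. */
static void complete_phase(SaveVMHandlersSketch *h, size_t n, bool postcopy)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int (*hook)(void *) = postcopy ? h[i].complete_postcopy
                                       : h[i].complete_precopy;
        if (hook) {
            hook(h[i].opaque);
        }
    }
}

/*
 * precopy-only migration:  complete_phase(h, n, false);
 *                          complete_phase(h, n, true);   at the end
 * postcopy migration:      complete_phase(h, n, false);  at the switchover
 *                          complete_phase(h, n, true);   at the very end
 */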

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-03-20 18:17     ` Dr. David Alan Gilbert
@ 2015-03-23  2:37       ` David Gibson
  2015-04-01 15:14         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  2:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 11224 bytes --]

On Fri, Mar 20, 2015 at 06:17:31PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:35PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Open a return path, and handle messages that are received upon it.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/migration.h |   8 ++
> > >  migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
> > >  trace-events                  |  13 +++
> > >  3 files changed, 198 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index 6775747..5242ead 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -73,6 +73,14 @@ struct MigrationState
> > >  
> > >      int state;
> > >      MigrationParams params;
> > > +
> > > +    /* State related to return path */
> > > +    struct {
> > > +        QEMUFile     *file;
> > > +        QemuThread    rp_thread;
> > > +        bool          error;
> > > +    } rp_state;
> > > +
> > >      double mbps;
> > >      int64_t total_time;
> > >      int64_t downtime;
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 80d234c..34cd4fe 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
> > >      return head;
> > >  }
> > >  
> > > +/*
> > > + * Return true if we're already in the middle of a migration
> > > + * (i.e. any of the active or setup states)
> > > + */
> > > +static bool migration_already_active(MigrationState *ms)
> > > +{
> > > +    switch (ms->state) {
> > > +    case MIG_STATE_ACTIVE:
> > > +    case MIG_STATE_SETUP:
> > > +        return true;
> > > +
> > > +    default:
> > > +        return false;
> > > +
> > > +    }
> > > +}
> > > +
> > >  static void get_xbzrle_cache_stats(MigrationInfo *info)
> > >  {
> > >      if (migrate_use_xbzrle()) {
> > > @@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> > >      }
> > >  }
> > >  
> > > +static void migrate_fd_cleanup_src_rp(MigrationState *ms)
> > > +{
> > > +    QEMUFile *rp = ms->rp_state.file;
> > > +
> > > +    /*
> > > +     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
> > > +     * cleaned up from a few threads; make sure not to do it twice in parallel
> > > +     */
> > > +    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);
> > 
> > A cmpxchg seems dangerously subtle for such a basic and infrequent
> > operation, but ok.
> 
> I'll take other suggestions; but I'm trying to just do
> 'if the qemu_file still exists close it', and it didn't seem
> worth introducing another state variable to atomically update
> when we've already got the file pointer itself.
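
(A minimal sketch of the close-exactly-once idiom being discussed, written here with
C11 atomics rather than QEMU's atomic_cmpxchg; the File type and close_fn callback
are placeholders, not QEMU API.)

#include <stdatomic.h>
#include <stddef.h>

typedef struct File File;            /* stand-in for QEMUFile */
static _Atomic(File *) rp_file;      /* shared between the racing cleanup paths */

static void close_rp_once(void (*close_fn)(File *))
{
    /* Whichever caller's exchange runs first gets the old pointer back;
     * every later caller sees NULL and does nothing, so the file is
     * closed at most once even if cleanup runs from several threads. */
    File *old = atomic_exchange(&rp_file, NULL);
    if (old) {
        close_fn(old);
    }
}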

Yes, I see the rationale.  My concern is just that the more atomicity
mechanisms are scattered through the code, the harder it is to analyze
and be sure you haven't missed race cases (or introduced them with a
future change).

In short, I prefer to see a simple-as-possible, and preferably
documented, consistent overall concurrency scheme for a data
structure, rather than scattered atomic ops for various variables where
it's difficult to see how all the pieces might relate together.

> > > +    if (rp) {
> > > +        trace_migrate_fd_cleanup_src_rp();
> > > +        qemu_fclose(rp);
> > > +    }
> > > +}
> > > +
> > >  static void migrate_fd_cleanup(void *opaque)
> > >  {
> > >      MigrationState *s = opaque;
> > > @@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
> > >      qemu_bh_delete(s->cleanup_bh);
> > >      s->cleanup_bh = NULL;
> > >  
> > > +    migrate_fd_cleanup_src_rp(s);
> > > +
> > >      if (s->file) {
> > >          trace_migrate_fd_cleanup();
> > >          qemu_mutex_unlock_iothread();
> > > @@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
> > >      QEMUFile *f = migrate_get_current()->file;
> > >      trace_migrate_fd_cancel();
> > >  
> > > +    if (s->rp_state.file) {
> > > +        /* shutdown the rp socket, so causing the rp thread to shutdown */
> > > +        qemu_file_shutdown(s->rp_state.file);
> > 
> > I missed where qemu_file_shutdown() was implemented.  Does this
> > introduce a leftover socket dependency?
> 
> No, it shouldn't.  The shutdown() causes a shutdown(2) syscall to
> be issued on the socket stopping anything blocking on it; it then
> gets closed at the end after the rp thread has exited.
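
(A small illustration, not QEMU code, of the shutdown-then-join-then-close sequence
described above; SHUT_RDWR is assumed since both directions should stop.)

#include <sys/socket.h>
#include <unistd.h>
#include <pthread.h>

static void stop_socket_reader(int sock_fd, pthread_t reader)
{
    /* shutdown() leaves the descriptor open but makes any read() blocked
     * on the socket return 0/EOF, so the reader thread can notice and exit. */
    shutdown(sock_fd, SHUT_RDWR);
    /* Only once the thread has gone away is it safe to release the fd. */
    pthread_join(reader, NULL);
    close(sock_fd);
}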


Sorry, that's not what I meant.  I meant: is this a hole in the
abstraction of the QEMUFile, because it assumes that what you're
dealing with here is indeed a socket, rather than something else?

> > > +    }
> > > +
> > >      do {
> > >          old_state = s->state;
> > >          if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
> > > @@ -658,8 +697,145 @@ int64_t migrate_xbzrle_cache_size(void)
> > >      return s->xbzrle_cache_size;
> > >  }
> > >  
> > > -/* migration thread support */
> > > +/*
> > > + * Something bad happened to the RP stream, mark an error
> > > + * The caller shall print something to indicate why
> > > + */
> > > +static void source_return_path_bad(MigrationState *s)
> > > +{
> > > +    s->rp_state.error = true;
> > > +    migrate_fd_cleanup_src_rp(s);
> > > +}
> > > +
> > > +/*
> > > + * Handles messages sent on the return path towards the source VM
> > > + *
> > > + */
> > > +static void *source_return_path_thread(void *opaque)
> > > +{
> > > +    MigrationState *ms = opaque;
> > > +    QEMUFile *rp = ms->rp_state.file;
> > > +    uint16_t expected_len, header_len, header_com;
> > > +    const int max_len = 512;
> > > +    uint8_t buf[max_len];
> > > +    uint32_t tmp32;
> > > +    int res;
> > > +
> > > +    trace_source_return_path_thread_entry();
> > > +    while (rp && !qemu_file_get_error(rp) &&
> > > +        migration_already_active(ms)) {
> > > +        trace_source_return_path_thread_loop_top();
> > > +        header_com = qemu_get_be16(rp);
> > > +        header_len = qemu_get_be16(rp);
> > > +
> > > +        switch (header_com) {
> > > +        case MIG_RP_CMD_SHUT:
> > > +        case MIG_RP_CMD_PONG:
> > > +            expected_len = 4;
> > 
> > Could the knowledge of expected lengths be folded into the switch
> > below?  Switching twice on the same thing is a bit icky.
> 
> No, because the length at this point is used to validate the
> length field in the header prior to reading the body.
> The other switch processes the contents of the body that
> have been read.
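
(A compressed sketch of that split, with made-up command names: the first switch only
decides how many body bytes a command may carry, so a bad header is rejected before
any body is read; the second switch runs only once the body is actually in hand.)

#include <stdint.h>

enum { CMD_SHUT = 1, CMD_PONG = 2 };

/* First switch: header validation only - how long may this command's body be? */
static int body_len_for(uint16_t cmd)
{
    switch (cmd) {
    case CMD_SHUT:
    case CMD_PONG:
        return 4;               /* both carry one be32 value */
    default:
        return -1;              /* unknown command, reject before reading */
    }
}

/* Second switch: runs only after body_len_for() accepted the header and the
 * body bytes have been read off the stream. */
static void process(uint16_t cmd, const uint8_t *body)
{
    (void)body;                 /* body[0..3] holds a be32 for both commands */
    switch (cmd) {
    case CMD_SHUT:
        /* act on the 32-bit status the peer sent */
        break;
    case CMD_PONG:
        /* match against an outstanding ping */
        break;
    }
}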

Ok.

> > > +            break;
> > > +
> > > +        default:
> > > +            error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> > > +                    header_com, header_len);
> > > +            source_return_path_bad(ms);
> > > +            goto out;
> > > +        }
> > >  
> > > +        if (header_len > expected_len) {
> > > +            error_report("RP: Received command 0x%04x with"
> > > +                    "incorrect length %d expecting %d",
> > > +                    header_com, header_len,
> > > +                    expected_len);
> > > +            source_return_path_bad(ms);
> > > +            goto out;
> > > +        }
> > > +
> > > +        /* We know we've got a valid header by this point */
> > > +        res = qemu_get_buffer(rp, buf, header_len);
> > > +        if (res != header_len) {
> > > +            trace_source_return_path_thread_failed_read_cmd_data();
> > > +            source_return_path_bad(ms);
> > > +            goto out;
> > > +        }
> > > +
> > > +        /* OK, we have the command and the data */
> > > +        switch (header_com) {
> > > +        case MIG_RP_CMD_SHUT:
> > > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > > +            trace_source_return_path_thread_shut(tmp32);
> > > +            if (tmp32) {
> > > +                error_report("RP: Sibling indicated error %d", tmp32);
> > > +                source_return_path_bad(ms);
> > > +            }
> > > +            /*
> > > +             * We'll let the main thread deal with closing the RP
> > > +             * we could do a shutdown(2) on it, but we're the only user
> > > +             * anyway, so there's nothing gained.
> > > +             */
> > > +            goto out;
> > > +
> > > +        case MIG_RP_CMD_PONG:
> > > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > > +            trace_source_return_path_thread_pong(tmp32);
> > > +            break;
> > > +
> > > +        default:
> > > +            /* This shouldn't happen because we should catch this above */
> > > +            trace_source_return_path_bad_header_com();
> > > +        }
> > > +        /* Latest command processed, now leave a gap for the next one */
> > > +        header_com = MIG_RP_CMD_INVALID;
> > 
> > This assignment will always get overwritten.
> 
> Thanks; gone - it's a left over from an old version.
> 
> > > +    }
> > > +    if (rp && qemu_file_get_error(rp)) {
> > > +        trace_source_return_path_thread_bad_end();
> > > +        source_return_path_bad(ms);
> > > +    }
> > > +
> > > +    trace_source_return_path_thread_end();
> > > +out:
> > > +    return NULL;
> > > +}
> > > +
> > > +__attribute__ (( unused )) /* Until later in patch series */
> > > +static int open_outgoing_return_path(MigrationState *ms)
> > 
> > Uh.. surely this should be open_incoming_return_path(); it's designed
> > to be used on the source side, AFAICT.
> > 
> > > +{
> > > +
> > > +    ms->rp_state.file = qemu_file_get_return_path(ms->file);
> > > +    if (!ms->rp_state.file) {
> > > +        return -1;
> > > +    }
> > > +
> > > +    trace_open_outgoing_return_path();
> > > +    qemu_thread_create(&ms->rp_state.rp_thread, "return path",
> > > +                       source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
> > > +
> > > +    trace_open_outgoing_return_path_continue();
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +__attribute__ (( unused )) /* Until later in patch series */
> > > +static void await_outgoing_return_path_close(MigrationState *ms)
> > 
> > Likewise "incoming" here, surely.
> 
> I've changed those two  to open_source_return_path()  which seems less ambiguous;
> that OK?

Uh.. not really, it just moves the ambiguity to a different place (is
"source return path" the return path *on* the source or *to* the
source?).

Perhaps "open_return_path_on_source" and
"await_return_path_close_on_source"?  I'm not particularly fond of
those, but they're the best I've come up with yet.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure
  2015-03-18 17:58         ` Dr. David Alan Gilbert
@ 2015-03-23  2:48           ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-23  2:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 3034 bytes --]

On Wed, Mar 18, 2015 at 05:58:40PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Fri, Mar 13, 2015 at 01:47:53PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Wed, Feb 25, 2015 at 04:51:49PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > [snip]
> > > > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > > > +    } else {
> > > > > +        mis->postcopy_pmi.state0[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > > > +    }
> > > > > +    if (state & 2) {
> > > > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] |= shifted_mask;
> > > > > +    } else {
> > > > > +        mis->postcopy_pmi.state1[BIT_WORD(bitmap_index)] &= ~shifted_mask;
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Retrieve the state of the given page
> > > > > + * Note: This version for use by callers already holding the lock
> > > > > + */
> > > > > +static PostcopyPMIState postcopy_pmi_get_state_nolock(
> > > > > +                            MigrationIncomingState *mis,
> > > > > +                            size_t bitmap_index)
> > > > > +{
> > > > > +    bool b0, b1;
> > > > > +
> > > > > +    b0 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state0);
> > > > > +    b1 = test_hpbits(mis, bitmap_index, mis->postcopy_pmi.state1);
> > > > > +
> > > > > +    return (b0 ? 1 : 0) + (b1 ? 2 : 0);
> > > > 
> > > > Ugh.. this is a hidden dependency on the PostcopyPMIState enum
> > > > elements never changing value.  Safer to code it as:
> > > >       if (!b0 && !b1) {
> > > >           return POSTCOPY_PMI_MISSING;
> > > >       } else if (...)
> > > >            ...
> > > > 
> > > > and let gcc sort it out.
> > > 
> > > Again, I was trying to make this just the interface; so it doesn't
> > > know or care about the enum mapping; we can change the enum mapping to
> > > the bits without changing this function (or the callers) at all.
> > 
> > So.. I'm not entirely clear what you mean by that.  I think what
> > you're saying is that this function basically returns an arbitrary bit
> > pattern derived from the state maps, and the enum provides the mapping
> > from those bit patterns to meaningful states?
> > 
> > That's.. subtle :/.
> 
> I'm saying that I'd like everywhere to work in terms of the enum; but
> since I can't store the array of enums I need to convert somewhere;
> if I can keep the conversion to only being a couple of functions that know
> about the bit layout and everything else uses those functions, then it
> feels safe/clean.
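
(A self-contained sketch of that arrangement, with invented, simplified state names:
only the two helpers at the bottom know that a 2-bit state is split across two
bitmaps; everything else works purely in terms of the enum.)

#include <stdbool.h>
#include <stddef.h>

typedef enum { PMI_MISSING = 0, PMI_REQUESTED = 1,
               PMI_RECEIVED = 2, PMI_RECEIVED_REQ = 3 } PmiState;

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static bool get_bit(const unsigned long *map, size_t i)
{
    return (map[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1;
}

static void assign_bit(unsigned long *map, size_t i, bool v)
{
    unsigned long mask = 1ul << (i % BITS_PER_LONG);
    if (v) {
        map[i / BITS_PER_LONG] |= mask;
    } else {
        map[i / BITS_PER_LONG] &= ~mask;
    }
}

/* Only these two functions know how the enum maps onto the two bitmaps. */
static PmiState pmi_get(unsigned long *m0, unsigned long *m1, size_t page)
{
    return (PmiState)((get_bit(m0, page) ? 1 : 0) | (get_bit(m1, page) ? 2 : 0));
}

static void pmi_set(unsigned long *m0, unsigned long *m1, size_t page, PmiState s)
{
    assign_bit(m0, page, s & 1);
    assign_bit(m1, page, s & 2);
}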

Hm, yeah, I guess.  It's all a bit confusing, but I see the internal
sense in your scheme.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard Dr. David Alan Gilbert (git)
@ 2015-03-23  3:30   ` David Gibson
  2015-03-23 14:36     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  3:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 24755 bytes --]

On Wed, Feb 25, 2015 at 04:51:50PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Where postcopy is preceded by a period of precopy, the destination will
> have received pages that may have been dirtied on the source after the
> page was sent.  The destination must throw these pages away before
> starting its CPUs.
> 
> Maintain a 'sentmap' of pages that have already been sent.
> Calculate list of sent & dirty pages
> Provide helpers on the destination side to discard these.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c                      | 275 ++++++++++++++++++++++++++++++++++++++-
>  include/migration/migration.h    |  12 ++
>  include/migration/postcopy-ram.h |  34 +++++
>  include/qemu/typedefs.h          |   1 +
>  migration/migration.c            |   1 +
>  migration/postcopy-ram.c         | 111 ++++++++++++++++
>  savevm.c                         |   3 -
>  trace-events                     |   4 +
>  8 files changed, 435 insertions(+), 6 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 7bc5fa6..21e7ebe 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -40,6 +40,7 @@
>  #include "hw/audio/audio.h"
>  #include "sysemu/kvm.h"
>  #include "migration/migration.h"
> +#include "migration/postcopy-ram.h"
>  #include "hw/i386/smbios.h"
>  #include "exec/address-spaces.h"
>  #include "hw/audio/pcspk.h"
> @@ -414,9 +415,17 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t **current_data,
>      return bytes_sent;
>  }
>  
> +/* mr: The region to search for dirty pages in
> + * start: Start address (typically so we can continue from previous page)
> + * ram_addr_abs: Pointer into which to store the address of the dirty page
> + *               within the global ram_addr space
> + *
> + * Returns: byte offset within memory region of the start of a dirty page
> + */
>  static inline
>  ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
> -                                                 ram_addr_t start)
> +                                                 ram_addr_t start,
> +                                                 ram_addr_t *ram_addr_abs)
>  {
>      unsigned long base = mr->ram_addr >> TARGET_PAGE_BITS;
>      unsigned long nr = base + (start >> TARGET_PAGE_BITS);
> @@ -435,6 +444,7 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
>          clear_bit(next, migration_bitmap);
>          migration_dirty_pages--;
>      }
> +    *ram_addr_abs = next << TARGET_PAGE_BITS;
>      return (next - base) << TARGET_PAGE_BITS;
>  }
>  
> @@ -571,6 +581,19 @@ static void migration_bitmap_sync(void)
>      }
>  }
>  
> +static RAMBlock *ram_find_block(const char *id)
> +{
> +    RAMBlock *block;
> +
> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> +        if (!strcmp(id, block->idstr)) {
> +            return block;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
>  /*
>   * ram_save_page: Send the given page to the stream
>   *
> @@ -659,13 +682,16 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
>      bool complete_round = false;
>      int bytes_sent = 0;
>      MemoryRegion *mr;
> +    ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
> +                                 ram_addr_t space */
>  
>      if (!block)
>          block = QTAILQ_FIRST(&ram_list.blocks);
>  
>      while (true) {
>          mr = block->mr;
> -        offset = migration_bitmap_find_and_reset_dirty(mr, offset);
> +        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> +                                                       &dirty_ram_abs);
>          if (complete_round && block == last_seen_block &&
>              offset >= last_offset) {
>              break;
> @@ -683,6 +709,11 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
>  
>              /* if page is unmodified, continue to the next */
>              if (bytes_sent > 0) {
> +                MigrationState *ms = migrate_get_current();
> +                if (ms->sentmap) {
> +                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> +                }
> +
>                  last_sent_block = block;
>                  break;
>              }
> @@ -742,12 +773,19 @@ void free_xbzrle_decoded_buf(void)
>  
>  static void migration_end(void)
>  {
> +    MigrationState *s = migrate_get_current();
> +
>      if (migration_bitmap) {
>          memory_global_dirty_log_stop();
>          g_free(migration_bitmap);
>          migration_bitmap = NULL;
>      }
>  
> +    if (s->sentmap) {
> +        g_free(s->sentmap);
> +        s->sentmap = NULL;
> +    }
> +
>      XBZRLE_cache_lock();
>      if (XBZRLE.cache) {
>          cache_fini(XBZRLE.cache);
> @@ -815,6 +853,232 @@ void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
>      }
>  }
>  
> +/* **** functions for postcopy ***** */
> +
> +/*
> + * A helper to get 32 bits from a bit map; trivial for HOST_LONG_BITS=32
> + * messier for 64; the bitmaps are actually long's that are 32 or 64bit
> + */
> +static uint32_t get_32bits_map(unsigned long *map, int64_t start)
> +{
> +#if HOST_LONG_BITS == 64
> +    uint64_t tmp64;
> +
> +    tmp64 = map[start / 64];
> +    return (start & 32) ? (tmp64 >> 32) : (tmp64 & 0xffffffffu);
> +#elif HOST_LONG_BITS == 32
> +    /*
> +     * Irrespective of host endianness, sentmap[n] is for pages earlier
> +     * than sentmap[n+1] so we can't just cast up
> +     */
> +    return map[start / 32];
> +#else
> +#error "Host long other than 64/32 not supported"
> +#endif
> +}
> +
> +/*
> + * A helper to put 32 bits into a bit map; trivial for HOST_LONG_BITS=32
> + * messier for 64; the bitmaps are actually long's that are 32 or 64bit
> + */
> +__attribute__ (( unused )) /* Until later in patch series */
> +static void put_32bits_map(unsigned long *map, int64_t start,
> +                           uint32_t v)
> +{
> +#if HOST_LONG_BITS == 64
> +    uint64_t tmp64 = v;
> +    uint64_t mask = 0xffffffffu;
> +
> +    if (start & 32) {
> +        tmp64 = tmp64 << 32;
> +        mask =  mask << 32;
> +    }
> +
> +    map[start / 64] = (map[start / 64] & ~mask) | tmp64;
> +#elif HOST_LONG_BITS == 32
> +    /*
> +     * Irrespective of host endianness, sentmap[n] is for pages earlier
> +     * than sentmap[n+1] so we can't just cast up
> +     */
> +    map[start / 32] = v;
> +#else
> +#error "Host long other than 64/32 not supported"
> +#endif
> +}
> +
> +/*
> + * When working on 32bit chunks of a bitmap where the only valid section
> + * is between start..end (inclusive), generate a mask with only those
> + * valid bits set for the current 32bit word within that bitmask.
> + */
> +static int make_32bit_mask(unsigned long start, unsigned long end,
> +                           unsigned long cur32)
> +{
> +    unsigned long first32, last32;
> +    uint32_t mask = ~(uint32_t)0;
> +    first32 = start / 32;
> +    last32 = end / 32;
> +
> +    if ((cur32 == first32) && (start & 31)) {
> +        /* e.g. (start & 31) = 3
> +         *         1 << .    -> 2^3
> +         *         . - 1     -> 2^3 - 1 i.e. mask 2..0
> +         *         ~.        -> mask 31..3
> +         */
> +        mask &= ~((((uint32_t)1) << (start & 31)) - 1);
> +    }
> +
> +    if ((cur32 == last32) && ((end & 31) != 31)) {
> +        /* e.g. (end & 31) = 3
> +         *            .   +1 -> 4
> +         *         1 << .    -> 2^4
> +         *         . -1      -> 2^4 - 1
> +         *                   = mask set 3..0
> +         */
> +        mask &= (((uint32_t)1) << ((end & 31) + 1)) - 1;
> +    }
> +
> +    return mask;
> +}

Urgh.. looks correct, but it continues to make me question whether this is a
sensible wire encoding of the discard map.
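
(For reference, a stripped-down sketch of what that encoding amounts to, using a
hypothetical emit helper: the sent-and-dirty bitmap is walked in 32-page chunks and
every non-zero chunk becomes one (page-index, 32-bit mask) pair, batched per RAMBlock.)

#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for queueing one (index, mask) pair into a
 * discard command for the given RAMBlock. */
static void emit_discard_pair(const char *block, uint64_t first_page,
                              uint32_t mask)
{
    printf("%s: pages %" PRIu64 "..+31 mask 0x%08" PRIx32 "\n",
           block, first_page, mask);
}

static void send_discards(const char *block, const uint32_t *sent_and_dirty,
                          size_t nwords)
{
    for (size_t w = 0; w < nwords; w++) {
        if (sent_and_dirty[w]) {
            /* each set bit: a page sent during precopy but dirtied afterwards */
            emit_discard_pair(block, (uint64_t)w * 32, sent_and_dirty[w]);
        }
    }
}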

> +
> +/*
> + * Callback from ram_postcopy_each_ram_discard for each RAMBlock
> + * start,end: Indexes into the bitmap for the first and last bit
> + *            representing the named block
> + */
> +static int pc_send_discard_bm_ram(MigrationState *ms,
> +                                  PostcopyDiscardState *pds,
> +                                  unsigned long start, unsigned long end)

I know it's a long name, but I'd prefer to see this prefixed with
"postcopy_".  I have to keep reminding myself that "pc" means
"postcopy" here, not that it's a PC machine type specific function.

> +{
> +    /*
> +     * There is no guarantee that start, end are on convenient 32bit multiples
> +     * (We always send 32bit chunks over the wire, irrespective of long size)
> +     */
> +    unsigned long first32, last32, cur32;
> +    first32 = start / 32;
> +    last32 = end / 32;
> +
> +    for (cur32 = first32; cur32 <= last32; cur32++) {
> +        /* Deal with start/end not on alignment */
> +        uint32_t mask = make_32bit_mask(start, end, cur32);
> +
> +        uint32_t data = get_32bits_map(ms->sentmap, cur32 * 32);
> +        data &= mask;
> +
> +        if (data) {
> +            postcopy_discard_send_chunk(ms, pds, (cur32-first32) * 32, data);
> +        }
> +    }
> +
> +    return 0;
> +}


> +/*
> + * Utility for the outgoing postcopy code.
> + *   Calls postcopy_send_discard_bm_ram for each RAMBlock
> + *   passing it bitmap indexes and name.
> + * Returns: 0 on success
> + * (qemu_ram_foreach_block ends up passing unscaled lengths
> + *  which would mean postcopy code would have to deal with target page)
> + */
> +static int pc_each_ram_discard(MigrationState *ms)

s/discard/send_discard/

> +{
> +    struct RAMBlock *block;
> +    int ret;
> +
> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> +        unsigned long first = block->offset >> TARGET_PAGE_BITS;
> +        unsigned long last = (block->offset + (block->max_length-1))
> +                                >> TARGET_PAGE_BITS;
> +        PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
> +                                                               first & 31,
> +                                                               block->idstr);
> +
> +        /*
> +         * Postcopy sends chunks of bitmap over the wire, but it
> +         * just needs indexes at this point, avoids it having
> +         * target page specific code.
> +         */
> +        ret = pc_send_discard_bm_ram(ms, pds, first, last);
> +        postcopy_discard_send_finish(ms, pds);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/*
> + * Transmit the set of pages to be discarded after precopy to the target
> + * these are pages that have been sent previously but have been dirtied
> + * Hopefully this is pretty sparse
> + */
> +int ram_postcopy_send_discard_bitmap(MigrationState *ms)
> +{
> +    /* This should be our last sync, the src is now paused */
> +    migration_bitmap_sync();
> +
> +    /*
> +     * Update the sentmap to be  sentmap&=dirty
> +     */
> +    bitmap_and(ms->sentmap, ms->sentmap, migration_bitmap,
> +               last_ram_offset() >> TARGET_PAGE_BITS);
> +
> +
> +    trace_ram_postcopy_send_discard_bitmap();
> +#ifdef DEBUG_POSTCOPY
> +    ram_debug_dump_bitmap(ms->sentmap, false);
> +#endif
> +
> +    return pc_each_ram_discard(ms);
> +}
> +
> +/*
> + * At the start of the postcopy phase of migration, any now-dirty
> + * precopied pages are discarded.
> + *
> + * start..end is an inclusive range of bits indexed in the source
> + *    VMs bitmap for this RAMBlock, source_target_page_bits tells
> + *    us what one of those bits represents.
> + *
> + * start/end are offsets from the start of the bitmap for RAMBlock 'block_name'
> + *
> + * Returns 0 on success.
> + */
> +int ram_discard_range(MigrationIncomingState *mis,
> +                      const char *block_name,
> +                      uint64_t start, uint64_t end)
> +{
> +    assert(end >= start);
> +
> +    RAMBlock *rb = ram_find_block(block_name);
> +
> +    if (!rb) {
> +        error_report("ram_discard_range: Failed to find block '%s'",
> +                     block_name);
> +        return -1;
> +    }
> +
> +    uint64_t index_offset = rb->offset >> TARGET_PAGE_BITS;
> +    postcopy_pmi_discard_range(mis, start + index_offset, (end - start) + 1);
> +
> +    /* +1 gives the byte after the end of the last page to be discarded */
> +    ram_addr_t end_offset = (end+1) << TARGET_PAGE_BITS;
> +    uint8_t *host_startaddr = rb->host + (start << TARGET_PAGE_BITS);
> +    uint8_t *host_endaddr;
> +
> +    if (end_offset <= rb->used_length) {
> +        host_endaddr   = rb->host + (end_offset-1);
> +        return postcopy_ram_discard_range(mis, host_startaddr, host_endaddr);
> +    } else {
> +        error_report("ram_discard_range: Overrun block '%s' (%" PRIu64
> +                     "/%" PRIu64 "/%zu)",
> +                     block_name, start, end, rb->used_length);
> +        return -1;
> +    }
> +}
> +
>  static int ram_save_setup(QEMUFile *f, void *opaque)
>  {
>      RAMBlock *block;
> @@ -854,7 +1118,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  
>          acct_clear();
>      }
> -
>      qemu_mutex_lock_iothread();
>      qemu_mutex_lock_ramlist();
>      bytes_transferred = 0;
> @@ -864,6 +1127,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      migration_bitmap = bitmap_new(ram_bitmap_pages);
>      bitmap_set(migration_bitmap, 0, ram_bitmap_pages);
>  
> +    if (migrate_postcopy_ram()) {
> +        MigrationState *s = migrate_get_current();
> +        s->sentmap = bitmap_new(ram_bitmap_pages);
> +        bitmap_clear(s->sentmap, 0, ram_bitmap_pages);
> +    }
> +
>      /*
>       * Count the total number of pages used by ram blocks not including any
>       * gaps due to alignment or unplugs.
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 86200b9..e749f4c 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -125,6 +125,13 @@ struct MigrationState
>  
>      /* Flag set once the migration has been asked to enter postcopy */
>      bool start_postcopy;
> +
> +    /* bitmap of pages that have been sent at least once
> +     * only maintained and used in postcopy at the moment
> +     * where it's used to send the dirtymap at the start
> +     * of the postcopy phase
> +     */
> +    unsigned long *sentmap;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> @@ -194,6 +201,11 @@ double xbzrle_mig_cache_miss_rate(void);
>  
>  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>  void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
> +/* For outgoing discard bitmap */
> +int ram_postcopy_send_discard_bitmap(MigrationState *ms);
> +/* For incoming postcopy discard */
> +int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
> +                      uint64_t start, uint64_t end);
>  
>  /**
>   * @migrate_add_blocker - prevent migration from proceeding
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> index e93ee8a..1fec1c1 100644
> --- a/include/migration/postcopy-ram.h
> +++ b/include/migration/postcopy-ram.h
> @@ -28,4 +28,38 @@ void postcopy_pmi_destroy(MigrationIncomingState *mis);
>  void postcopy_pmi_discard_range(MigrationIncomingState *mis,
>                                  size_t start, size_t npages);
>  void postcopy_pmi_dump(MigrationIncomingState *mis);
> +
> +/*
> + * Discard the contents of memory start..end inclusive.
> + * We can assume that if we've been called postcopy_ram_hosttest returned true
> + */
> +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> +                               uint8_t *end);
> +
> +
> +/*
> + * Called at the start of each RAMBlock by the bitmap code
> + * offset is the bit within the first 32bit chunk of mask
> + * that represents the first page of the RAM Block
> + * Returns a new PDS
> + */
> +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> +                                                 uint8_t offset,
> +                                                 const char *name);
> +
> +/*
> + * Called by the bitmap code for each chunk to discard
> + * May send a discard message, may just leave it queued to
> + * be sent later
> + */
> +void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
> +                                unsigned long pos, uint32_t bitmap);
> +
> +/*
> + * Called at the end of each RAMBlock by the bitmap code
> + * Sends any outstanding discard messages, frees the PDS
> + */
> +void postcopy_discard_send_finish(MigrationState *ms,
> +                                  PostcopyDiscardState *pds);
> +
>  #endif
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 924eeb6..0651275 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
>  typedef struct PCIHostState PCIHostState;
>  typedef struct PCMCIACardState PCMCIACardState;
>  typedef struct PixelFormat PixelFormat;
> +typedef struct PostcopyDiscardState PostcopyDiscardState;
>  typedef struct PostcopyPMI PostcopyPMI;
>  typedef struct PropertyInfo PropertyInfo;
>  typedef struct Property Property;
> diff --git a/migration/migration.c b/migration/migration.c
> index 6b20b56..850fe1a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -22,6 +22,7 @@
>  #include "block/block.h"
>  #include "qemu/sockets.h"
>  #include "migration/block.h"
> +#include "migration/postcopy-ram.h"
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 4f29055..391e9c6 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -28,6 +28,19 @@
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  
> +#define MAX_DISCARDS_PER_COMMAND 12
> +
> +struct PostcopyDiscardState {
> +    const char *name;
> +    uint16_t cur_entry;
> +    uint64_t addrlist[MAX_DISCARDS_PER_COMMAND];
> +    uint32_t masklist[MAX_DISCARDS_PER_COMMAND];
> +    uint8_t  offset;  /* Offset within 32bit mask at addr0 representing 1st
> +                         page of block */
> +    unsigned int nsentwords;
> +    unsigned int nsentcmds;
> +};
> +
>  /* Postcopy needs to detect accesses to pages that haven't yet been copied
>   * across, and efficiently map new pages in, the techniques for doing this
>   * are target OS specific.
> @@ -364,6 +377,21 @@ out:
>      return ret;
>  }
>  
> +/*
> + * Discard the contents of memory start..end inclusive.
> + * We can assume that if we've been called postcopy_ram_hosttest returned true
> + */
> +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> +                               uint8_t *end)
> +{
> +    if (madvise(start, (end-start)+1, MADV_DONTNEED)) {
> +        perror("postcopy_ram_discard_range MADV_DONTNEED");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  #else
>  /* No target OS support, stubs just fail */
>  
> @@ -380,5 +408,88 @@ void postcopy_hook_early_receive(MigrationIncomingState *mis,
>      /* We don't support postcopy so don't care */
>  }
>  
> +void postcopy_pmi_destroy(MigrationIncomingState *mis)
> +{
> +    /* Called in normal cleanup path - so it's OK */
> +}
> +
> +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> +                                size_t start, size_t npages)
> +{
> +    assert(0);
> +}
> +
> +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> +                               uint8_t *end)
> +{
> +    assert(0);
> +}
>  #endif
>  
> +/* ------------------------------------------------------------------------- */
> +
> +/*
> + * Called at the start of each RAMBlock by the bitmap code
> + * offset is the bit within the first 64bit chunk of mask
> + * that represents the first page of the RAM Block
> + * Returns a new PDS
> + */
> +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> +                                                 uint8_t offset,
> +                                                 const char *name)
> +{
> +    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));
> +
> +    if (res) {
> +        res->name = name;
> +        res->cur_entry = 0;
> +        res->nsentwords = 0;
> +        res->nsentcmds = 0;
> +        res->offset = offset;
> +    }
> +
> +    return res;
> +}
> +
> +/*
> + * Called by the bitmap code for each chunk to discard
> + * May send a discard message, may just leave it queued to
> + * be sent later
> + */
> +void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
> +                                unsigned long pos, uint32_t bitmap)
> +{
> +    pds->addrlist[pds->cur_entry] = pos;
> +    pds->masklist[pds->cur_entry] = bitmap;
> +    pds->cur_entry++;
> +    pds->nsentwords++;
> +
> +    if (pds->cur_entry == MAX_DISCARDS_PER_COMMAND) {
> +        /* Full set, ship it! */
> +        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
> +                                              pds->cur_entry, pds->offset,
> +                                              pds->addrlist, pds->masklist);
> +        pds->nsentcmds++;
> +        pds->cur_entry = 0;
> +    }
> +}
> +
> +/*
> + * Called at the end of each RAMBlock by the bitmap code
> + * Sends any outstanding discard messages, frees the PDS
> + */
> +void postcopy_discard_send_finish(MigrationState *ms, PostcopyDiscardState *pds)
> +{
> +    /* Anything unsent? */
> +    if (pds->cur_entry) {
> +        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
> +                                              pds->cur_entry, pds->offset,
> +                                              pds->addrlist, pds->masklist);
> +        pds->nsentcmds++;
> +    }
> +
> +    trace_postcopy_discard_send_finish(pds->name, pds->nsentwords,
> +                                       pds->nsentcmds);
> +
> +    g_free(pds);
> +}
> diff --git a/savevm.c b/savevm.c
> index 1e8d289..2589b8c 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1282,15 +1282,12 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
>               * we know there must be at least 1 bit set due to the loop entry
>               * If there is no 0 firstzero will be 32
>               */
> -            /* TODO - ram_discard_range gets added in a later patch
>              int ret = ram_discard_range(mis, ramid,
>                                  startaddr + firstset - first_bit_offset,
>                                  startaddr + (firstzero - 1) - first_bit_offset);
> -            ret = -1;
>              if (ret) {
>                  return ret;
>              }
> -            */
>  
>              /* mask= .?0000000000 */
>              /*         ^fz ^fs    */
> diff --git a/trace-events b/trace-events
> index a555b56..f985117 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1217,6 +1217,7 @@ qemu_file_fclose(void) ""
>  migration_bitmap_sync_start(void) ""
>  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
>  migration_throttle(void) ""
> +ram_postcopy_send_discard_bitmap(void) ""
>  
>  # hw/display/qxl.c
>  disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
> @@ -1478,6 +1479,9 @@ rdma_start_incoming_migration_after_rdma_listen(void) ""
>  rdma_start_outgoing_migration_after_rdma_connect(void) ""
>  rdma_start_outgoing_migration_after_rdma_source_init(void) ""
>  
> +# migration/postcopy-ram.c
> +postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
> +
>  # kvm-all.c
>  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
>  kvm_vm_ioctl(int type, void *arg) "type 0x%x, arg %p"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation Dr. David Alan Gilbert (git)
@ 2015-03-23  3:41   ` David Gibson
  2015-03-23 13:46     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  3:41 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

On Wed, Feb 25, 2015 at 04:51:51PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Looks ok, apart from a misspelled comment:

[snip]
> +    /*
> +     * We need the whole of RAM to be truly empty for postcopy, so things
> +     * like ROMs and any data tables built during init must be zero'd
> +     * - we're going to get the copy from the source anyway.
> +     * (Precopy will just overwrite this data, so doesn't need the discard)
> +     */
> +    if (postcopy_ram_discard_range(mis, host_addr, (host_addr + length - 1))) {
> +        return -1;
> +    }
> +
> +    /*
> +     * We also need the area to be normal 4k pages, not huge pages
> +     * (otherwise we can't be sure we can atopically place the

s/atopically/atomically/


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault Dr. David Alan Gilbert (git)
@ 2015-03-23  3:45   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-23  3:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 604 bytes --]

On Wed, Feb 25, 2015 at 04:51:52PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Mark the area of RAM as 'userfault'
> Start up a fault-thread to handle any userfaults we might receive
> from it (to be filled in later)
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread Dr. David Alan Gilbert (git)
@ 2015-03-23  4:20   ` David Gibson
  2015-03-26 11:05     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  4:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 11395 bytes --]

On Wed, Feb 25, 2015 at 04:51:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Rework the migration thread to setup and start postcopy.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |   3 +
>  migration/migration.c         | 161 ++++++++++++++++++++++++++++++++++++++++--
>  trace-events                  |   4 ++
>  3 files changed, 164 insertions(+), 4 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 821d561..2c607e7 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -131,6 +131,9 @@ struct MigrationState
>      /* Flag set once the migration has been asked to enter postcopy */
>      bool start_postcopy;
>  
> +    /* Flag set once the migration thread is running (and needs joining) */
> +    bool started_migration_thread;
> +
>      /* bitmap of pages that have been sent at least once
>       * only maintained and used in postcopy at the moment
>       * where it's used to send the dirtymap at the start
> diff --git a/migration/migration.c b/migration/migration.c
> index b1ad7b1..6bf9c8d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -468,7 +468,10 @@ static void migrate_fd_cleanup(void *opaque)
>      if (s->file) {
>          trace_migrate_fd_cleanup();
>          qemu_mutex_unlock_iothread();
> -        qemu_thread_join(&s->thread);
> +        if (s->started_migration_thread) {
> +            qemu_thread_join(&s->thread);
> +            s->started_migration_thread = false;
> +        }
>          qemu_mutex_lock_iothread();
>  
>          qemu_fclose(s->file);
> @@ -874,7 +877,6 @@ out:
>      return NULL;
>  }
>  
> -__attribute__ (( unused )) /* Until later in patch series */
>  static int open_outgoing_return_path(MigrationState *ms)
>  {
>  
> @@ -911,23 +913,141 @@ static void await_outgoing_return_path_close(MigrationState *ms)
>  }
>  
>  /*
> + * Switch from normal iteration to postcopy
> + * Returns non-0 on error
> + */
> +static int postcopy_start(MigrationState *ms, bool *old_vm_running)
> +{
> +    int ret;
> +    const QEMUSizedBuffer *qsb;
> +    int64_t time_at_stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    migrate_set_state(ms, MIG_STATE_ACTIVE, MIG_STATE_POSTCOPY_ACTIVE);
> +
> +    trace_postcopy_start();
> +    qemu_mutex_lock_iothread();
> +    trace_postcopy_start_set_run();
> +
> +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> +    *old_vm_running = runstate_is_running();

I think that needs some explanation.  Why are you doing a wakeup on
the source host?


> +    ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> +
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    /*
> +     * in Finish migrate and with the io-lock held everything should
> +     * be quiet, but we've potentially still got dirty pages and we
> +     * need to tell the destination to throw any pages it's already received
> +     * that are dirty
> +     */
> +    if (ram_postcopy_send_discard_bitmap(ms)) {
> +        error_report("postcopy send discard bitmap failed");
> +        goto fail;
> +    }
> +
> +    /*
> +     * send rest of state - note things that are doing postcopy
> +     * will notice we're in MIG_STATE_POSTCOPY_ACTIVE and not actually
> +     * wrap their state up here
> +     */
> +    qemu_file_set_rate_limit(ms->file, INT64_MAX);
> +    /* Ping just for debugging, helps line traces up */
> +    qemu_savevm_send_ping(ms->file, 2);
> +
> +    /*
> +     * We need to leave the fd free for page transfers during the
> +     * loading of the device state, so wrap all the remaining
> +     * commands and state into a package that gets sent in one go
> +     */
> +    QEMUFile *fb = qemu_bufopen("w", NULL);
> +    if (!fb) {
> +        error_report("Failed to create buffered file");
> +        goto fail;
> +    }
> +
> +    qemu_savevm_state_complete(fb);
> +    qemu_savevm_send_ping(fb, 3);
> +
> +    qemu_savevm_send_postcopy_run(fb);
> +
> +    /* <><> end of stuff going into the package */
> +    qsb = qemu_buf_get(fb);
> +
> +    /* Now send that blob */
> +    if (qsb_get_length(qsb) > MAX_VM_CMD_PACKAGED_SIZE) {
> +        error_report("postcopy_start: Unreasonably large packaged state: %lu",
> +                     (unsigned long)(qsb_get_length(qsb)));
> +        goto fail_closefb;
> +    }
> +    qemu_savevm_send_packaged(ms->file, qsb);
> +    qemu_fclose(fb);
> +    ms->downtime =  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - time_at_stop;
> +
> +    qemu_mutex_unlock_iothread();
> +
> +    /*
> +     * Although this ping is just for debug, it could potentially be
> +     * used for getting a better measurement of downtime at the source.
> +     */
> +    qemu_savevm_send_ping(ms->file, 4);
> +
> +    ret = qemu_file_get_error(ms->file);
> +    if (ret) {
> +        error_report("postcopy_start: Migration stream errored");
> +        migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
> +    }
> +
> +    return ret;
> +
> +fail_closefb:
> +    qemu_fclose(fb);
> +fail:
> +    migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
> +    qemu_mutex_unlock_iothread();
> +    return -1;
> +}
> +
> +/*
>   * Master migration thread on the source VM.
>   * It drives the migration and pumps the data down the outgoing channel.
>   */
>  static void *migration_thread(void *opaque)
>  {
>      MigrationState *s = opaque;
> +    /* Used by the bandwidth calcs, updated later */
>      int64_t initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      int64_t initial_bytes = 0;
>      int64_t max_size = 0;
>      int64_t start_time = initial_time;
>      bool old_vm_running = false;
> +    bool entered_postcopy = false;
> +    /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */
> +    enum MigrationPhase current_active_type = MIG_STATE_ACTIVE;
>  
>      qemu_savevm_state_header(s->file);
> +
> +    if (migrate_postcopy_ram()) {
> +        /* Now tell the dest that it should open its end so it can reply */
> +        qemu_savevm_send_open_return_path(s->file);
> +
> +        /* And do a ping that will make stuff easier to debug */
> +        qemu_savevm_send_ping(s->file, 1);
> +
> +        /*
> +         * Tell the destination that we *might* want to do postcopy later;
> +         * if the other end can't do postcopy it should fail now, nice and
> +         * early.
> +         */
> +        qemu_savevm_send_postcopy_advise(s->file);
> +    }
> +
>      qemu_savevm_state_begin(s->file, &s->params);
>  
>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
> +    current_active_type = MIG_STATE_ACTIVE;
>      migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
>  
>      trace_migration_thread_setup_complete();
> @@ -946,6 +1066,22 @@ static void *migration_thread(void *opaque)
>              trace_migrate_pending(pending_size, max_size,
>                                    pend_post, pend_nonpost);
>              if (pending_size && pending_size >= max_size) {
> +                /* Still a significant amount to transfer */
> +
> +                current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +                if (migrate_postcopy_ram() &&
> +                    s->state != MIG_STATE_POSTCOPY_ACTIVE &&
> +                    pend_nonpost <= max_size &&
> +                    atomic_read(&s->start_postcopy)) {
> +
> +                    if (!postcopy_start(s, &old_vm_running)) {
> +                        current_active_type = MIG_STATE_POSTCOPY_ACTIVE;
> +                        entered_postcopy = true;

Do you need entered_postcopy, or could you just use the existing
MIG_STATE variable?

> +                    }
> +
> +                    continue;
> +                }
> +                /* Just another iteration step */
>                  qemu_savevm_state_iterate(s->file);
>              } else {
>                  int ret;
> @@ -975,7 +1111,8 @@ static void *migration_thread(void *opaque)
>          }
>  
>          if (qemu_file_get_error(s->file)) {
> -            migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR);
> +            migrate_set_state(s, current_active_type, MIG_STATE_ERROR);
> +            trace_migration_thread_file_err();
>              break;
>          }
>          current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> @@ -1006,12 +1143,15 @@ static void *migration_thread(void *opaque)
>          }
>      }
>  
> +    trace_migration_thread_after_loop();
>      qemu_mutex_lock_iothread();
>      if (s->state == MIG_STATE_COMPLETED) {
>          int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>          uint64_t transferred_bytes = qemu_ftell(s->file);
>          s->total_time = end_time - s->total_time;
> -        s->downtime = end_time - start_time;
> +        if (!entered_postcopy) {
> +            s->downtime = end_time - start_time;
> +        }
>          if (s->total_time) {
>              s->mbps = (((double) transferred_bytes * 8.0) /
>                         ((double) s->total_time)) / 1000;
> @@ -1043,8 +1183,21 @@ void migrate_fd_connect(MigrationState *s)
>      /* Notify before starting migration thread */
>      notifier_list_notify(&migration_state_notifiers, s);
>  
> +    /* Open the return path; currently for postcopy but other things might
> +     * also want it.
> +     */
> +    if (migrate_postcopy_ram()) {
> +        if (open_outgoing_return_path(s)) {
> +            error_report("Unable to open return-path for postcopy");
> +            migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ERROR);
> +            migrate_fd_cleanup(s);
> +            return;
> +        }
> +    }
> +
>      qemu_thread_create(&s->thread, "migration", migration_thread, s,
>                         QEMU_THREAD_JOINABLE);
> +    s->started_migration_thread = true;
>  }
>  
>  PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> diff --git a/trace-events b/trace-events
> index 59dea4c..ed8bbe2 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1404,9 +1404,13 @@ migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
>  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
>  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> +migration_thread_after_loop(void) ""
> +migration_thread_file_err(void) ""
>  migration_thread_setup_complete(void) ""
>  open_outgoing_return_path(void) ""
>  open_outgoing_return_path_continue(void) ""
> +postcopy_start(void) ""
> +postcopy_start_set_run(void) ""
>  source_return_path_thread_bad_end(void) ""
>  source_return_path_bad_header_com(void) ""
>  source_return_path_thread_end(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy Dr. David Alan Gilbert (git)
@ 2015-03-23  4:32   ` David Gibson
  2015-03-23 12:21     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  4:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 4326 bytes --]

On Wed, Feb 25, 2015 at 04:52:08PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The userfault mechanism used for postcopy generates faults
> for us on pages that are 'not present', inflating a balloon in
> the guest causes host pages to be marked as 'not present'; doing
> this during a postcopy, as potentially the same pages were being
> received from the source, would confuse the state of the received
> page -> disable ballooning during postcopy.

That is a ludicrously long sentence, which I have great difficulty parsing.

> When disabled we drop balloon requests from the guest.  Since ballooning
> is generally initiated by the host, the management system should avoid
> initiating any balloon instructions to the guest during migration,
> although it's not possible to know how long it would take a guest to
> process a request made prior to the start of migration.

Yeah :/.  It would be nice if it could queue the guest actions,
instead of dropping them.

> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  balloon.c                  | 11 +++++++++++
>  hw/virtio/virtio-balloon.c |  4 +++-
>  include/sysemu/balloon.h   |  2 ++
>  migration/postcopy-ram.c   |  9 +++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/balloon.c b/balloon.c
> index dea19a4..faedb60 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -35,6 +35,17 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> +static bool balloon_inhibited;
> +
> +bool qemu_balloon_is_inhibited(void)
> +{
> +    return balloon_inhibited;
> +}
> +
> +void qemu_balloon_inhibit(bool state)
> +{
> +    balloon_inhibited = state;
> +}
>  
>  static bool have_ballon(Error **errp)
>  {
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 7bfbb75..b0e94ee 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -36,9 +36,11 @@
>  static void balloon_page(void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!kvm_enabled() || kvm_has_sync_mmu())
> +    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> +                                         kvm_has_sync_mmu())) {
>          qemu_madvise(addr, TARGET_PAGE_SIZE,
>                  deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    }
>  #endif
>  }
>  
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 0345e01..6851d99 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -23,5 +23,7 @@ typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>  			     QEMUBalloonStatus *stat_func, void *opaque);
>  void qemu_remove_balloon_handler(void *opaque);
> +bool qemu_balloon_is_inhibited(void);
> +void qemu_balloon_inhibit(bool state);
>  
>  #endif
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index d8f5ccd..b9f5848 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -24,6 +24,7 @@
>  #include "migration/migration.h"
>  #include "migration/postcopy-ram.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/balloon.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
> @@ -531,6 +532,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>          mis->have_fault_thread = false;
>      }
>  
> +    qemu_balloon_inhibit(false);
> +
>      if (enable_mlock) {
>          if (os_mlock() < 0) {
>              error_report("mlock: %s", strerror(errno));
> @@ -780,6 +783,12 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> +    /*
> +     * Ballooning can mark pages as absent while we're postcopying
> +     * that would cause false userfaults.
> +     */
> +    qemu_balloon_inhibit(true);
> +
>      trace_postcopy_ram_enable_notify();
>  
>      return 0;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy Dr. David Alan Gilbert (git)
@ 2015-03-23  4:33   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-23  4:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 741 bytes --]

On Wed, Feb 25, 2015 at 04:52:07PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Userfault doesn't work with mlock; mlock is designed to nail down pages
> so they don't move, userfault is designed to tell you when they're not
> there.
> 
> munlock the pages we userfault protect before postcopy.
> mlock everything again at the end if mlock is enabled.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command Dr. David Alan Gilbert (git)
@ 2015-03-23  5:00   ` David Gibson
  2015-03-25 18:16     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-23  5:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 7322 bytes --]

On Wed, Feb 25, 2015 at 04:51:55PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add MIG_RP_CMD_REQ_PAGES command on Return path for the postcopy
> destination to request a page from the source.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |  4 +++
>  migration/migration.c         | 70 +++++++++++++++++++++++++++++++++++++++++++
>  trace-events                  |  1 +
>  3 files changed, 75 insertions(+)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 2c607e7..2c15d63 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -46,6 +46,8 @@ enum mig_rpcomm_cmd {
>      MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
>      MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
>      MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> +
> +    MIG_RP_CMD_REQ_PAGES,    /* data (start: be64, len: be64) */
>  };
>  
>  /* Postcopy page-map-incoming - data about each page on the inbound side */
> @@ -253,6 +255,8 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
>                            uint32_t value);
>  void migrate_send_rp_pong(MigrationIncomingState *mis,
>                            uint32_t value);
> +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* rbname,
> +                              ram_addr_t start, ram_addr_t len);
>  
>  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> diff --git a/migration/migration.c b/migration/migration.c
> index bd066f6..2e9d0dd 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -138,6 +138,36 @@ void migrate_send_rp_pong(MigrationIncomingState *mis,
>      migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
>  }
>  
> +/* Request a range of pages from the source VM at the given
> + * start address.
> + *   rbname: Name of the RAMBlock to request the page in, if NULL it's the same
> + *           as the last request (a name must have been given previously)
> + *   Start: Address offset within the RB
> + *   Len: Length in bytes required - must be a multiple of pagesize
> + */
> +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char *rbname,
> +                               ram_addr_t start, ram_addr_t len)
> +{
> +    uint8_t bufc[16+1+255]; /* start (8 byte), len (8 byte), rbname upto 256 */
> +    uint64_t *buf64 = (uint64_t *)bufc;
> +    size_t msglen = 16; /* start + len */
> +
> +    assert(!(len & 1));
> +    if (rbname) {
> +        int rbname_len = strlen(rbname);
> +        assert(rbname_len < 256);
> +
> +        len |= 1; /* Flag to say we've got a name */
> +        bufc[msglen++] = rbname_len;
> +        memcpy(bufc + msglen, rbname, rbname_len);
> +        msglen += rbname_len;
> +    }
> +
> +    buf64[0] = cpu_to_be64((uint64_t)start);
> +    buf64[1] = cpu_to_be64((uint64_t)len);
> +    migrate_send_rp_message(mis, MIG_RP_CMD_REQ_PAGES, msglen, bufc);

So.. what's the reason we actually need ramblock names on the wire,
rather than working purely from GPAs?

It occurs to me that referencing ramblock names from the wire protocol
exposes something that's kind of an internal detail, and may limit our
options for reworking the memory subsystem in future.
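
For reference, the payload layout (throwaway standalone sketch; the helper
name and the htobe64() calls are illustrative, not from the series):

#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <endian.h>          /* htobe64(), glibc-style */

/* Encode a REQ_PAGES payload: start (be64), len (be64), and optionally a
 * length-prefixed RAMBlock name.  The low bit of 'len' is reused as the
 * "a name follows" flag, so callers must pass an even 'len'. */
static size_t encode_req_pages(uint8_t *buf, uint64_t start, uint64_t len,
                               const char *rbname)
{
    size_t msglen = 16;                       /* start + len */
    uint64_t be;

    assert(!(len & 1));
    if (rbname) {
        size_t rbname_len = strlen(rbname);

        assert(rbname_len < 256);
        len |= 1;                             /* flag: name present */
        buf[msglen++] = (uint8_t)rbname_len;
        memcpy(buf + msglen, rbname, rbname_len);
        msglen += rbname_len;
    }
    be = htobe64(start);
    memcpy(buf, &be, 8);
    be = htobe64(len);
    memcpy(buf + 8, &be, 8);
    return msglen;                            /* 16, or 17 + name length */
}

The receiver inverts this: 16 bytes always, plus one length byte and the
name when bit 0 of len is set.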

> +}
> +
>  void qemu_start_incoming_migration(const char *uri, Error **errp)
>  {
>      const char *p;
> @@ -789,6 +819,17 @@ static void source_return_path_bad(MigrationState *s)
>  }
>  
>  /*
> + * Process a request for pages received on the return path,
> + * We're allowed to send more than requested (e.g. to round to our page size)
> + * and we don't need to send pages that have already been sent.
> + */
> +static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> +                                       ram_addr_t start, ram_addr_t len)
> +{
> +    trace_migrate_handle_rp_req_pages(start, len);
> +}
> +
> +/*
>   * Handles messages sent on the return path towards the source VM
>   *
>   */
> @@ -800,6 +841,8 @@ static void *source_return_path_thread(void *opaque)
>      const int max_len = 512;
>      uint8_t buf[max_len];
>      uint32_t tmp32;
> +    ram_addr_t start, len;
> +    char *tmpstr;
>      int res;
>  
>      trace_source_return_path_thread_entry();
> @@ -815,6 +858,11 @@ static void *source_return_path_thread(void *opaque)
>              expected_len = 4;
>              break;
>  
> +        case MIG_RP_CMD_REQ_PAGES:
> +            /* 16 byte start/len _possibly_ plus an id str */
> +            expected_len = 16 + 256;

Isn't that the maximum length, rather than the minimum or typical length?

> +            break;
> +
>          default:
>              error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
>                      header_com, header_len);
> @@ -860,6 +908,28 @@ static void *source_return_path_thread(void *opaque)
>              trace_source_return_path_thread_pong(tmp32);
>              break;
>  
> +        case MIG_RP_CMD_REQ_PAGES:
> +            start = be64_to_cpup((uint64_t *)buf);
> +            len = be64_to_cpup(((uint64_t *)buf)+1);
> +            tmpstr = NULL;
> +            if (len & 1) {
> +                len -= 1; /* Remove the flag */
> +                /* Now we expect an idstr */
> +                tmp32 = buf[16]; /* Length of the following idstr */
> +                tmpstr = (char *)&buf[17];
> +                buf[17+tmp32] = '\0';
> +                expected_len = 16+1+tmp32;
> +            } else {
> +                expected_len = 16;

Ah.. so expected_len is changed here.  But then what was the point of
setting it in the earlier switch?

> +            }
> +            if (header_len != expected_len) {
> +                error_report("RP: Req_Page with length %d expecting %d",
> +                        header_len, expected_len);
> +                source_return_path_bad(ms);
> +            }
> +            migrate_handle_rp_req_pages(ms, tmpstr, start, len);
> +            break;
> +
>          default:
>              /* This shouldn't happen because we should catch this above */
>              trace_source_return_path_bad_header_com();
> diff --git a/trace-events b/trace-events
> index bcbdef8..9bedee4 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1404,6 +1404,7 @@ migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
>  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
>  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> +migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
>  migration_thread_after_loop(void) ""
>  migration_thread_file_err(void) ""
>  migration_thread_setup_complete(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy
  2015-03-23  4:32   ` David Gibson
@ 2015-03-23 12:21     ` Dr. David Alan Gilbert
  2015-03-24  1:25       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-23 12:21 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:52:08PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The userfault mechanism used for postcopy generates faults
> > for us on pages that are 'not present', inflating a balloon in
> > the guest causes host pages to be marked as 'not present'; doing
> > this during a postcopy, as potentially the same pages were being
> > received from the source, would confuse the state of the received
> > page -> disable ballooning during postcopy.
> 
> That is a ludicrously long sentence, which I have great difficulty parsing.

OK, how about:

-----
Postcopy detects accesses to pages that haven't been transferred yet
using userfaultfd, and it causes exceptions on pages that are 'not present'.
Ballooning also causes pages to be marked as 'not present' when the guest
inflates the balloon.
Potentially a balloon could be inflated to discard pages that are currently
inflight during postcopy and that may be arriving at about the same time.

To avoid this confusion, disable ballooning during postcopy.

-----
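
To make the conflict concrete: on the host an inflate boils down to
MADV_DONTNEED, which throws the page contents away.  A minimal Linux-only
sketch (not part of the series) of what that does to a page we think we've
already received:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t ps = (size_t)getpagesize();
    unsigned char *p = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        return 1;
    }
    memset(p, 0xaa, ps);            /* "page already received from source" */
    madvise(p, ps, MADV_DONTNEED);  /* "guest inflates the balloon"        */
    /* Reads back 0: the data is gone.  With the region registered with
     * userfaultfd, this access would instead raise a userfault for a page
     * postcopy believes it already placed. */
    printf("first byte after DONTNEED: %#x\n", (unsigned)p[0]);
    return 0;
}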

> > When disabled we drop balloon requests from the guest.  Since ballooning
> > is generally initiated by the host, the management system should avoid
> > initiating any balloon instructions to the guest during migration,
> > although it's not possible to know how long it would take a guest to
> > process a request made prior to the start of migration.
> 
> Yeah :/.  It would be nice if it could queue the guest actions,
> instead of dropping them.

Yes, I did look at that briefly; it's not trivial.  For
example, consider the situation where the guest discards some pages
by inflating and then later deflates: it expects to lose that data
but then starts accessing that physical page again.
If you replay that sequence at the end then you've lost the newly accessed pages.
So you have to filter out inflates that have been deflated later,
and order those correctly relative to any changes made to those
pages after the deflation occurs.

Dave

> 
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  balloon.c                  | 11 +++++++++++
> >  hw/virtio/virtio-balloon.c |  4 +++-
> >  include/sysemu/balloon.h   |  2 ++
> >  migration/postcopy-ram.c   |  9 +++++++++
> >  4 files changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/balloon.c b/balloon.c
> > index dea19a4..faedb60 100644
> > --- a/balloon.c
> > +++ b/balloon.c
> > @@ -35,6 +35,17 @@
> >  static QEMUBalloonEvent *balloon_event_fn;
> >  static QEMUBalloonStatus *balloon_stat_fn;
> >  static void *balloon_opaque;
> > +static bool balloon_inhibited;
> > +
> > +bool qemu_balloon_is_inhibited(void)
> > +{
> > +    return balloon_inhibited;
> > +}
> > +
> > +void qemu_balloon_inhibit(bool state)
> > +{
> > +    balloon_inhibited = state;
> > +}
> >  
> >  static bool have_ballon(Error **errp)
> >  {
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 7bfbb75..b0e94ee 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -36,9 +36,11 @@
> >  static void balloon_page(void *addr, int deflate)
> >  {
> >  #if defined(__linux__)
> > -    if (!kvm_enabled() || kvm_has_sync_mmu())
> > +    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> > +                                         kvm_has_sync_mmu())) {
> >          qemu_madvise(addr, TARGET_PAGE_SIZE,
> >                  deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > +    }
> >  #endif
> >  }
> >  
> > diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> > index 0345e01..6851d99 100644
> > --- a/include/sysemu/balloon.h
> > +++ b/include/sysemu/balloon.h
> > @@ -23,5 +23,7 @@ typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> >  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> >  			     QEMUBalloonStatus *stat_func, void *opaque);
> >  void qemu_remove_balloon_handler(void *opaque);
> > +bool qemu_balloon_is_inhibited(void);
> > +void qemu_balloon_inhibit(bool state);
> >  
> >  #endif
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index d8f5ccd..b9f5848 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -24,6 +24,7 @@
> >  #include "migration/migration.h"
> >  #include "migration/postcopy-ram.h"
> >  #include "sysemu/sysemu.h"
> > +#include "sysemu/balloon.h"
> >  #include "qemu/bitmap.h"
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> > @@ -531,6 +532,8 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >          mis->have_fault_thread = false;
> >      }
> >  
> > +    qemu_balloon_inhibit(false);
> > +
> >      if (enable_mlock) {
> >          if (os_mlock() < 0) {
> >              error_report("mlock: %s", strerror(errno));
> > @@ -780,6 +783,12 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >          return -1;
> >      }
> >  
> > +    /*
> > +     * Ballooning can mark pages as absent while we're postcopying
> > +     * that would cause false userfaults.
> > +     */
> > +    qemu_balloon_inhibit(true);
> > +
> >      trace_postcopy_ram_enable_notify();
> >  
> >      return 0;
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation
  2015-03-23  3:41   ` David Gibson
@ 2015-03-23 13:46     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-23 13:46 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:51PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> Looks ok, apart from a misspelled comment:
> 
> [snip]
> > +    /*
> > +     * We need the whole of RAM to be truly empty for postcopy, so things
> > +     * like ROMs and any data tables built during init must be zero'd
> > +     * - we're going to get the copy from the source anyway.
> > +     * (Precopy will just overwrite this data, so doesn't need the discard)
> > +     */
> > +    if (postcopy_ram_discard_range(mis, host_addr, (host_addr + length - 1))) {
> > +        return -1;
> > +    }
> > +
> > +    /*
> > +     * We also need the area to be normal 4k pages, not huge pages
> > +     * (otherwise we can't be sure we can atopically place the
> 
> s/atopically/atomically/

Thanks!

Dave

> 
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard
  2015-03-23  3:30   ` David Gibson
@ 2015-03-23 14:36     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-23 14:36 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:50PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Where postcopy is preceded by a period of precopy, the destination will
> > have received pages that may have been dirtied on the source after the
> > page was sent.  The destination must throw these pages away before
> > starting its CPUs.
> > 
> > Maintain a 'sentmap' of pages that have already been sent.
> > Calculate list of sent & dirty pages
> > Provide helpers on the destination side to discard these.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  arch_init.c                      | 275 ++++++++++++++++++++++++++++++++++++++-
> >  include/migration/migration.h    |  12 ++
> >  include/migration/postcopy-ram.h |  34 +++++
> >  include/qemu/typedefs.h          |   1 +
> >  migration/migration.c            |   1 +
> >  migration/postcopy-ram.c         | 111 ++++++++++++++++
> >  savevm.c                         |   3 -
> >  trace-events                     |   4 +
> >  8 files changed, 435 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch_init.c b/arch_init.c
> > index 7bc5fa6..21e7ebe 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -40,6 +40,7 @@
> >  #include "hw/audio/audio.h"
> >  #include "sysemu/kvm.h"
> >  #include "migration/migration.h"
> > +#include "migration/postcopy-ram.h"
> >  #include "hw/i386/smbios.h"
> >  #include "exec/address-spaces.h"
> >  #include "hw/audio/pcspk.h"
> > @@ -414,9 +415,17 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t **current_data,
> >      return bytes_sent;
> >  }
> >  
> > +/* mr: The region to search for dirty pages in
> > + * start: Start address (typically so we can continue from previous page)
> > + * ram_addr_abs: Pointer into which to store the address of the dirty page
> > + *               within the global ram_addr space
> > + *
> > + * Returns: byte offset within memory region of the start of a dirty page
> > + */
> >  static inline
> >  ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
> > -                                                 ram_addr_t start)
> > +                                                 ram_addr_t start,
> > +                                                 ram_addr_t *ram_addr_abs)
> >  {
> >      unsigned long base = mr->ram_addr >> TARGET_PAGE_BITS;
> >      unsigned long nr = base + (start >> TARGET_PAGE_BITS);
> > @@ -435,6 +444,7 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
> >          clear_bit(next, migration_bitmap);
> >          migration_dirty_pages--;
> >      }
> > +    *ram_addr_abs = next << TARGET_PAGE_BITS;
> >      return (next - base) << TARGET_PAGE_BITS;
> >  }
> >  
> > @@ -571,6 +581,19 @@ static void migration_bitmap_sync(void)
> >      }
> >  }
> >  
> > +static RAMBlock *ram_find_block(const char *id)
> > +{
> > +    RAMBlock *block;
> > +
> > +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> > +        if (!strcmp(id, block->idstr)) {
> > +            return block;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> >  /*
> >   * ram_save_page: Send the given page to the stream
> >   *
> > @@ -659,13 +682,16 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
> >      bool complete_round = false;
> >      int bytes_sent = 0;
> >      MemoryRegion *mr;
> > +    ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
> > +                                 ram_addr_t space */
> >  
> >      if (!block)
> >          block = QTAILQ_FIRST(&ram_list.blocks);
> >  
> >      while (true) {
> >          mr = block->mr;
> > -        offset = migration_bitmap_find_and_reset_dirty(mr, offset);
> > +        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> > +                                                       &dirty_ram_abs);
> >          if (complete_round && block == last_seen_block &&
> >              offset >= last_offset) {
> >              break;
> > @@ -683,6 +709,11 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
> >  
> >              /* if page is unmodified, continue to the next */
> >              if (bytes_sent > 0) {
> > +                MigrationState *ms = migrate_get_current();
> > +                if (ms->sentmap) {
> > +                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> > +                }
> > +
> >                  last_sent_block = block;
> >                  break;
> >              }
> > @@ -742,12 +773,19 @@ void free_xbzrle_decoded_buf(void)
> >  
> >  static void migration_end(void)
> >  {
> > +    MigrationState *s = migrate_get_current();
> > +
> >      if (migration_bitmap) {
> >          memory_global_dirty_log_stop();
> >          g_free(migration_bitmap);
> >          migration_bitmap = NULL;
> >      }
> >  
> > +    if (s->sentmap) {
> > +        g_free(s->sentmap);
> > +        s->sentmap = NULL;
> > +    }
> > +
> >      XBZRLE_cache_lock();
> >      if (XBZRLE.cache) {
> >          cache_fini(XBZRLE.cache);
> > @@ -815,6 +853,232 @@ void ram_debug_dump_bitmap(unsigned long *todump, bool expected)
> >      }
> >  }
> >  
> > +/* **** functions for postcopy ***** */
> > +
> > +/*
> > + * A helper to get 32 bits from a bit map; trivial for HOST_LONG_BITS=32
> > + * messier for 64; the bitmaps are actually long's that are 32 or 64bit
> > + */
> > +static uint32_t get_32bits_map(unsigned long *map, int64_t start)
> > +{
> > +#if HOST_LONG_BITS == 64
> > +    uint64_t tmp64;
> > +
> > +    tmp64 = map[start / 64];
> > +    return (start & 32) ? (tmp64 >> 32) : (tmp64 & 0xffffffffu);
> > +#elif HOST_LONG_BITS == 32
> > +    /*
> > +     * Irrespective of host endianness, sentmap[n] is for pages earlier
> > +     * than sentmap[n+1] so we can't just cast up
> > +     */
> > +    return map[start / 32];
> > +#else
> > +#error "Host long other than 64/32 not supported"
> > +#endif
> > +}
> > +
> > +/*
> > + * A helper to put 32 bits into a bit map; trivial for HOST_LONG_BITS=32
> > + * messier for 64; the bitmaps are actually long's that are 32 or 64bit
> > + */
> > +__attribute__ (( unused )) /* Until later in patch series */
> > +static void put_32bits_map(unsigned long *map, int64_t start,
> > +                           uint32_t v)
> > +{
> > +#if HOST_LONG_BITS == 64
> > +    uint64_t tmp64 = v;
> > +    uint64_t mask = 0xffffffffu;
> > +
> > +    if (start & 32) {
> > +        tmp64 = tmp64 << 32;
> > +        mask =  mask << 32;
> > +    }
> > +
> > +    map[start / 64] = (map[start / 64] & ~mask) | tmp64;
> > +#elif HOST_LONG_BITS == 32
> > +    /*
> > +     * Irrespective of host endianness, sentmap[n] is for pages earlier
> > +     * than sentmap[n+1] so we can't just cast up
> > +     */
> > +    map[start / 32] = v;
> > +#else
> > +#error "Host long other than 64/32 not supported"
> > +#endif
> > +}
> > +
> > +/*
> > + * When working on 32bit chunks of a bitmap where the only valid section
> > + * is between start..end (inclusive), generate a mask with only those
> > + * valid bits set for the current 32bit word within that bitmask.
> > + */
> > +static int make_32bit_mask(unsigned long start, unsigned long end,
> > +                           unsigned long cur32)
> > +{
> > +    unsigned long first32, last32;
> > +    uint32_t mask = ~(uint32_t)0;
> > +    first32 = start / 32;
> > +    last32 = end / 32;
> > +
> > +    if ((cur32 == first32) && (start & 31)) {
> > +        /* e.g. (start & 31) = 3
> > +         *         1 << .    -> 2^3
> > +         *         . - 1     -> 2^3 - 1 i.e. mask 2..0
> > +         *         ~.        -> mask 31..3
> > +         */
> > +        mask &= ~((((uint32_t)1) << (start & 31)) - 1);
> > +    }
> > +
> > +    if ((cur32 == last32) && ((end & 31) != 31)) {
> > +        /* e.g. (end & 31) = 3
> > +         *            .   +1 -> 4
> > +         *         1 << .    -> 2^4
> > +         *         . -1      -> 2^4 - 1
> > +         *                   = mask set 3..0
> > +         */
> > +        mask &= (((uint32_t)1) << ((end & 31) + 1)) - 1;
> > +    }
> > +
> > +    return mask;
> > +}
> 
> Urgh.. looks correct, but it continues to make me question whether this
> is a sensible wire encoding of the discard map.
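
A quick sanity check of those masks (throwaway test, not part of the
series; local copy of the function with the return type tightened to
uint32_t):

#include <assert.h>
#include <stdint.h>

/* Local copy of make_32bit_mask() from the patch, for a quick test. */
static uint32_t make_32bit_mask(unsigned long start, unsigned long end,
                                unsigned long cur32)
{
    unsigned long first32 = start / 32, last32 = end / 32;
    uint32_t mask = ~(uint32_t)0;

    if ((cur32 == first32) && (start & 31)) {
        mask &= ~((((uint32_t)1) << (start & 31)) - 1);
    }
    if ((cur32 == last32) && ((end & 31) != 31)) {
        mask &= (((uint32_t)1) << ((end & 31) + 1)) - 1;
    }
    return mask;
}

int main(void)
{
    /* Bits 3..68 valid: the first word loses bits 0..2, the middle word is
     * left untouched, the last word keeps only bits 0..4 (68 & 31 == 4). */
    assert(make_32bit_mask(3, 68, 0) == 0xfffffff8u);
    assert(make_32bit_mask(3, 68, 1) == 0xffffffffu);
    assert(make_32bit_mask(3, 68, 2) == 0x0000001fu);
    return 0;
}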
> 
> > +
> > +/*
> > + * Callback from ram_postcopy_each_ram_discard for each RAMBlock
> > + * start,end: Indexes into the bitmap for the first and last bit
> > + *            representing the named block
> > + */
> > +static int pc_send_discard_bm_ram(MigrationState *ms,
> > +                                  PostcopyDiscardState *pds,
> > +                                  unsigned long start, unsigned long end)
> 
> I know it's a long name, but I'd prefer to see this called
> "postcopy_".  I have to keep reminding myself that "pc" means
> "postcopy" here, not that it's a PC machine type specific function.

OK, yeh I see that's a clash; done.

> 
> > +{
> > +    /*
> > +     * There is no guarantee that start, end are on convenient 32bit multiples
> > +     * (We always send 32bit chunks over the wire, irrespective of long size)
> > +     */
> > +    unsigned long first32, last32, cur32;
> > +    first32 = start / 32;
> > +    last32 = end / 32;
> > +
> > +    for (cur32 = first32; cur32 <= last32; cur32++) {
> > +        /* Deal with start/end not on alignment */
> > +        uint32_t mask = make_32bit_mask(start, end, cur32);
> > +
> > +        uint32_t data = get_32bits_map(ms->sentmap, cur32 * 32);
> > +        data &= mask;
> > +
> > +        if (data) {
> > +            postcopy_discard_send_chunk(ms, pds, (cur32-first32) * 32, data);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> 
> 
> > +/*
> > + * Utility for the outgoing postcopy code.
> > + *   Calls postcopy_send_discard_bm_ram for each RAMBlock
> > + *   passing it bitmap indexes and name.
> > + * Returns: 0 on success
> > + * (qemu_ram_foreach_block ends up passing unscaled lengths
> > + *  which would mean postcopy code would have to deal with target page)
> > + */
> > +static int pc_each_ram_discard(MigrationState *ms)
> 
> s/discard/send_discard/

Done.

> 
> > +{
> > +    struct RAMBlock *block;
> > +    int ret;
> > +
> > +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> > +        unsigned long first = block->offset >> TARGET_PAGE_BITS;
> > +        unsigned long last = (block->offset + (block->max_length-1))
> > +                                >> TARGET_PAGE_BITS;
> > +        PostcopyDiscardState *pds = postcopy_discard_send_init(ms,
> > +                                                               first & 31,
> > +                                                               block->idstr);
> > +
> > +        /*
> > +         * Postcopy sends chunks of bitmap over the wire, but it
> > +         * just needs indexes at this point, avoids it having
> > +         * target page specific code.
> > +         */
> > +        ret = pc_send_discard_bm_ram(ms, pds, first, last);
> > +        postcopy_discard_send_finish(ms, pds);
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/*
> > + * Transmit the set of pages to be discarded after precopy to the target
> > + * these are pages that have been sent previously but have been dirtied
> > + * Hopefully this is pretty sparse
> > + */
> > +int ram_postcopy_send_discard_bitmap(MigrationState *ms)
> > +{
> > +    /* This should be our last sync, the src is now paused */
> > +    migration_bitmap_sync();
> > +
> > +    /*
> > +     * Update the sentmap to be  sentmap&=dirty
> > +     */
> > +    bitmap_and(ms->sentmap, ms->sentmap, migration_bitmap,
> > +               last_ram_offset() >> TARGET_PAGE_BITS);
> > +
> > +
> > +    trace_ram_postcopy_send_discard_bitmap();
> > +#ifdef DEBUG_POSTCOPY
> > +    ram_debug_dump_bitmap(ms->sentmap, false);
> > +#endif
> > +
> > +    return pc_each_ram_discard(ms);
> > +}
> > +
> > +/*
> > + * At the start of the postcopy phase of migration, any now-dirty
> > + * precopied pages are discarded.
> > + *
> > + * start..end is an inclusive range of bits indexed in the source
> > + *    VMs bitmap for this RAMBlock, source_target_page_bits tells
> > + *    us what one of those bits represents.
> > + *
> > + * start/end are offsets from the start of the bitmap for RAMBlock 'block_name'
> > + *
> > + * Returns 0 on success.
> > + */
> > +int ram_discard_range(MigrationIncomingState *mis,
> > +                      const char *block_name,
> > +                      uint64_t start, uint64_t end)
> > +{
> > +    assert(end >= start);
> > +
> > +    RAMBlock *rb = ram_find_block(block_name);
> > +
> > +    if (!rb) {
> > +        error_report("ram_discard_range: Failed to find block '%s'",
> > +                     block_name);
> > +        return -1;
> > +    }
> > +
> > +    uint64_t index_offset = rb->offset >> TARGET_PAGE_BITS;
> > +    postcopy_pmi_discard_range(mis, start + index_offset, (end - start) + 1);
> > +
> > +    /* +1 gives the byte after the end of the last page to be discarded */
> > +    ram_addr_t end_offset = (end+1) << TARGET_PAGE_BITS;
> > +    uint8_t *host_startaddr = rb->host + (start << TARGET_PAGE_BITS);
> > +    uint8_t *host_endaddr;
> > +
> > +    if (end_offset <= rb->used_length) {
> > +        host_endaddr   = rb->host + (end_offset-1);
> > +        return postcopy_ram_discard_range(mis, host_startaddr, host_endaddr);
> > +    } else {
> > +        error_report("ram_discard_range: Overrun block '%s' (%" PRIu64
> > +                     "/%" PRIu64 "/%zu)",
> > +                     block_name, start, end, rb->used_length);
> > +        return -1;
> > +    }
> > +}
> > +
> >  static int ram_save_setup(QEMUFile *f, void *opaque)
> >  {
> >      RAMBlock *block;
> > @@ -854,7 +1118,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >  
> >          acct_clear();
> >      }
> > -
> >      qemu_mutex_lock_iothread();
> >      qemu_mutex_lock_ramlist();
> >      bytes_transferred = 0;
> > @@ -864,6 +1127,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >      migration_bitmap = bitmap_new(ram_bitmap_pages);
> >      bitmap_set(migration_bitmap, 0, ram_bitmap_pages);
> >  
> > +    if (migrate_postcopy_ram()) {
> > +        MigrationState *s = migrate_get_current();
> > +        s->sentmap = bitmap_new(ram_bitmap_pages);
> > +        bitmap_clear(s->sentmap, 0, ram_bitmap_pages);
> > +    }
> > +
> >      /*
> >       * Count the total number of pages used by ram blocks not including any
> >       * gaps due to alignment or unplugs.
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 86200b9..e749f4c 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -125,6 +125,13 @@ struct MigrationState
> >  
> >      /* Flag set once the migration has been asked to enter postcopy */
> >      bool start_postcopy;
> > +
> > +    /* bitmap of pages that have been sent at least once
> > +     * only maintained and used in postcopy at the moment
> > +     * where it's used to send the dirtymap at the start
> > +     * of the postcopy phase
> > +     */
> > +    unsigned long *sentmap;
> >  };
> >  
> >  void process_incoming_migration(QEMUFile *f);
> > @@ -194,6 +201,11 @@ double xbzrle_mig_cache_miss_rate(void);
> >  
> >  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
> >  void ram_debug_dump_bitmap(unsigned long *todump, bool expected);
> > +/* For outgoing discard bitmap */
> > +int ram_postcopy_send_discard_bitmap(MigrationState *ms);
> > +/* For incoming postcopy discard */
> > +int ram_discard_range(MigrationIncomingState *mis, const char *block_name,
> > +                      uint64_t start, uint64_t end);
> >  
> >  /**
> >   * @migrate_add_blocker - prevent migration from proceeding
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > index e93ee8a..1fec1c1 100644
> > --- a/include/migration/postcopy-ram.h
> > +++ b/include/migration/postcopy-ram.h
> > @@ -28,4 +28,38 @@ void postcopy_pmi_destroy(MigrationIncomingState *mis);
> >  void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> >                                  size_t start, size_t npages);
> >  void postcopy_pmi_dump(MigrationIncomingState *mis);
> > +
> > +/*
> > + * Discard the contents of memory start..end inclusive.
> > + * We can assume that if we've been called postcopy_ram_hosttest returned true
> > + */
> > +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > +                               uint8_t *end);
> > +
> > +
> > +/*
> > + * Called at the start of each RAMBlock by the bitmap code
> > + * offset is the bit within the first 32bit chunk of mask
> > + * that represents the first page of the RAM Block
> > + * Returns a new PDS
> > + */
> > +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> > +                                                 uint8_t offset,
> > +                                                 const char *name);
> > +
> > +/*
> > + * Called by the bitmap code for each chunk to discard
> > + * May send a discard message, may just leave it queued to
> > + * be sent later
> > + */
> > +void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
> > +                                unsigned long pos, uint32_t bitmap);
> > +
> > +/*
> > + * Called at the end of each RAMBlock by the bitmap code
> > + * Sends any outstanding discard messages, frees the PDS
> > + */
> > +void postcopy_discard_send_finish(MigrationState *ms,
> > +                                  PostcopyDiscardState *pds);
> > +
> >  #endif
> > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > index 924eeb6..0651275 100644
> > --- a/include/qemu/typedefs.h
> > +++ b/include/qemu/typedefs.h
> > @@ -61,6 +61,7 @@ typedef struct PCIExpressHost PCIExpressHost;
> >  typedef struct PCIHostState PCIHostState;
> >  typedef struct PCMCIACardState PCMCIACardState;
> >  typedef struct PixelFormat PixelFormat;
> > +typedef struct PostcopyDiscardState PostcopyDiscardState;
> >  typedef struct PostcopyPMI PostcopyPMI;
> >  typedef struct PropertyInfo PropertyInfo;
> >  typedef struct Property Property;
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 6b20b56..850fe1a 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -22,6 +22,7 @@
> >  #include "block/block.h"
> >  #include "qemu/sockets.h"
> >  #include "migration/block.h"
> > +#include "migration/postcopy-ram.h"
> >  #include "qemu/thread.h"
> >  #include "qmp-commands.h"
> >  #include "trace.h"
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 4f29055..391e9c6 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -28,6 +28,19 @@
> >  #include "qemu/error-report.h"
> >  #include "trace.h"
> >  
> > +#define MAX_DISCARDS_PER_COMMAND 12
> > +
> > +struct PostcopyDiscardState {
> > +    const char *name;
> > +    uint16_t cur_entry;
> > +    uint64_t addrlist[MAX_DISCARDS_PER_COMMAND];
> > +    uint32_t masklist[MAX_DISCARDS_PER_COMMAND];
> > +    uint8_t  offset;  /* Offset within 32bit mask at addr0 representing 1st
> > +                         page of block */
> > +    unsigned int nsentwords;
> > +    unsigned int nsentcmds;
> > +};
> > +
> >  /* Postcopy needs to detect accesses to pages that haven't yet been copied
> >   * across, and efficiently map new pages in, the techniques for doing this
> >   * are target OS specific.
> > @@ -364,6 +377,21 @@ out:
> >      return ret;
> >  }
> >  
> > +/*
> > + * Discard the contents of memory start..end inclusive.
> > + * We can assume that if we've been called postcopy_ram_hosttest returned true
> > + */
> > +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > +                               uint8_t *end)
> > +{
> > +    if (madvise(start, (end-start)+1, MADV_DONTNEED)) {
> > +        perror("postcopy_ram_discard_range MADV_DONTNEED");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  #else
> >  /* No target OS support, stubs just fail */
> >  
> > @@ -380,5 +408,88 @@ void postcopy_hook_early_receive(MigrationIncomingState *mis,
> >      /* We don't support postcopy so don't care */
> >  }
> >  
> > +void postcopy_pmi_destroy(MigrationIncomingState *mis)
> > +{
> > +    /* Called in normal cleanup path - so it's OK */
> > +}
> > +
> > +void postcopy_pmi_discard_range(MigrationIncomingState *mis,
> > +                                size_t start, size_t npages)
> > +{
> > +    assert(0);
> > +}
> > +
> > +int postcopy_ram_discard_range(MigrationIncomingState *mis, uint8_t *start,
> > +                               uint8_t *end)
> > +{
> > +    assert(0);
> > +}
> >  #endif
> >  
> > +/* ------------------------------------------------------------------------- */
> > +
> > +/*
> > + * Called at the start of each RAMBlock by the bitmap code
> > + * offset is the bit within the first 64bit chunk of mask
> > + * that represents the first page of the RAM Block
> > + * Returns a new PDS
> > + */
> > +PostcopyDiscardState *postcopy_discard_send_init(MigrationState *ms,
> > +                                                 uint8_t offset,
> > +                                                 const char *name)
> > +{
> > +    PostcopyDiscardState *res = g_try_malloc(sizeof(PostcopyDiscardState));
> > +
> > +    if (res) {
> > +        res->name = name;
> > +        res->cur_entry = 0;
> > +        res->nsentwords = 0;
> > +        res->nsentcmds = 0;
> > +        res->offset = offset;
> > +    }
> > +
> > +    return res;
> > +}
> > +
> > +/*
> > + * Called by the bitmap code for each chunk to discard
> > + * May send a discard message, may just leave it queued to
> > + * be sent later
> > + */
> > +void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
> > +                                unsigned long pos, uint32_t bitmap)
> > +{
> > +    pds->addrlist[pds->cur_entry] = pos;
> > +    pds->masklist[pds->cur_entry] = bitmap;
> > +    pds->cur_entry++;
> > +    pds->nsentwords++;
> > +
> > +    if (pds->cur_entry == MAX_DISCARDS_PER_COMMAND) {
> > +        /* Full set, ship it! */
> > +        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
> > +                                              pds->cur_entry, pds->offset,
> > +                                              pds->addrlist, pds->masklist);
> > +        pds->nsentcmds++;
> > +        pds->cur_entry = 0;
> > +    }
> > +}
> > +
> > +/*
> > + * Called at the end of each RAMBlock by the bitmap code
> > + * Sends any outstanding discard messages, frees the PDS
> > + */
> > +void postcopy_discard_send_finish(MigrationState *ms, PostcopyDiscardState *pds)
> > +{
> > +    /* Anything unsent? */
> > +    if (pds->cur_entry) {
> > +        qemu_savevm_send_postcopy_ram_discard(ms->file, pds->name,
> > +                                              pds->cur_entry, pds->offset,
> > +                                              pds->addrlist, pds->masklist);
> > +        pds->nsentcmds++;
> > +    }
> > +
> > +    trace_postcopy_discard_send_finish(pds->name, pds->nsentwords,
> > +                                       pds->nsentcmds);
> > +
> > +    g_free(pds);
> > +}
> > diff --git a/savevm.c b/savevm.c
> > index 1e8d289..2589b8c 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -1282,15 +1282,12 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
> >               * we know there must be at least 1 bit set due to the loop entry
> >               * If there is no 0 firstzero will be 32
> >               */
> > -            /* TODO - ram_discard_range gets added in a later patch
> >              int ret = ram_discard_range(mis, ramid,
> >                                  startaddr + firstset - first_bit_offset,
> >                                  startaddr + (firstzero - 1) - first_bit_offset);
> > -            ret = -1;
> >              if (ret) {
> >                  return ret;
> >              }
> > -            */
> >  
> >              /* mask= .?0000000000 */
> >              /*         ^fz ^fs    */
> > diff --git a/trace-events b/trace-events
> > index a555b56..f985117 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1217,6 +1217,7 @@ qemu_file_fclose(void) ""
> >  migration_bitmap_sync_start(void) ""
> >  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
> >  migration_throttle(void) ""
> > +ram_postcopy_send_discard_bitmap(void) ""
> >  
> >  # hw/display/qxl.c
> >  disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
> > @@ -1478,6 +1479,9 @@ rdma_start_incoming_migration_after_rdma_listen(void) ""
> >  rdma_start_outgoing_migration_after_rdma_connect(void) ""
> >  rdma_start_outgoing_migration_after_rdma_source_init(void) ""
> >  
> > +# migration/postcopy-ram.c
> > +postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
> > +
> >  # kvm-all.c
> >  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
> >  kvm_vm_ioctl(int type, void *arg) "type 0x%x, arg %p"
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy
  2015-03-23 12:21     ` Dr. David Alan Gilbert
@ 2015-03-24  1:25       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-24  1:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 2533 bytes --]

On Mon, Mar 23, 2015 at 12:21:50PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:52:08PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The userfault mechanism used for postcopy generates faults
> > > for us on pages that are 'not present', inflating a balloon in
> > > the guest causes host pages to be marked as 'not present'; doing
> > > this during a postcopy, as potentially the same pages were being
> > > received from the source, would confuse the state of the received
> > > page -> disable ballooning during postcopy.
> > 
> > That is a ludicrously long sentence, which I have great difficulty parsing.
> 
> OK, how about:
> 
> -----
> Postcopy detects accesses to pages that haven't been transferred yet
> using userfaultfd, and it causes exceptions on pages that are 'not present'.
> Ballooning also causes pages to be marked as 'not present' when the guest
> inflates the balloon.
> Potentially a balloon could be inflated to discard pages that are currently
> inflight during postcopy and that may be arriving at about the same time.
> 
> To avoid this confusion, disable ballooning during postcopy.

Better, thanks.

> -----
> 
> > > When disabled we drop balloon requests from the guest.  Since ballooning
> > > is generally initiated by the host, the management system should avoid
> > > initiating any balloon instructions to the guest during migration,
> > > although it's not possible to know how long it would take a guest to
> > > process a request made prior to the start of migration.
> > 
> > Yeah :/.  It would be nice if it could queue the guest actions,
> > instead of dropping them.
> 
> Yes, I did look at that briefly; it's not trivial.  For
> example, consider the situation where the guest discards some pages
> by inflating and then later deflates: it expects to lose that data
> but then starts accessing that physical page again.
> If you replay that sequence at the end then you've lost the newly accessed pages.
> So you have to filter out inflates that have been deflated later,
> and order those correctly relative to any changes made to those
> pages after the deflation occurs.

Ah, yes, I see.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request Dr. David Alan Gilbert (git)
@ 2015-03-24  1:53   ` David Gibson
  2015-03-25 17:37     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-24  1:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 9497 bytes --]

On Wed, Feb 25, 2015 at 04:51:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> On receiving MIG_RPCOMM_REQ_PAGES look up the address and
> queue the page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c                   | 55 +++++++++++++++++++++++++++++++++++++++++++
>  include/exec/cpu-all.h        |  2 --
>  include/migration/migration.h | 21 +++++++++++++++++
>  include/qemu/typedefs.h       |  1 +
>  migration/migration.c         | 33 +++++++++++++++++++++++++-
>  trace-events                  |  3 ++-
>  6 files changed, 111 insertions(+), 4 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index d2c4457..9d8fc6b 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -669,6 +669,61 @@ static int ram_save_page(QEMUFile *f, RAMBlock* block, ram_addr_t offset,
>  }
>  
>  /*
> + * Queue the pages for transmission, e.g. a request from postcopy destination
> + *   ms: MigrationStatus in which the queue is held
> + *   rbname: The RAMBlock the request is for - may be NULL (to mean reuse last)
> + *   start: Offset from the start of the RAMBlock
> + *   len: Length (in bytes) to send
> + *   Return: 0 on success
> + */
> +int ram_save_queue_pages(MigrationState *ms, const char *rbname,
> +                         ram_addr_t start, ram_addr_t len)
> +{
> +    RAMBlock *ramblock;
> +
> +    if (!rbname) {
> +        /* Reuse last RAMBlock */
> +        ramblock = ms->last_req_rb;
> +
> +        if (!ramblock) {
> +            /*
> +             * Shouldn't happen, we can't reuse the last RAMBlock if
> +             * it's the 1st request.
> +             */
> +            error_report("ram_save_queue_pages no previous block");
> +            return -1;
> +        }
> +    } else {
> +        ramblock = ram_find_block(rbname);
> +
> +        if (!ramblock) {
> +            /* We shouldn't be asked for a non-existent RAMBlock */
> +            error_report("ram_save_queue_pages no block '%s'", rbname);
> +            return -1;
> +        }
> +    }
> +    trace_ram_save_queue_pages(ramblock->idstr, start, len);
> +    if (start+len > ramblock->used_length) {
> +        error_report("%s request overrun start=%zx len=%zx blocklen=%zx",
> +                     __func__, start, len, ramblock->used_length);
> +        return -1;
> +    }
> +
> +    struct MigrationSrcPageRequest *new_entry =
> +        g_malloc0(sizeof(struct MigrationSrcPageRequest));
> +    new_entry->rb = ramblock;
> +    new_entry->offset = start;
> +    new_entry->len = len;
> +    ms->last_req_rb = ramblock;
> +
> +    qemu_mutex_lock(&ms->src_page_req_mutex);
> +    QSIMPLEQ_INSERT_TAIL(&ms->src_page_requests, new_entry, next_req);
> +    qemu_mutex_unlock(&ms->src_page_req_mutex);
> +
> +    return 0;
> +}
> +
> +/*
>   * ram_find_and_save_block: Finds a page to send and sends it to f
>   *
>   * Returns:  The number of bytes written.
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index 2c48286..3088000 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -265,8 +265,6 @@ CPUArchState *cpu_copy(CPUArchState *env);
>  
>  /* memory API */
>  
> -typedef struct RAMBlock RAMBlock;
> -
>  struct RAMBlock {
>      struct MemoryRegion *mr;
>      uint8_t *host;
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 2c15d63..b1c7cad 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -100,6 +100,18 @@ MigrationIncomingState *migration_incoming_get_current(void);
>  MigrationIncomingState *migration_incoming_state_new(QEMUFile *f);
>  void migration_incoming_state_destroy(void);
>  
> +/*
> + * An outstanding page request, on the source, having been received
> + * and queued
> + */
> +struct MigrationSrcPageRequest {
> +    RAMBlock *rb;
> +    hwaddr    offset;
> +    hwaddr    len;
> +
> +    QSIMPLEQ_ENTRY(MigrationSrcPageRequest) next_req;
> +};
> +
>  struct MigrationState
>  {
>      int64_t bandwidth_limit;
> @@ -142,6 +154,12 @@ struct MigrationState
>       * of the postcopy phase
>       */
>      unsigned long *sentmap;
> +
> +    /* Queue of outstanding page requests from the destination */
> +    QemuMutex src_page_req_mutex;
> +    QSIMPLEQ_HEAD(src_page_requests, MigrationSrcPageRequest) src_page_requests;
> +    /* The RAMBlock used in the last src_page_request */
> +    RAMBlock *last_req_rb;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> @@ -276,6 +294,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>                               ram_addr_t offset, size_t size,
>                               int *bytes_sent);
>  
> +int ram_save_queue_pages(MigrationState *ms, const char *rbname,
> +                         ram_addr_t start, ram_addr_t len);
> +
>  PostcopyState postcopy_state_get(MigrationIncomingState *mis);
>  
>  /* Set the state and return the old state */
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index 0651275..396044d 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -73,6 +73,7 @@ typedef struct QEMUSGList QEMUSGList;
>  typedef struct QEMUSizedBuffer QEMUSizedBuffer;
>  typedef struct QEMUTimerListGroup QEMUTimerListGroup;
>  typedef struct QEMUTimer QEMUTimer;
> +typedef struct RAMBlock RAMBlock;
>  typedef struct Range Range;
>  typedef struct SerialState SerialState;
>  typedef struct SHPCDevice SHPCDevice;
> diff --git a/migration/migration.c b/migration/migration.c
> index 2e9d0dd..939f426 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -26,6 +26,8 @@
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "exec/memory.h"
> +#include "exec/address-spaces.h"
>  
>  enum MigrationPhase {
>      MIG_STATE_ERROR = -1,
> @@ -495,6 +497,15 @@ static void migrate_fd_cleanup(void *opaque)
>  
>      migrate_fd_cleanup_src_rp(s);
>  
> +    /* This queue generally should be empty - but in the case of a failed
> +     * migration might have some droppings in.
> +     */
> +    struct MigrationSrcPageRequest *mspr, *next_mspr;
> +    QSIMPLEQ_FOREACH_SAFE(mspr, &s->src_page_requests, next_req, next_mspr) {
> +        QSIMPLEQ_REMOVE_HEAD(&s->src_page_requests, next_req);
> +        g_free(mspr);
> +    }
> +
>      if (s->file) {
>          trace_migrate_fd_cleanup();
>          qemu_mutex_unlock_iothread();
> @@ -613,6 +624,9 @@ MigrationState *migrate_init(const MigrationParams *params)
>      s->state = MIG_STATE_SETUP;
>      trace_migrate_set_state(MIG_STATE_SETUP);
>  
> +    qemu_mutex_init(&s->src_page_req_mutex);
> +    QSIMPLEQ_INIT(&s->src_page_requests);
> +
>      s->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      return s;
>  }
> @@ -826,7 +840,24 @@ static void source_return_path_bad(MigrationState *s)
>  static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
>                                         ram_addr_t start, ram_addr_t len)
>  {
> -    trace_migrate_handle_rp_req_pages(start, len);
> +    trace_migrate_handle_rp_req_pages(rbname, start, len);
> +
> +    /* Round everything up to our host page size */
> +    long our_host_ps = getpagesize();
> +    if (start & (our_host_ps-1)) {
> +        long roundings = start & (our_host_ps-1);
> +        start -= roundings;
> +        len += roundings;
> +    }
> +    if (len & (our_host_ps-1)) {
> +        long roundings = len & (our_host_ps-1);
> +        len -= roundings;
> +        len += our_host_ps;
> +    }

Why is it necessary to round out to host page size on the source?  I
understand why the host page size is relevant on the destination, due
to the userfaultfd and atomic populate constraints, but not on the source.
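
For concreteness, a standalone sketch (not from the patch) of what the
rounding above does to a small request, assuming a 4 KiB host page:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Mirror of the rounding logic above, pulled out so it can be run on its
 * own; 'ps' stands in for getpagesize(). */
static void round_to_host_page(uint64_t *start, uint64_t *len, uint64_t ps)
{
    if (*start & (ps - 1)) {
        uint64_t roundings = *start & (ps - 1);
        *start -= roundings;
        *len += roundings;
    }
    if (*len & (ps - 1)) {
        uint64_t roundings = *len & (ps - 1);
        *len -= roundings;
        *len += ps;
    }
}

int main(void)
{
    uint64_t start = 0x1234, len = 0x100;

    round_to_host_page(&start, &len, 4096);
    /* A 0x100-byte request at 0x1234 becomes the whole 4K page at 0x1000 */
    printf("start=0x%" PRIx64 " len=0x%" PRIx64 "\n", start, len);
    return 0;
}

So a sub-page request always grows to cover at least one whole host page.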

> +    if (ram_save_queue_pages(ms, rbname, start, len)) {
> +        source_return_path_bad(ms);
> +    }
>  }
>  
>  /*
> diff --git a/trace-events b/trace-events
> index 9bedee4..8a0d70d 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1218,6 +1218,7 @@ migration_bitmap_sync_start(void) ""
>  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
>  migration_throttle(void) ""
>  ram_postcopy_send_discard_bitmap(void) ""
> +ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
>  
>  # hw/display/qxl.c
>  disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
> @@ -1404,7 +1405,7 @@ migrate_fd_error(void) ""
>  migrate_fd_cancel(void) ""
>  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
>  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> -migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
> +migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
>  migration_thread_after_loop(void) ""
>  migration_thread_file_err(void) ""
>  migration_thread_setup_complete(void) ""

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue Dr. David Alan Gilbert (git)
@ 2015-03-24  2:15   ` David Gibson
  2015-06-16 10:48     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-24  2:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 11162 bytes --]

On Wed, Feb 25, 2015 at 04:51:57PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When transmitting RAM pages, consume pages that have been queued by
> MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
> 
> Note:
>   a) After a queued page the linear walk carries on from after the
> unqueued page; there is a reasonable chance that the destination
> was about to ask for other nearby pages anyway.
> 
>   b) We have to be careful of any assumptions that the page walking
> code makes, in particular it does some short cuts on its first linear
> walk that break as soon as we do a queued page.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c  | 154 +++++++++++++++++++++++++++++++++++++++++++++++++----------
>  trace-events |   2 +
>  2 files changed, 131 insertions(+), 25 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 9d8fc6b..acf65e1 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -328,6 +328,7 @@ static RAMBlock *last_seen_block;
>  /* This is the last block from where we have sent data */
>  static RAMBlock *last_sent_block;
>  static ram_addr_t last_offset;
> +static bool last_was_from_queue;
>  static unsigned long *migration_bitmap;
>  static uint64_t migration_dirty_pages;
>  static uint32_t last_version;
> @@ -461,6 +462,19 @@ static inline bool migration_bitmap_set_dirty(ram_addr_t addr)
>      return ret;
>  }
>  
> +static inline bool migration_bitmap_clear_dirty(ram_addr_t addr)
> +{
> +    bool ret;
> +    int nr = addr >> TARGET_PAGE_BITS;
> +
> +    ret = test_and_clear_bit(nr, migration_bitmap);
> +
> +    if (ret) {
> +        migration_dirty_pages--;
> +    }
> +    return ret;
> +}
> +
>  static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
>  {
>      ram_addr_t addr;
> @@ -669,6 +683,39 @@ static int ram_save_page(QEMUFile *f, RAMBlock* block, ram_addr_t offset,
>  }
>  
>  /*
> + * Unqueue a page from the queue fed by postcopy page requests
> + *
> + * Returns:      The RAMBlock* to transmit from (or NULL if the queue is empty)
> + *      ms:      MigrationState in
> + *  offset:      the byte offset within the RAMBlock for the start of the page
> + * ram_addr_abs: global offset in the dirty/sent bitmaps
> + */
> +static RAMBlock *ram_save_unqueue_page(MigrationState *ms, ram_addr_t *offset,
> +                                       ram_addr_t *ram_addr_abs)
> +{
> +    RAMBlock *result = NULL;
> +    qemu_mutex_lock(&ms->src_page_req_mutex);
> +    if (!QSIMPLEQ_EMPTY(&ms->src_page_requests)) {
> +        struct MigrationSrcPageRequest *entry =
> +                                    QSIMPLEQ_FIRST(&ms->src_page_requests);
> +        result = entry->rb;
> +        *offset = entry->offset;
> +        *ram_addr_abs = (entry->offset + entry->rb->offset) & TARGET_PAGE_MASK;
> +
> +        if (entry->len > TARGET_PAGE_SIZE) {
> +            entry->len -= TARGET_PAGE_SIZE;
> +            entry->offset += TARGET_PAGE_SIZE;
> +        } else {
> +            QSIMPLEQ_REMOVE_HEAD(&ms->src_page_requests, next_req);
> +            g_free(entry);
> +        }
> +    }
> +    qemu_mutex_unlock(&ms->src_page_req_mutex);
> +
> +    return result;
> +}
> +
> +/*
>   * Queue the pages for transmission, e.g. a request from postcopy destination
>   *   ms: MigrationStatus in which the queue is held
>   *   rbname: The RAMBlock the request is for - may be NULL (to mean reuse last)
> @@ -732,46 +779,102 @@ int ram_save_queue_pages(MigrationState *ms, const char *rbname,
>  
>  static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
>  {
> +    MigrationState *ms = migrate_get_current();
>      RAMBlock *block = last_seen_block;
> +    RAMBlock *tmpblock;
>      ram_addr_t offset = last_offset;
> +    ram_addr_t tmpoffset;
>      bool complete_round = false;
>      int bytes_sent = 0;
> -    MemoryRegion *mr;
>      ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
>                                   ram_addr_t space */
> +    unsigned long hps = sysconf(_SC_PAGESIZE);
>  
> -    if (!block)
> +    if (!block) {
>          block = QTAILQ_FIRST(&ram_list.blocks);
> +        last_was_from_queue = false;
> +    }
>  
> -    while (true) {
> -        mr = block->mr;
> -        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> -                                                       &dirty_ram_abs);
> -        if (complete_round && block == last_seen_block &&
> -            offset >= last_offset) {
> -            break;
> +    while (true) { /* Until we send a block or run out of stuff to send */
> +        tmpblock = NULL;
> +
> +        /*
> +         * Don't break host-page chunks up with queue items
> +         * so only unqueue if,
> +         *   a) The last item came from the queue anyway
> +         *   b) The last sent item was the last target-page in a host page
> +         */
> +        if (last_was_from_queue || !last_sent_block ||
> +            ((last_offset & (hps - 1)) == (hps - TARGET_PAGE_SIZE))) {
> +            tmpblock = ram_save_unqueue_page(ms, &tmpoffset, &dirty_ram_abs);

This test for whether we've completed a host page is pretty awkward.
I think it would be cleaner to have an inner loop / helper function
that completes sending an entire host page (whether requested or
scanned), before allowing the outer loop to even come back to here to
reconsider the queue.
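
Untested sketch of the sort of helper I mean - ram_save_host_page() is
a made-up name, ram_save_page() and migration_bitmap_clear_dirty() are
the ones from this patch, and the caller is assumed to pass in
host-page-aligned values:

    /* Send all the (still dirty) target pages making up one host page.
     * 'offset' and 'dirty_ram_abs' are the start of the host page within
     * the block / within the dirty bitmap address space. */
    static int ram_save_host_page(QEMUFile *f, RAMBlock *block,
                                  ram_addr_t offset, ram_addr_t dirty_ram_abs,
                                  bool last_stage)
    {
        size_t hps = getpagesize();
        ram_addr_t end = offset + hps;
        int bytes_sent = 0;

        for (; offset < end;
             offset += TARGET_PAGE_SIZE, dirty_ram_abs += TARGET_PAGE_SIZE) {
            if (migration_bitmap_clear_dirty(dirty_ram_abs)) {
                bytes_sent += ram_save_page(f, block, offset, last_stage);
            }
        }
        return bytes_sent;
    }

ram_find_and_save_block() would then just pick a host-page-aligned
(block, offset) - either from the queue or from the linear scan - and
hand it to that.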

>          }
> -        if (offset >= block->used_length) {
> -            offset = 0;
> -            block = QTAILQ_NEXT(block, next);
> -            if (!block) {
> -                block = QTAILQ_FIRST(&ram_list.blocks);
> -                complete_round = true;
> -                ram_bulk_stage = false;
> +
> +        if (tmpblock) {
> +            /* We've got a block from the postcopy queue */
> +            trace_ram_find_and_save_block_postcopy(tmpblock->idstr,
> +                                                   (uint64_t)tmpoffset,
> +                                                   (uint64_t)dirty_ram_abs);
> +            /*
> +             * We're sending this page, and since it's postcopy nothing else
> +             * will dirty it, and we must make sure it doesn't get sent again.
> +             */
> +            if (!migration_bitmap_clear_dirty(dirty_ram_abs)) {
> +                trace_ram_find_and_save_block_postcopy_not_dirty(
> +                    tmpblock->idstr, (uint64_t)tmpoffset,
> +                    (uint64_t)dirty_ram_abs,
> +                    test_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap));
> +
> +                continue;
>              }
> +            /*
> +             * As soon as we start servicing pages out of order, we have
> +             * to kill the bulk stage, since the bulk stage assumes
> +             * (in migration_bitmap_find_and_reset_dirty) that every page
> +             * is dirty, which is no longer true.
> +             */
> +            ram_bulk_stage = false;
> +            /*
> +             * We mustn't change block/offset unless it's to a valid one
> +             * otherwise we can go down some of the exit cases in the normal
> +             * path.
> +             */
> +            block = tmpblock;
> +            offset = tmpoffset;
> +            last_was_from_queue = true;

Hrm.  So now block can change during the execution of
ram_save_block(), which really suggests that splitting by block is no
longer a sensible subdivision of the loop surrounding ram_save_block.
I think it would make more sense to replace that entire surrounding
loop, so that the logic is essentially

    while (not finished) {
        if (something is queued) {
	    send that
        } else {
	    scan for an unsent block and offset
	    send that instead
	}
    }


>          } else {
> -            bytes_sent = ram_save_page(f, block, offset, last_stage);
> -
> -            /* if page is unmodified, continue to the next */
> -            if (bytes_sent > 0) {
> -                MigrationState *ms = migrate_get_current();
> -                if (ms->sentmap) {
> -                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> +            MemoryRegion *mr;
> +            /* priority queue empty, so just search for something dirty */
> +            mr = block->mr;
> +            offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> +                                                           &dirty_ram_abs);
> +            if (complete_round && block == last_seen_block &&
> +                offset >= last_offset) {
> +                break;
> +            }
> +            if (offset >= block->used_length) {
> +                offset = 0;
> +                block = QTAILQ_NEXT(block, next);
> +                if (!block) {
> +                    block = QTAILQ_FIRST(&ram_list.blocks);
> +                    complete_round = true;
> +                    ram_bulk_stage = false;
>                  }
> +                continue; /* pick an offset in the new block */
> +            }
> +            last_was_from_queue = false;
> +        }
>  
> -                last_sent_block = block;
> -                break;
> +        /* We have a page to send, so send it */
> +        bytes_sent = ram_save_page(f, block, offset, last_stage);
> +
> +        /* if page is unmodified, continue to the next */
> +        if (bytes_sent > 0) {
> +            if (ms->sentmap) {
> +                set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
>              }
> +
> +            last_sent_block = block;
> +            break;
>          }
>      }
>      last_seen_block = block;
> @@ -865,6 +968,7 @@ static void reset_ram_globals(void)
>      last_offset = 0;
>      last_version = ram_list.version;
>      ram_bulk_stage = true;
> +    last_was_from_queue = false;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> diff --git a/trace-events b/trace-events
> index 8a0d70d..781cf5c 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1217,6 +1217,8 @@ qemu_file_fclose(void) ""
>  migration_bitmap_sync_start(void) ""
>  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
>  migration_throttle(void) ""
> +ram_find_and_save_block_postcopy(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr) "%s/%" PRIx64 " ram_addr=%" PRIx64
> +ram_find_and_save_block_postcopy_not_dirty(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr, int sent) "%s/%" PRIx64 " ram_addr=%" PRIx64 " (sent=%d)"
>  ram_postcopy_send_discard_bitmap(void) ""
>  ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers Dr. David Alan Gilbert (git)
@ 2015-03-24  2:33   ` David Gibson
  2015-03-25 17:46     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-24  2:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:58PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> postcopy_place_page (etc) provide a way for postcopy to place a page
> into the guest's memory atomically (using the copy ioctl on the ufd).
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h    |   2 +
>  include/migration/postcopy-ram.h |  16 ++++++
>  migration/postcopy-ram.c         | 113 ++++++++++++++++++++++++++++++++++++++-
>  trace-events                     |   1 +
>  4 files changed, 130 insertions(+), 2 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index b1c7cad..139bb1b 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -94,6 +94,8 @@ struct MigrationIncomingState {
>      QEMUFile *return_path;
>      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
>      PostcopyPMI    postcopy_pmi;
> +    void          *postcopy_tmp_page;
> +    long           postcopy_place_skipped; /* Check for incorrect place ops */
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> index fbb2a93..3d30280 100644
> --- a/include/migration/postcopy-ram.h
> +++ b/include/migration/postcopy-ram.h
> @@ -80,4 +80,20 @@ void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
>  void postcopy_discard_send_finish(MigrationState *ms,
>                                    PostcopyDiscardState *pds);
>  
> +/*
> + * Place a page (from) at (host) efficiently
> + *    There are restrictions on how 'from' must be mapped, in general best
> + *    to use other postcopy_ routines to allocate.
> + * returns 0 on success
> + */
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        long bitmap_offset, bool all_zero);
> +
> +/*
> + * Allocate a page of memory that can be mapped at a later point in time
> + * using postcopy_place_page
> + * Returns: Pointer to allocated page
> + */
> +void *postcopy_get_tmp_page(MigrationIncomingState *mis);
> +
>  #endif
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 33dd332..86fa5a0 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -197,7 +197,6 @@ static PostcopyPMIState postcopy_pmi_get_state_nolock(
>  }
>  
>  /* Retrieve the state of the given page */
> -__attribute__ (( unused )) /* Until later in patch series */
>  static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
>                                                 size_t bitmap_index)
>  {
> @@ -213,7 +212,6 @@ static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
>   * Set the page state to the given state if the previous state was as expected
>   * Return the actual previous state.
>   */
> -__attribute__ (( unused )) /* Until later in patch series */
>  static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
>                                             size_t bitmap_index,
>                                             PostcopyPMIState expected_state,
> @@ -477,6 +475,7 @@ static int cleanup_area(const char *block_name, void *host_addr,
>  int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
>  {
>      postcopy_pmi_init(mis, ram_pages);
> +    mis->postcopy_place_skipped = -1;
>  
>      if (qemu_ram_foreach_block(init_area, mis)) {
>          return -1;
> @@ -495,6 +494,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>          return -1;
>      }
>  
> +    if (mis->postcopy_tmp_page) {
> +        munmap(mis->postcopy_tmp_page, getpagesize());
> +        mis->postcopy_tmp_page = NULL;
> +    }
>      return 0;
>  }
>  
> @@ -561,6 +564,100 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>      return 0;
>  }
>  
> +/*
> + * Place a host page (from) at (host) tomically

s/tomically/atomically/

> + *    There are restrictions on how 'from' must be mapped, in general best
> + *    to use other postcopy_ routines to allocate.
> + * all_zero: Hint that the page being placed is 0 throughout
> + * returns 0 on success
> + * bitmap_offset: Index into the migration bitmaps
> + *
> + * State changes:
> + *   none -> received
> + *   requested -> received (ack)
> + *
> + * Note the UF thread is also updating the state, and maybe none->requested
> + * at the same time.

Hrm.. these facts do tend to make me think that separate and explicit
requested and received bits would be clearer than treating it as an
enum state variable.
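
i.e. roughly this (untested; the two map fields are invented names, and
you'd want the atomic flavour of the bit ops):

    /* placing side: only ever sets 'received' */
    set_bit(bitmap_offset, mis->postcopy_received_map);

    /* fault thread: only ever sets 'requested' */
    if (test_bit(bitmap_offset, mis->postcopy_received_map)) {
        /* page already arrived - just wake the faulting thread */
    } else if (!test_and_set_bit(bitmap_offset, mis->postcopy_requested_map)) {
        /* first fault on this page - ask the source for it */
    }

so neither side needs a compare-and-swap retry loop.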

> + */
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        long bitmap_offset, bool all_zero)
> +{
> +    PostcopyPMIState old_state, tmp_state, new_state;
> +
> +    if (!all_zero) {
> +        struct uffdio_copy copy_struct;
> +
> +        copy_struct.dst = (uint64_t)(uintptr_t)host;
> +        copy_struct.src = (uint64_t)(uintptr_t)from;
> +        copy_struct.len = getpagesize();
> +        copy_struct.mode = 0;
> +
> +        /* copy also acks to the kernel waking the stalled thread up
> +         * TODO: We can inhibit that ack and only do it if it was requested
> +         * which would be slightly cheaper, but we'd have to be careful
> +         * of the order of updating our page state.
> +         */
> +        if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
> +            int e = errno;
> +            error_report("%s: %s copy host: %p from: %p pmi=%d",
> +                         __func__, strerror(e), host, from,
> +                         postcopy_pmi_get_state(mis, bitmap_offset));
> +
> +            return -e;
> +        }
> +    } else {
> +        struct uffdio_zeropage zero_struct;
> +
> +        zero_struct.range.start = (uint64_t)(uintptr_t)host;
> +        zero_struct.range.len = getpagesize();
> +        zero_struct.mode = 0;
> +
> +        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> +            int e = errno;
> +            error_report("%s: %s zero host: %p from: %p pmi=%d",
> +                         __func__, strerror(e), host, from,
> +                         postcopy_pmi_get_state(mis, bitmap_offset));
> +
> +            return -e;
> +        }
> +    }
> +
> +    bitmap_offset &= ~(mis->postcopy_pmi.host_bits-1);
> +    new_state = POSTCOPY_PMI_RECEIVED;
> +    tmp_state = postcopy_pmi_get_state(mis, bitmap_offset);
> +    do {
> +        old_state = tmp_state;
> +        tmp_state = postcopy_pmi_change_state(mis, bitmap_offset, old_state,
> +                                              new_state);
> +    } while (old_state != tmp_state);

Yeah.. this is where treating the state as two separate booleans would
help: here you'd just need to update 'received', without caring about
whether 'requested' has changed.

> +    trace_postcopy_place_page(bitmap_offset, host, all_zero, old_state);
> +
> +    return 0;
> +}
> +
> +/*
> + * Returns a target page of memory that can be mapped at a later point in time
> + * using postcopy_place_page
> + * The same address is used repeatedly, postcopy_place_page just takes the
> + * backing page away.
> + * Returns: Pointer to allocated page
> + *
> + */
> +void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> +{
> +    if (!mis->postcopy_tmp_page) {
> +        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
> +                             PROT_READ | PROT_WRITE, MAP_PRIVATE |
> +                             MAP_ANONYMOUS, -1, 0);
> +        if (!mis->postcopy_tmp_page) {
> +            perror("mapping postcopy tmp page");
> +            return NULL;
> +        }
> +    }
> +
> +    return mis->postcopy_tmp_page;
> +}
> +
>  #else
>  /* No target OS support, stubs just fail */
>  bool postcopy_ram_supported_by_host(void)
> @@ -608,6 +705,18 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>  {
>      assert(0);
>  }
> +
> +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> +                        long bitmap_offset, bool all_zero)
> +{
> +    assert(0);
> +}
> +
> +void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> +{
> +    assert(0);
> +}
> +
>  #endif
>  
>  /* ------------------------------------------------------------------------- */
> diff --git a/trace-events b/trace-events
> index 781cf5c..16a91d9 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1497,6 +1497,7 @@ rdma_start_outgoing_migration_after_rdma_source_init(void) ""
>  postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
>  postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
>  postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
> +postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
>  
>  # kvm-all.c
>  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration Dr. David Alan Gilbert (git)
@ 2015-03-24  4:51   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-24  4:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:51:59PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> In postcopy, the destination guest is running at the same time
> as it's receiving pages; as we receive new pages we must put
> them into the guest's address space atomically to avoid a running
> CPU accessing a partially written page.
> 
> Use the helpers in postcopy-ram.c to map these pages.
> 
> Note, gcc 4.9.2 is giving me false uninitialized warnings in ram_load's
> switch, so anything conditionally set at the start of the switch needs
> initializing; filed as
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64614
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  arch_init.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 111 insertions(+), 26 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index acf65e1..7a1c9ea 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -1479,9 +1479,39 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>      return 0;
>  }
>  
> +/*
> + * Helper for host_from_stream_offset in the successful case.
> + * Returns the host pointer for the given block and offset
> + *   calling the postcopy hook and filling in *rb.
> + */
> +static void *host_with_postcopy_hook(MigrationIncomingState *mis,
> +                                     ram_addr_t offset,
> +                                     RAMBlock *block,
> +                                     RAMBlock **rb)
> +{
> +    if (rb) {
> +        *rb = block;
> +    }
> +
> +    postcopy_hook_early_receive(mis,
> +        (offset + block->offset) >> TARGET_PAGE_BITS);

What does postcopy_hook_early_receive() do?

> +    return memory_region_get_ram_ptr(block->mr) + offset;
> +}
> +
> +/*
> + * Read a RAMBlock ID from the stream f, find the host address of the
> + * start of that block and add on 'offset'
> + *
> + * f: Stream to read from
> + * mis: MigrationIncomingState
> + * offset: Offset within the block
> + * flags: Page flags (mostly to see if it's a continuation of previous block)
> + * rb: Pointer to RAMBlock* that gets filled in with the RB we find
> + */
>  static inline void *host_from_stream_offset(QEMUFile *f,
> +                                            MigrationIncomingState *mis,
>                                              ram_addr_t offset,
> -                                            int flags)
> +                                            int flags, RAMBlock **rb)
>  {
>      static RAMBlock *block = NULL;
>      char id[256];
> @@ -1492,8 +1522,7 @@ static inline void *host_from_stream_offset(QEMUFile *f,
>              error_report("Ack, bad migration stream!");
>              return NULL;
>          }
> -
> -        return memory_region_get_ram_ptr(block->mr) + offset;
> +        return host_with_postcopy_hook(mis, offset, block, rb);
>      }
>  
>      len = qemu_get_byte(f);
> @@ -1503,7 +1532,7 @@ static inline void *host_from_stream_offset(QEMUFile *f,
>      QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>          if (!strncmp(id, block->idstr, sizeof(id)) &&
>              block->max_length > offset) {
> -            return memory_region_get_ram_ptr(block->mr) + offset;
> +            return host_with_postcopy_hook(mis, offset, block, rb);
>          }
>      }
>  
> @@ -1537,6 +1566,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>  {
>      int flags = 0, ret = 0;
>      static uint64_t seq_iter;
> +    /*
> +     * System is running in postcopy mode, page inserts to host memory must be
> +     * atomic
> +     */
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +    bool postcopy_running = postcopy_state_get(mis) >=
> +                            POSTCOPY_INCOMING_LISTENING;
> +    void *postcopy_host_page = NULL;
> +    bool postcopy_place_needed = false;
>  
>      seq_iter++;
>  
> @@ -1545,14 +1583,57 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      }
>  
>      while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
> +        RAMBlock *rb = 0; /* =0 needed to silence compiler */
>          ram_addr_t addr, total_ram_bytes;
> -        void *host;
> +        void *host = 0;
> +        void *page_buffer = 0;
>          uint8_t ch;
> +        bool all_zero = false;
>  
>          addr = qemu_get_be64(f);
>          flags = addr & ~TARGET_PAGE_MASK;
>          addr &= TARGET_PAGE_MASK;
>  
> +        if (flags & (RAM_SAVE_FLAG_COMPRESS | RAM_SAVE_FLAG_PAGE |
> +                     RAM_SAVE_FLAG_XBZRLE)) {
> +            host = host_from_stream_offset(f, mis, addr, flags, &rb);
> +            if (!host) {
> +                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
> +                ret = -EINVAL;
> +                break;
> +            }
> +            if (!postcopy_running) {
> +                page_buffer = host;
> +            } else {
> +                /*
> +                 * Postcopy requires that we place whole host pages atomically.
> +                 * To make it atomic, the data is read into a temporary page
> +                 * that's moved into place later.
> +                 * The migration protocol uses (possibly smaller) target pages,
> +                 * however the source ensures it always sends all the components
> +                 * of a host page in order.
> +                 */
> +                if (!postcopy_host_page) {
> +                    postcopy_host_page = postcopy_get_tmp_page(mis);
> +                }
> +                page_buffer = postcopy_host_page +
> +                              ((uintptr_t)host & ~qemu_host_page_mask);
> +                /* If all TP are zero then we can optimise the place */
> +                if (!((uintptr_t)host & ~qemu_host_page_mask)) {
> +                    all_zero = true;
> +                }
> +
> +                /*
> +                 * If it's the last part of a host page then we place the host
> +                 * page
> +                 */

So it's assumed in a bunch of places that the other end will send
target pages in order until you have a whole host page.  Which is
fine, but it would be nice to have some tests to verify that, just in
case somebody changes that behaviour in future and then wonders why
everything broke.

As I suggested on the earlier patch, I think some of this stuff might
become less clunky if you had a "receive_host_page" subfunction with
its own inner loop.
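
Even a cheap sanity check in ram_load() would do; something like this
(untested, and 'expected_hp_offset' is a field I've just invented):

    /* Complain if the source ever stops sending the target pages of a
     * host page contiguously and in order. */
    if (postcopy_running) {
        size_t in_hp = (uintptr_t)host & ~qemu_host_page_mask;

        if (in_hp != mis->expected_hp_offset) {
            error_report("Out of order target page: at %zx, expected %zx",
                         in_hp, mis->expected_hp_offset);
            return -EINVAL;
        }
        mis->expected_hp_offset = (in_hp + TARGET_PAGE_SIZE) &
                                  ~qemu_host_page_mask;
    }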

> +                postcopy_place_needed = (((uintptr_t)host + TARGET_PAGE_SIZE) &
> +                                         ~qemu_host_page_mask) == 0;
> +            }
> +        } else {
> +            postcopy_place_needed = false;
> +        }
> +
>          switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
>          case RAM_SAVE_FLAG_MEM_SIZE:
>              /* Synchronize RAM block list */
> @@ -1590,32 +1671,27 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              }
>              break;
>          case RAM_SAVE_FLAG_COMPRESS:
> -            host = host_from_stream_offset(f, addr, flags);
> -            if (!host) {
> -                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
> -                ret = -EINVAL;
> -                break;
> -            }
> -
>              ch = qemu_get_byte(f);
> -            ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> -            break;
> -        case RAM_SAVE_FLAG_PAGE:
> -            host = host_from_stream_offset(f, addr, flags);
> -            if (!host) {
> -                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
> -                ret = -EINVAL;
> -                break;
> +            if (!postcopy_running) {
> +                ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> +            } else {
> +                memset(page_buffer, ch, TARGET_PAGE_SIZE);
> +                if (ch) {
> +                    all_zero = false;

It's a bit nasty that the non-postcopy cases are now out-of-line with
the postcopy cases in-line.  I think it would be nicer to alter the
helper functions so that they work on a supplied buffer, rather than
directly on the ram state.  Then you should be able to use the same
helpers for either case, with just different postcopy logic before and
after the actual page load.
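
e.g. (untested) the COMPRESS case could then be the same for both paths,
since page_buffer already points either at guest memory or into the
temporary host page:

    case RAM_SAVE_FLAG_COMPRESS:
        ch = qemu_get_byte(f);
        ram_handle_compressed(page_buffer, ch, TARGET_PAGE_SIZE);
        if (ch) {
            all_zero = false;
        }
        break;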

> +                }
>              }
> +            break;
>  
> -            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
> +        case RAM_SAVE_FLAG_PAGE:
> +            all_zero = false;
> +            qemu_get_buffer(f, page_buffer, TARGET_PAGE_SIZE);
>              break;
> +
>          case RAM_SAVE_FLAG_XBZRLE:
> -            host = host_from_stream_offset(f, addr, flags);
> -            if (!host) {
> -                error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
> -                ret = -EINVAL;
> -                break;
> +            all_zero = false;
> +            if (postcopy_running) {
> +                error_report("XBZRLE RAM block in postcopy mode @%zx\n", addr);
> +                return -EINVAL;
>              }
>  
>              if (load_xbzrle(f, addr, host) < 0) {
> @@ -1637,6 +1713,15 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>                  ret = -EINVAL;
>              }
>          }
> +
> +        if (postcopy_place_needed) {
> +            /* This gets called at the last target page in the host page */
> +            ret = postcopy_place_page(mis, host + TARGET_PAGE_SIZE -
> +                                           qemu_host_page_size,
> +                                      postcopy_host_page,
> +                                      (addr + rb->offset) >> TARGET_PAGE_BITS,
> +                                      all_zero);
> +        }
>          if (!ret) {
>              ret = qemu_file_get_error(f);
>          }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host Dr. David Alan Gilbert (git)
@ 2015-03-24  4:55   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-24  4:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:52:00PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Postcopy sends RAMBlock names and offsets over the wire (since it can't
> rely on the order of ramaddr being the same), and it starts out with
> HVA fault addresses from the kernel.

Though as noted earlier, I do wonder if GPA might be a better idea.


> 
> qemu_ram_block_from_host translates a HVA into a RAMBlock, an offset
> in the RAMBlock, the global ram_addr_t value and its bitmap position.
> 
> Rewrite qemu_ram_addr_from_host to use qemu_ram_block_from_host.
> 
> Provide qemu_ram_get_idstr since it's the actual name text sent on the
> wire.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy Dr. David Alan Gilbert (git)
@ 2015-03-24  4:58   ` David Gibson
  2015-03-24  9:05     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-24  4:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:52:01PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Once we're in postcopy the source processors are stopped and memory
> shouldn't change any more, so there's no need to look at the dirty
> map.
> 
> There are two notes to this:
>   1) If we do resync and a page had changed then the page would get
>      sent again, which the destination wouldn't allow (since it might
>      have also modified the page)

I'm not quite sure what you mean by "resync" in this context.

>   2) Before disabling this I'd seen very rare cases where a page had been
>      marked dirtied although the memory contents are apparently identical
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps Dr. David Alan Gilbert (git)
@ 2015-03-24  5:23   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-24  5:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:52:02PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Prior to the start of postcopy, ensure that everything that will
> be transferred later is a whole host-page in size.
> 
> This is accomplished by discarding partially transferred host pages
> and marking any that are partially dirty as fully dirty.

Again, I wonder if this might be a bit more obvious with
send/receive_host_page() helpers.  Rather than jiggering the basic
data structures, you make the code only do the transmission in terms
of host page sized chunks, doing the dirty check against all the
necessary target page bits.

Or better yet, a migration_chunk_size variable, rather than host page
size.  Initially that would be initialized to host page size, but
gives easier flexibility to improve future handling of cases where
source hps != dest hps.
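
Untested sketch of what I mean (the helper name is invented and the
dirty-page accounting is elided) - the "partially dirty => fully dirty"
fixup expressed in terms of a chunk size instead of the host page size:

    static size_t migration_chunk_size;   /* == qemu_host_page_size for now */

    static void migration_mark_chunk_dirty(unsigned long first_tp_bit)
    {
        unsigned long tps_per_chunk = migration_chunk_size / TARGET_PAGE_SIZE;
        unsigned long chunk_start = first_tp_bit & ~(tps_per_chunk - 1);

        if (find_next_bit(migration_bitmap, chunk_start + tps_per_chunk,
                          chunk_start) < chunk_start + tps_per_chunk) {
            bitmap_set(migration_bitmap, chunk_start, tps_per_chunk);
        }
    }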

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests
  2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests Dr. David Alan Gilbert (git)
@ 2015-03-24  5:38   ` David Gibson
  2015-03-26 11:59     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-24  5:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Feb 25, 2015 at 04:52:03PM +0000, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> userfaultfd is a Linux syscall that gives an fd that receives a stream
> of notifications of accesses to pages registered with it and allows
> the program to acknowledge those stalls and tell the accessing
> thread to carry on.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/migration/migration.h |   4 +
>  migration/postcopy-ram.c      | 217 ++++++++++++++++++++++++++++++++++++++++--
>  trace-events                  |  12 +++
>  3 files changed, 223 insertions(+), 10 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 139bb1b..cec064f 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -86,11 +86,15 @@ struct MigrationIncomingState {
>  
>      PostcopyState postcopy_state;
>  
> +    bool           have_fault_thread;
>      QemuThread     fault_thread;
>      QemuSemaphore  fault_thread_sem;
>  
>      /* For the kernel to send us notifications */
>      int            userfault_fd;
> +    /* To tell the fault_thread to quit */
> +    int            userfault_quit_fd;
> +
>      QEMUFile *return_path;
>      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
>      PostcopyPMI    postcopy_pmi;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 86fa5a0..abc039e 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -47,6 +47,8 @@ struct PostcopyDiscardState {
>   */
>  #if defined(__linux__)
>  
> +#include <poll.h>
> +#include <sys/eventfd.h>
>  #include <sys/mman.h>
>  #include <sys/ioctl.h>
>  #include <sys/types.h>
> @@ -264,7 +266,7 @@ void postcopy_pmi_dump(MigrationIncomingState *mis)
>  void postcopy_hook_early_receive(MigrationIncomingState *mis,
>                                   size_t bitmap_index)
>  {
> -    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
> +    if (postcopy_state_get(mis) == POSTCOPY_INCOMING_ADVISE) {

It kind of looks like that's a fix which should be folded into an
earlier patch.

>          /*
>           * If we're in precopy-advise mode we need to track received pages even
>           * though we don't need to place pages atomically yet.
> @@ -489,15 +491,40 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
>   */
>  int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>  {
> -    /* TODO: Join the fault thread once we're sure it will exit */
> -    if (qemu_ram_foreach_block(cleanup_area, mis)) {
> -        return -1;
> +    trace_postcopy_ram_incoming_cleanup_entry();
> +
> +    if (mis->have_fault_thread) {
> +        uint64_t tmp64;
> +
> +        if (qemu_ram_foreach_block(cleanup_area, mis)) {
> +            return -1;
> +        }
> +        /*
> +         * Tell the fault_thread to exit, it's an eventfd that should
> +         * currently be at 0, we're going to inc it to 1
> +         */
> +        tmp64 = 1;
> +        if (write(mis->userfault_quit_fd, &tmp64, 8) == 8) {
> +            trace_postcopy_ram_incoming_cleanup_join();
> +            qemu_thread_join(&mis->fault_thread);
> +        } else {
> +            /* Not much we can do here, but may as well report it */
> +            perror("incing userfault_quit_fd");
> +        }
> +        trace_postcopy_ram_incoming_cleanup_closeuf();
> +        close(mis->userfault_fd);
> +        close(mis->userfault_quit_fd);
> +        mis->have_fault_thread = false;
>      }
>  
> +    postcopy_state_set(mis, POSTCOPY_INCOMING_END);
> +    migrate_send_rp_shut(mis, qemu_file_get_error(mis->file) != 0);
> +
>      if (mis->postcopy_tmp_page) {
>          munmap(mis->postcopy_tmp_page, getpagesize());
>          mis->postcopy_tmp_page = NULL;
>      }
> +    trace_postcopy_ram_incoming_cleanup_exit();
>      return 0;
>  }
>  
> @@ -531,36 +558,206 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
>  }
>  
>  /*
> + * Tell the kernel that we've now got some memory it previously asked for.
> + */
> +static int ack_userfault(MigrationIncomingState *mis, void *start, size_t len)
> +{
> +    struct uffdio_range range_struct;
> +
> +    range_struct.start = (uint64_t)(uintptr_t)start;
> +    range_struct.len = (uint64_t)len;
> +
> +    errno = 0;
> +    if (ioctl(mis->userfault_fd, UFFDIO_WAKE, &range_struct)) {
> +        int e = errno;
> +
> +        if (e == ENOENT) {
> +            /* Kernel said it wasn't waiting - one case where this can
> +             * happen is where two threads triggered the userfault
> +             * and we receive the page and ack it just after we received
> +             * the 2nd request and that ends up deciding it should ack it
> +             * We could optimise it out, but it's rare.
> +             */
> +            /*fprintf(stderr, "ack_userfault: %p/%zx ENOENT\n", start, len); */
> +            return 0;
> +        }
> +        error_report("postcopy_ram: Failed to notify kernel for %p/%zx (%d)",
> +                     start, len, e);
> +        return -e;
> +    }
> +
> +    return 0;
> +}
> +
> +/*
>   * Handle faults detected by the USERFAULT markings
>   */
>  static void *postcopy_ram_fault_thread(void *opaque)
>  {
>      MigrationIncomingState *mis = (MigrationIncomingState *)opaque;
> -
> -    fprintf(stderr, "postcopy_ram_fault_thread\n");
> -    /* TODO: In later patch */
> +    uint64_t hostaddr; /* The kernel always gives us 64 bit, not a pointer */
> +    int ret;
> +    size_t hostpagesize = getpagesize();
> +    RAMBlock *rb = NULL;
> +    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
> +    uint8_t *local_tmp_page;
> +
> +    trace_postcopy_ram_fault_thread_entry();
>      qemu_sem_post(&mis->fault_thread_sem);
> -    while (1) {
> -        /* TODO: In later patch */
> +
> +    local_tmp_page = mmap(NULL, getpagesize(),
> +                          PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS,
> +                          -1, 0);
> +    if (!local_tmp_page) {
> +        perror("mapping local tmp page");
> +        return NULL;
>      }
> +    if (madvise(local_tmp_page, getpagesize(), MADV_DONTFORK)) {
> +        munmap(local_tmp_page, getpagesize());
> +        perror("postcpy local page DONTFORK");
> +        return NULL;
> +    }
> +
> +    while (true) {
> +        PostcopyPMIState old_state, tmp_state;
> +        ram_addr_t rb_offset;
> +        ram_addr_t in_raspace;
> +        unsigned long bitmap_index;
> +        struct pollfd pfd[2];
> +
> +        /*
> +         * We're mainly waiting for the kernel to give us a faulting HVA,
> +         * however we can be told to quit via userfault_quit_fd which is
> +         * an eventfd
> +         */
> +        pfd[0].fd = mis->userfault_fd;
> +        pfd[0].events = POLLIN;
> +        pfd[0].revents = 0;
> +        pfd[1].fd = mis->userfault_quit_fd;
> +        pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
> +        pfd[1].revents = 0;
> +
> +        if (poll(pfd, 2, -1 /* Wait forever */) == -1) {
> +            perror("userfault poll");
> +            break;
> +        }
> +
> +        if (pfd[1].revents) {
> +            trace_postcopy_ram_fault_thread_quit();
> +            break;
> +        }
> +
> +        ret = read(mis->userfault_fd, &hostaddr, sizeof(hostaddr));
> +        if (ret != sizeof(hostaddr)) {
> +            if (ret < 0) {
> +                perror("Failed to read full userfault hostaddr");
> +                break;
> +            } else {
> +                error_report("%s: Read %d bytes from userfaultfd expected %zd",
> +                             __func__, ret, sizeof(hostaddr));
> +                break; /* Lost alignment, don't know what we'd read next */
> +            }
> +        }
> +
> +        rb = qemu_ram_block_from_host((void *)(uintptr_t)hostaddr, true,
> +                                      &in_raspace, &rb_offset, &bitmap_index);
> +        if (!rb) {
> +            error_report("postcopy_ram_fault_thread: Fault outside guest: %"
> +                         PRIx64, hostaddr);
> +            break;
> +        }
>  
> +        trace_postcopy_ram_fault_thread_request(hostaddr, bitmap_index,
> +                                                qemu_ram_get_idstr(rb),
> +                                                rb_offset);
> +
> +        tmp_state = postcopy_pmi_get_state(mis, bitmap_index);
> +        do {
> +            old_state = tmp_state;
> +
> +            switch (old_state) {
> +            case POSTCOPY_PMI_REQUESTED:
> +                /* Do nothing - it's already requested */
> +                break;
> +
> +            case POSTCOPY_PMI_RECEIVED:
> +                /* Already arrived - no state change, just kick the kernel */
> +                trace_postcopy_ram_fault_thread_notify_pre(hostaddr);
> +                if (ack_userfault(mis,
> +                                  (void *)((uintptr_t)hostaddr
> +                                           & ~(hostpagesize - 1)),
> +                                  hostpagesize)) {
> +                    assert(0);
> +                }
> +                break;
> +
> +            case POSTCOPY_PMI_MISSING:
> +                tmp_state = postcopy_pmi_change_state(mis, bitmap_index,
> +                                           old_state, POSTCOPY_PMI_REQUESTED);
> +                if (tmp_state == POSTCOPY_PMI_MISSING) {
> +                    /*
> +                     * Send the request to the source - we want to request one
> +                     * of our host page sizes (which is >= TPS)
> +                     */
> +                    if (rb != last_rb) {
> +                        last_rb = rb;
> +                        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> +                                                 rb_offset, hostpagesize);
> +                    } else {
> +                        /* Save some space */
> +                        migrate_send_rp_req_pages(mis, NULL,
> +                                                 rb_offset, hostpagesize);
> +                    }
> +                } /* else it just arrived from the source and the kernel will
> +                     be kicked during the receive */
> +                break;
> +           }
> +        } while (tmp_state != old_state);

Again, I think using separate requested/received bits rather than
treating them as a single state could avoid this clunky loop.

> +    }
> +    munmap(local_tmp_page, getpagesize());
> +    trace_postcopy_ram_fault_thread_exit();
>      return NULL;
>  }
>  
>  int postcopy_ram_enable_notify(MigrationIncomingState *mis)
>  {
> -    /* Create the fault handler thread and wait for it to be ready */
> +    /* Open the fd for the kernel to give us userfaults */
> +    mis->userfault_fd = syscall(__NR_userfaultfd, O_CLOEXEC);

I think it would be good to declare your own userfaultfd() wrappers
around syscall().  That way it will be easier to clean up once libc
knows about them.
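
e.g. the usual pattern for a syscall that libc doesn't wrap yet:

    /* Thin wrapper until glibc grows a userfaultfd() of its own */
    static int userfaultfd(int flags)
    {
        return syscall(__NR_userfaultfd, flags);
    }

and then the open above just becomes
mis->userfault_fd = userfaultfd(O_CLOEXEC);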

> +    if (mis->userfault_fd == -1) {
> +        perror("Failed to open userfault fd");
> +        return -1;
> +    }
> +
> +    /*
> +     * Although the host check already tested the API, we need to
> +     * do the check again as an ABI handshake on the new fd.
> +     */
> +    if (!ufd_version_check(mis->userfault_fd)) {
> +        return -1;
> +    }
> +
> +    /* Now an eventfd we use to tell the fault-thread to quit */
> +    mis->userfault_quit_fd = eventfd(0, EFD_CLOEXEC);
> +    if (mis->userfault_quit_fd == -1) {
> +        perror("Opening userfault_quit_fd");
> +        close(mis->userfault_fd);
> +        return -1;
> +    }
> +
>      qemu_sem_init(&mis->fault_thread_sem, 0);
>      qemu_thread_create(&mis->fault_thread, "postcopy/fault",
>                         postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
>      qemu_sem_wait(&mis->fault_thread_sem);
>      qemu_sem_destroy(&mis->fault_thread_sem);
> +    mis->have_fault_thread = true;
>  
>      /* Mark so that we get notified of accesses to unwritten areas */
>      if (qemu_ram_foreach_block(ram_block_enable_notify, mis)) {
>          return -1;
>      }
>  
> +    trace_postcopy_ram_enable_notify();
> +
>      return 0;
>  }
>  
> diff --git a/trace-events b/trace-events
> index 16a91d9..d955a28 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1498,6 +1498,18 @@ postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s ma
>  postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
>  postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
>  postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
> +postcopy_ram_enable_notify(void) ""
> +postcopy_ram_fault_thread_entry(void) ""
> +postcopy_ram_fault_thread_exit(void) ""
> +postcopy_ram_fault_thread_quit(void) ""
> +postcopy_ram_fault_thread_request(uint64_t hostaddr, unsigned long index, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " index=%lx rb=%s offset=%zx"
> +postcopy_ram_fault_thread_notify_pre(uint64_t hostaddr) "%" PRIx64
> +postcopy_ram_fault_thread_notify_zero(void *hostaddr) "%p"
> +postcopy_ram_fault_thread_notify_zero_ack(void *hostaddr, unsigned long bitmap_index) "%p %lx"
> +postcopy_ram_incoming_cleanup_closeuf(void) ""
> +postcopy_ram_incoming_cleanup_entry(void) ""
> +postcopy_ram_incoming_cleanup_exit(void) ""
> +postcopy_ram_incoming_cleanup_join(void) ""
>  
>  # kvm-all.c
>  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy
  2015-03-24  4:58   ` David Gibson
@ 2015-03-24  9:05     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-24  9:05 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:52:01PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Once we're in postcopy the source processors are stopped and memory
> > shouldn't change any more, so there's no need to look at the dirty
> > map.
> > 
> > There are two notes to this:
> >   1) If we do resync and a page had changed then the page would get
> >      sent again, which the destination wouldn't allow (since it might
> >      have also modified the page)
> 
> I'm not quite sure what you mean by "resync" in this context.

I mean synchronise the raw dirty bitmap with the migration bitmap,
which we normally do at every cycle through migration.

> >   2) Before disabling this I'd seen very rare cases where a page had been
> >      marked dirtied although the memory contents are apparently identical
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Thanks,

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-23  2:25           ` David Gibson
@ 2015-03-24 20:04             ` Dr. David Alan Gilbert
  2015-03-24 22:32               ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-24 20:04 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > 
> > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > non-postcopiable counts.
> > > > > > 
> > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > 
> > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > in this patch - is it still necessary with the change to
> > > > > save_live_pending?
> > > > 
> > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > to decide which devices must be completed at that point.
> > > 
> > > Couldn't they check for non-zero postcopiable state from
> > > save_live_pending instead?
> > 
> > That would be a bit weird.
> > 
> > At the moment for each device we call the:
> >        save_live_setup method (from qemu_savevm_state_begin)
> > 
> >    0...multiple times we call:
> >        save_live_pending
> >        save_live_iterate
> > 
> >    and then we always call
> >        save_live_complete
> > 
> > 
> > To my mind we have to call save_live_complete for any device
> > that we've called save_live_setup on (maybe it allocated something
> > in _setup that it clears up in _complete).
> > 
> > save_live_pending could perfectly well return 0 remaining at the end of
> > the migrate for our device, and thus if we used that then we wouldn't
> > call save_live_complete.
> 
> Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> transition point you call save_live_complete for everything that
> reports 0 post-copiable state.
> 
> 
> Then again, a different approach would be to split the
> save_live_complete hook into (possibly NULL) "complete precopy" and
> "complete postcopy" hooks.  The core would ensure that every chunk of
> state has both completion hooks called (unless NULL).  That might also
> address my concerns about the no longer entirely accurate
> save_live_complete function name.

OK, that one I prefer.  Are you OK with:
    qemu_savevm_state_complete_precopy
       calls -> save_live_complete_precopy

    qemu_savevm_state_complete_postcopy
       calls -> save_live_complete_postcopy

?

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-24 20:04             ` Dr. David Alan Gilbert
@ 2015-03-24 22:32               ` David Gibson
  2015-03-25 15:00                 ` Dr. David Alan Gilbert
  2015-03-30  8:10                 ` Paolo Bonzini
  0 siblings, 2 replies; 181+ messages in thread
From: David Gibson @ 2015-03-24 22:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > 
> > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > non-postcopiable counts.
> > > > > > > 
> > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > 
> > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > in this patch - is it still necessary with the change to
> > > > > > save_live_pending?
> > > > > 
> > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > to decide which devices must be completed at that point.
> > > > 
> > > > Couldn't they check for non-zero postcopiable state from
> > > > save_live_pending instead?
> > > 
> > > That would be a bit weird.
> > > 
> > > At the moment for each device we call the:
> > >        save_live_setup method (from qemu_savevm_state_begin)
> > > 
> > >    0...multiple times we call:
> > >        save_live_pending
> > >        save_live_iterate
> > > 
> > >    and then we always call
> > >        save_live_complete
> > > 
> > > 
> > > To my mind we have to call save_live_complete for any device
> > > that we've called save_live_setup on (maybe it allocated something
> > > in _setup that it clears up in _complete).
> > > 
> > > save_live_pending could perfectly well return 0 remaining at the end of
> > > the migrate for our device, and thus if we used that then we wouldn't
> > > call save_live_complete.
> > 
> > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > transition point you call save_live_complete for everything that
> > reports 0 post-copiable state.
> > 
> > 
> > Then again, a different approach would be to split the
> > save_live_complete hook into (possibly NULL) "complete precopy" and
> > "complete postcopy" hooks.  The core would ensure that every chunk of
> > state has both completion hooks called (unless NULL).  That might also
> > address my concerns about the no longer entirely accurate
> > save_live_complete function name.
> 
> OK, that one I prefer.  Are you OK with:
>     qemu_savevm_state_complete_precopy
>        calls -> save_live_complete_precopy
> 
>     qemu_savevm_state_complete_postcopy
>        calls -> save_live_complete_postcopy
> 
> ?

Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
and complete_postcopy hooks should always be called.  For a
non-postcopy migration, the postcopy hooks would just be called
immediately after the precopy hooks.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-24 22:32               ` David Gibson
@ 2015-03-25 15:00                 ` Dr. David Alan Gilbert
  2015-03-25 16:40                   ` Dr. David Alan Gilbert
  2015-03-26  1:35                   ` David Gibson
  2015-03-30  8:10                 ` Paolo Bonzini
  1 sibling, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 15:00 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > 
> > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > non-postcopiable counts.
> > > > > > > > 
> > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > 
> > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > in this patch - is it still necessary with the change to
> > > > > > > save_live_pending?
> > > > > > 
> > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > to decide which devices must be completed at that point.
> > > > > 
> > > > > Couldn't they check for non-zero postcopiable state from
> > > > > save_live_pending instead?
> > > > 
> > > > That would be a bit weird.
> > > > 
> > > > At the moment for each device we call the:
> > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > 
> > > >    0...multiple times we call:
> > > >        save_live_pending
> > > >        save_live_iterate
> > > > 
> > > >    and then we always call
> > > >        save_live_complete
> > > > 
> > > > 
> > > > To my mind we have to call save_live_complete for any device
> > > > that we've called save_live_setup on (maybe it allocated something
> > > > in _setup that it clears up in _complete).
> > > > 
> > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > call save_live_complete.
> > > 
> > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > transition point you call save_live_complete for everything that
> > > reports 0 post-copiable state.
> > > 
> > > 
> > > Then again, a different approach would be to split the
> > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > state has both completion hooks called (unless NULL).  That might also
> > > address my concerns about the no longer entirely accurate
> > > save_live_complete function name.
> > 
> > OK, that one I prefer.  Are you OK with:
> >     qemu_savevm_state_complete_precopy
> >        calls -> save_live_complete_precopy
> > 
> >     qemu_savevm_state_complete_postcopy
> >        calls -> save_live_complete_postcopy
> > 
> > ?
> 
> Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> and complete_postcopy hooks should always be called.  For a
> non-postcopy migration, the postcopy hooks would just be called
> immediately after the precopy hooks.

OK, I've made the change as described in my last mail, but I haven't called
the complete_postcopy hook in the precopy case.  If it were as simple as making
all devices use one or the other then it would work; however, there are
existing (precopy) assumptions about the ordering of device state on the wire
that I want to be careful not to alter - 'RAM must come first' is the one
I know of.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-25 15:00                 ` Dr. David Alan Gilbert
@ 2015-03-25 16:40                   ` Dr. David Alan Gilbert
  2015-03-26  1:35                     ` David Gibson
  2015-03-26  1:35                   ` David Gibson
  1 sibling, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 16:40 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > 
> > > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > > non-postcopiable counts.
> > > > > > > > > 
> > > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > > 
> > > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > > in this patch - is it still necessary with the change to
> > > > > > > > save_live_pending?
> > > > > > > 
> > > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > > to decide which devices must be completed at that point.
> > > > > > 
> > > > > > Couldn't they check for non-zero postcopiable state from
> > > > > > save_live_pending instead?
> > > > > 
> > > > > That would be a bit weird.
> > > > > 
> > > > > At the moment for each device we call the:
> > > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > > 
> > > > >    0...multiple times we call:
> > > > >        save_live_pending
> > > > >        save_live_iterate
> > > > > 
> > > > >    and then we always call
> > > > >        save_live_complete
> > > > > 
> > > > > 
> > > > > To my mind we have to call save_live_complete for any device
> > > > > that we've called save_live_setup on (maybe it allocated something
> > > > > in _setup that it clears up in _complete).
> > > > > 
> > > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > > call save_live_complete.
> > > > 
> > > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > > transition point you call save_live_complete for everything that
> > > > reports 0 post-copiable state.
> > > > 
> > > > 
> > > > Then again, a different approach would be to split the
> > > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > > state has both completion hooks called (unless NULL).  That might also
> > > > address my concerns about the no longer entirely accurate
> > > > save_live_complete function name.
> > > 
> > > OK, that one I prefer.  Are you OK with:
> > >     qemu_savevm_state_complete_precopy
> > >        calls -> save_live_complete_precopy
> > > 
> > >     qemu_savevm_state_complete_postcopy
> > >        calls -> save_live_complete_postcopy
> > > 
> > > ?
> > 
> > Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> > and complete_postcopy hooks should always be called.  For a
> > non-postcopy migration, the postcopy hooks would just be called
> > immediately after the precopy hooks.
> 
> OK, I've made the change as described in my last mail; but I haven't called
> the complete_postcopy hook in the precopy case.  If it was as simple as making
> all devices use one or the other then it would work, however there are
> existing (precopy) assumptions about ordering of device state on the wire that
> I want to be careful not to alter; for example RAM must come first is the one
> I know.

Actually, I spoke too soon; testing this found a bad breakage.

The functions in savevm.c add the per-section headers, and then call the _complete
methods on the devices.  Those _complete methods can't elect to do nothing, because
a header has already been planted.

I've ended up with something between the two; we still have a complete_precopy and
a complete_postcopy method on the devices; if the complete_postcopy method exists and
we're in postcopy mode, the complete_precopy method isn't called at all.
A device could decide to do something different in complete_postcopy from
complete_precopy, but it must do something to complete the section.
Effectively the presence of a complete_postcopy method is now doing what
can_postcopy() used to do.
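
(Roughly, in qemu_savevm_state_complete_precopy - a sketch from memory
to show the shape rather than the exact patch; list and field names
are approximate:)

    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
        if (!se->ops || !se->ops->save_live_complete_precopy) {
            continue;
        }
        if (in_postcopy && se->ops->save_live_complete_postcopy) {
            /* This device gets completed later, from
             * qemu_savevm_state_complete_postcopy; crucially we must
             * not plant a section header for it here. */
            continue;
        }
        /* The section header goes on the wire before the hook runs... */
        qemu_put_byte(f, QEMU_VM_SECTION_END);
        qemu_put_be32(f, se->section_id);
        /* ...so the hook has to emit something to complete the section. */
        se->ops->save_live_complete_precopy(f, se->opaque);
    }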

Dave

> 
> Dave
> 
> > 
> > -- 
> > David Gibson			| I'll have my music baroque, and my code
> > david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> > 				| _way_ _around_!
> > http://www.ozlabs.org/~dgibson
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request
  2015-03-24  1:53   ` David Gibson
@ 2015-03-25 17:37     ` Dr. David Alan Gilbert
  2015-03-26  1:31       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 17:37 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > On receiving MIG_RPCOMM_REQ_PAGES look up the address and
> > queue the page.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  arch_init.c                   | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  include/exec/cpu-all.h        |  2 --
> >  include/migration/migration.h | 21 +++++++++++++++++
> >  include/qemu/typedefs.h       |  1 +
> >  migration/migration.c         | 33 +++++++++++++++++++++++++-
> >  trace-events                  |  3 ++-
> >  6 files changed, 111 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch_init.c b/arch_init.c
> > index d2c4457..9d8fc6b 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c

<snip>

> >  static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> >                                         ram_addr_t start, ram_addr_t len)
> >  {
> > -    trace_migrate_handle_rp_req_pages(start, len);
> > +    trace_migrate_handle_rp_req_pages(rbname, start, len);
> > +
> > +    /* Round everything up to our host page size */
> > +    long our_host_ps = getpagesize();
> > +    if (start & (our_host_ps-1)) {
> > +        long roundings = start & (our_host_ps-1);
> > +        start -= roundings;
> > +        len += roundings;
> > +    }
> > +    if (len & (our_host_ps-1)) {
> > +        long roundings = len & (our_host_ps-1);
> > +        len -= roundings;
> > +        len += our_host_ps;
> > +    }
> 
> Why is it necessary to round out to host page size on the source?  I
> understand why the host page size is relevant on the destination, due
> to the userfaultfd and atomic populate constraints, but not on the source.

In principle the request you get from the destination should already
be nicely aligned; but of course you can't actually trust it, so you
have to at least test for alignment.

Since the code has to send whole host pages to keep the
destination happy, it expects the requests that come out of the queue
to be host page aligned.

At the moment we're only supporting matching page sizes; if we wanted
to support mismatches then it would probably need to round to the
destination's host page size.
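
(Put another way - just an equivalent reformulation for illustration,
not the patch itself - the two adjustments round the requested range
outwards to host-page boundaries:)

    long host_ps = getpagesize();   /* power of two, so masking is safe */
    ram_addr_t end = start + len;

    start &= ~(ram_addr_t)(host_ps - 1);                     /* round start down */
    end = (end + host_ps - 1) & ~(ram_addr_t)(host_ps - 1);  /* round end up */
    len = end - start;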

Dave

> > +    if (ram_save_queue_pages(ms, rbname, start, len)) {
> > +        source_return_path_bad(ms);
> > +    }
> >  }
> >  
> >  /*
> > diff --git a/trace-events b/trace-events
> > index 9bedee4..8a0d70d 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1218,6 +1218,7 @@ migration_bitmap_sync_start(void) ""
> >  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
> >  migration_throttle(void) ""
> >  ram_postcopy_send_discard_bitmap(void) ""
> > +ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
> >  
> >  # hw/display/qxl.c
> >  disable qxl_interface_set_mm_time(int qid, uint32_t mm_time) "%d %d"
> > @@ -1404,7 +1405,7 @@ migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> >  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> >  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> > -migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
> > +migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
> >  migration_thread_after_loop(void) ""
> >  migration_thread_file_err(void) ""
> >  migration_thread_setup_complete(void) ""
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers
  2015-03-24  2:33   ` David Gibson
@ 2015-03-25 17:46     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 17:46 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:58PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > postcopy_place_page (etc) provide a way for postcopy to place a page
> > into guests memory atomically (using the copy ioctl on the ufd).
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h    |   2 +
> >  include/migration/postcopy-ram.h |  16 ++++++
> >  migration/postcopy-ram.c         | 113 ++++++++++++++++++++++++++++++++++++++-
> >  trace-events                     |   1 +
> >  4 files changed, 130 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index b1c7cad..139bb1b 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -94,6 +94,8 @@ struct MigrationIncomingState {
> >      QEMUFile *return_path;
> >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> >      PostcopyPMI    postcopy_pmi;
> > +    void          *postcopy_tmp_page;
> > +    long           postcopy_place_skipped; /* Check for incorrect place ops */
> >  };
> >  
> >  MigrationIncomingState *migration_incoming_get_current(void);
> > diff --git a/include/migration/postcopy-ram.h b/include/migration/postcopy-ram.h
> > index fbb2a93..3d30280 100644
> > --- a/include/migration/postcopy-ram.h
> > +++ b/include/migration/postcopy-ram.h
> > @@ -80,4 +80,20 @@ void postcopy_discard_send_chunk(MigrationState *ms, PostcopyDiscardState *pds,
> >  void postcopy_discard_send_finish(MigrationState *ms,
> >                                    PostcopyDiscardState *pds);
> >  
> > +/*
> > + * Place a page (from) at (host) efficiently
> > + *    There are restrictions on how 'from' must be mapped, in general best
> > + *    to use other postcopy_ routines to allocate.
> > + * returns 0 on success
> > + */
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        long bitmap_offset, bool all_zero);
> > +
> > +/*
> > + * Allocate a page of memory that can be mapped at a later point in time
> > + * using postcopy_place_page
> > + * Returns: Pointer to allocated page
> > + */
> > +void *postcopy_get_tmp_page(MigrationIncomingState *mis);
> > +
> >  #endif
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 33dd332..86fa5a0 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -197,7 +197,6 @@ static PostcopyPMIState postcopy_pmi_get_state_nolock(
> >  }
> >  
> >  /* Retrieve the state of the given page */
> > -__attribute__ (( unused )) /* Until later in patch series */
> >  static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
> >                                                 size_t bitmap_index)
> >  {
> > @@ -213,7 +212,6 @@ static PostcopyPMIState postcopy_pmi_get_state(MigrationIncomingState *mis,
> >   * Set the page state to the given state if the previous state was as expected
> >   * Return the actual previous state.
> >   */
> > -__attribute__ (( unused )) /* Until later in patch series */
> >  static PostcopyPMIState postcopy_pmi_change_state(MigrationIncomingState *mis,
> >                                             size_t bitmap_index,
> >                                             PostcopyPMIState expected_state,
> > @@ -477,6 +475,7 @@ static int cleanup_area(const char *block_name, void *host_addr,
> >  int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
> >  {
> >      postcopy_pmi_init(mis, ram_pages);
> > +    mis->postcopy_place_skipped = -1;
> >  
> >      if (qemu_ram_foreach_block(init_area, mis)) {
> >          return -1;
> > @@ -495,6 +494,10 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >          return -1;
> >      }
> >  
> > +    if (mis->postcopy_tmp_page) {
> > +        munmap(mis->postcopy_tmp_page, getpagesize());
> > +        mis->postcopy_tmp_page = NULL;
> > +    }
> >      return 0;
> >  }
> >  
> > @@ -561,6 +564,100 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >      return 0;
> >  }
> >  
> > +/*
> > + * Place a host page (from) at (host) tomically
> 
> s/tomically/atomically/

Done.

> > + *    There are restrictions on how 'from' must be mapped, in general best
> > + *    to use other postcopy_ routines to allocate.
> > + * all_zero: Hint that the page being placed is 0 throughout
> > + * returns 0 on success
> > + * bitmap_offset: Index into the migration bitmaps
> > + *
> > + * State changes:
> > + *   none -> received
> > + *   requested -> received (ack)
> > + *
> > + * Note the UF thread is also updating the state, and maybe none->requested
> > + * at the same time.
> 
> Hrm.. these facts do tend me towards thinking that separate and
> explicit requested and received bits will be clearer than treating it
> as an enum state variable

This is all about to get simpler; Andrea realised he can make a small
change to the kernel interface semantics which means I no longer have
to keep the destination-side bitmaps.  That avoids all this messing
about with updating page states.

> > + */
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        long bitmap_offset, bool all_zero)
> > +{
> > +    PostcopyPMIState old_state, tmp_state, new_state;
> > +
> > +    if (!all_zero) {
> > +        struct uffdio_copy copy_struct;
> > +
> > +        copy_struct.dst = (uint64_t)(uintptr_t)host;
> > +        copy_struct.src = (uint64_t)(uintptr_t)from;
> > +        copy_struct.len = getpagesize();
> > +        copy_struct.mode = 0;
> > +
> > +        /* copy also acks to the kernel waking the stalled thread up
> > +         * TODO: We can inhibit that ack and only do it if it was requested
> > +         * which would be slightly cheaper, but we'd have to be careful
> > +         * of the order of updating our page state.
> > +         */
> > +        if (ioctl(mis->userfault_fd, UFFDIO_COPY, &copy_struct)) {
> > +            int e = errno;
> > +            error_report("%s: %s copy host: %p from: %p pmi=%d",
> > +                         __func__, strerror(e), host, from,
> > +                         postcopy_pmi_get_state(mis, bitmap_offset));
> > +
> > +            return -e;
> > +        }
> > +    } else {
> > +        struct uffdio_zeropage zero_struct;
> > +
> > +        zero_struct.range.start = (uint64_t)(uintptr_t)host;
> > +        zero_struct.range.len = getpagesize();
> > +        zero_struct.mode = 0;
> > +
> > +        if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
> > +            int e = errno;
> > +            error_report("%s: %s zero host: %p from: %p pmi=%d",
> > +                         __func__, strerror(e), host, from,
> > +                         postcopy_pmi_get_state(mis, bitmap_offset));
> > +
> > +            return -e;
> > +        }
> > +    }
> > +
> > +    bitmap_offset &= ~(mis->postcopy_pmi.host_bits-1);
> > +    new_state = POSTCOPY_PMI_RECEIVED;
> > +    tmp_state = postcopy_pmi_get_state(mis, bitmap_offset);
> > +    do {
> > +        old_state = tmp_state;
> > +        tmp_state = postcopy_pmi_change_state(mis, bitmap_offset, old_state,
> > +                                              new_state);
> > +    } while (old_state != tmp_state);
> 
> Yeah.. see treating the state as two separate booleans, here you'd
> just need to update received, without caring about whether requested
> has changed.

It's all very dependent on the exact things you need to do with the states;
whenever you get into a situation where you have to act on a combination
of the bits, you suddenly have to be a lot more careful.

Anyway, as I say, both of those bits are about to disappear.

Dave

> 
> > +    trace_postcopy_place_page(bitmap_offset, host, all_zero, old_state);
> > +
> > +    return 0;
> > +}
> > +
> > +/*
> > + * Returns a target page of memory that can be mapped at a later point in time
> > + * using postcopy_place_page
> > + * The same address is used repeatedly, postcopy_place_page just takes the
> > + * backing page away.
> > + * Returns: Pointer to allocated page
> > + *
> > + */
> > +void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> > +{
> > +    if (!mis->postcopy_tmp_page) {
> > +        mis->postcopy_tmp_page = mmap(NULL, getpagesize(),
> > +                             PROT_READ | PROT_WRITE, MAP_PRIVATE |
> > +                             MAP_ANONYMOUS, -1, 0);
> > +        if (!mis->postcopy_tmp_page) {
> > +            perror("mapping postcopy tmp page");
> > +            return NULL;
> > +        }
> > +    }
> > +
> > +    return mis->postcopy_tmp_page;
> > +}
> > +
> >  #else
> >  /* No target OS support, stubs just fail */
> >  bool postcopy_ram_supported_by_host(void)
> > @@ -608,6 +705,18 @@ int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >  {
> >      assert(0);
> >  }
> > +
> > +int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
> > +                        long bitmap_offset, bool all_zero)
> > +{
> > +    assert(0);
> > +}
> > +
> > +void *postcopy_get_tmp_page(MigrationIncomingState *mis)
> > +{
> > +    assert(0);
> > +}
> > +
> >  #endif
> >  
> >  /* ------------------------------------------------------------------------- */
> > diff --git a/trace-events b/trace-events
> > index 781cf5c..16a91d9 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1497,6 +1497,7 @@ rdma_start_outgoing_migration_after_rdma_source_init(void) ""
> >  postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
> >  postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
> >  postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
> > +postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
> >  
> >  # kvm-all.c
> >  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command
  2015-03-23  5:00   ` David Gibson
@ 2015-03-25 18:16     ` Dr. David Alan Gilbert
  2015-03-26  1:28       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 18:16 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:55PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add MIG_RP_CMD_REQ_PAGES command on Return path for the postcopy
> > destination to request a page from the source.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |  4 +++
> >  migration/migration.c         | 70 +++++++++++++++++++++++++++++++++++++++++++
> >  trace-events                  |  1 +
> >  3 files changed, 75 insertions(+)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 2c607e7..2c15d63 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -46,6 +46,8 @@ enum mig_rpcomm_cmd {
> >      MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
> >      MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
> >      MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> > +
> > +    MIG_RP_CMD_REQ_PAGES,    /* data (start: be64, len: be64) */
> >  };
> >  
> >  /* Postcopy page-map-incoming - data about each page on the inbound side */
> > @@ -253,6 +255,8 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
> >                            uint32_t value);
> >  void migrate_send_rp_pong(MigrationIncomingState *mis,
> >                            uint32_t value);
> > +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* rbname,
> > +                              ram_addr_t start, ram_addr_t len);
> >  
> >  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
> >  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bd066f6..2e9d0dd 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -138,6 +138,36 @@ void migrate_send_rp_pong(MigrationIncomingState *mis,
> >      migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
> >  }
> >  
> > +/* Request a range of pages from the source VM at the given
> > + * start address.
> > + *   rbname: Name of the RAMBlock to request the page in, if NULL it's the same
> > + *           as the last request (a name must have been given previously)
> > + *   Start: Address offset within the RB
> > + *   Len: Length in bytes required - must be a multiple of pagesize
> > + */
> > +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char *rbname,
> > +                               ram_addr_t start, ram_addr_t len)
> > +{
> > +    uint8_t bufc[16+1+255]; /* start (8 byte), len (8 byte), rbname upto 256 */
> > +    uint64_t *buf64 = (uint64_t *)bufc;
> > +    size_t msglen = 16; /* start + len */
> > +
> > +    assert(!(len & 1));
> > +    if (rbname) {
> > +        int rbname_len = strlen(rbname);
> > +        assert(rbname_len < 256);
> > +
> > +        len |= 1; /* Flag to say we've got a name */
> > +        bufc[msglen++] = rbname_len;
> > +        memcpy(bufc + msglen, rbname, rbname_len);
> > +        msglen += rbname_len;
> > +    }
> > +
> > +    buf64[0] = cpu_to_be64((uint64_t)start);
> > +    buf64[1] = cpu_to_be64((uint64_t)len);
> > +    migrate_send_rp_message(mis, MIG_RP_CMD_REQ_PAGES, msglen, bufc);
> 
> So.. what's the reason we actually need ramblock names on the wire,
> rather than working purely from GPAs?
> 
> It occurs to me that referencing ramblock names from the wire protocol
> exposes something that's kind of an internal detail, and may limit our
> options for reworking the memory subsystem in future.

RAMBlock names are already exposed on the wire in precopy migration in
the forward direction anyway (see save_page_header); however, there
are a few reasons:
  1) There's no guarantee that the page you are transmitting is currently
     mapped into the guest. The ACPI tables are never mapped.
  2) Aliases in GPA are allowed.

The only thing that's unique is a reference to the RAMBlock and an offset
within it.
(and yes, it does break when we change RAMBlock names, but that's normally
accidental - it did happen sometime around QEMU 1.6 when the PCI strings
were accidentally changed, with a knock-on effect of renaming RAMBlocks).
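
(For reference, the resulting REQ_PAGES message body, as I read the
patch above - a layout summary rather than new code:)

    bytes  0..7  : start (be64) - offset within the RAMBlock
    bytes  8..15 : len   (be64) - always a multiple of the page size,
                   so bit 0 is borrowed as a "name follows" flag and
                   cleared again by the receiver
    byte  16     : rbname length (only present if the flag is set)
    bytes 17..   : rbname, not NUL-terminated, up to 255 bytes;
                   if absent, the previous request's RAMBlock is reused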

> > +}
> > +
> >  void qemu_start_incoming_migration(const char *uri, Error **errp)
> >  {
> >      const char *p;
> > @@ -789,6 +819,17 @@ static void source_return_path_bad(MigrationState *s)
> >  }
> >  
> >  /*
> > + * Process a request for pages received on the return path,
> > + * We're allowed to send more than requested (e.g. to round to our page size)
> > + * and we don't need to send pages that have already been sent.
> > + */
> > +static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> > +                                       ram_addr_t start, ram_addr_t len)
> > +{
> > +    trace_migrate_handle_rp_req_pages(start, len);
> > +}
> > +
> > +/*
> >   * Handles messages sent on the return path towards the source VM
> >   *
> >   */
> > @@ -800,6 +841,8 @@ static void *source_return_path_thread(void *opaque)
> >      const int max_len = 512;
> >      uint8_t buf[max_len];
> >      uint32_t tmp32;
> > +    ram_addr_t start, len;
> > +    char *tmpstr;
> >      int res;
> >  
> >      trace_source_return_path_thread_entry();
> > @@ -815,6 +858,11 @@ static void *source_return_path_thread(void *opaque)
> >              expected_len = 4;
> >              break;
> >  
> > +        case MIG_RP_CMD_REQ_PAGES:
> > +            /* 16 byte start/len _possibly_ plus an id str */
> > +            expected_len = 16 + 256;
> 
> Isn't that the maximum length, rather than the minimum or typical length?

I'm just trying to be fairly paranoid prior to reading the message body.
If at this point we're working our way through garbage received off a
misaligned stream, and we fail to spot it in this switch, then we end up
doing a qemu_get_buffer on that bad length.  Checking only against a maximum
would make it safe against something malicious, but if the destination
hadn't actually sent that much data then we'd just block here and never get
around to reporting an error.  This way, for most command types we're
being pretty careful.
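
(In other words the two checks work together roughly like this -
paraphrasing the patch rather than quoting it, so the details may
differ slightly:)

    /* First switch, before the body is read: expected_len is the worst
     * case for this message type, used as an upper bound so a garbled
     * header can't make us wait in qemu_get_buffer for data that will
     * never arrive. */
    case MIG_RP_CMD_REQ_PAGES:
        expected_len = 16 + 256;       /* start/len plus optional idstr */
        break;
    ...
    if (header_len > expected_len) {
        source_return_path_bad(ms);
        break;
    }
    qemu_get_buffer(rp, buf, header_len);
    ...
    /* Second switch, after the body is read: only now do we know whether
     * an idstr was appended, so recompute expected_len exactly. */
    expected_len = (len & 1) ? 16 + 1 + buf[16] : 16;  /* len decoded from buf */
    if (header_len != expected_len) {
        error_report("RP: Req_Page with length %d expecting %d",
                     header_len, expected_len);
        source_return_path_bad(ms);
    }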


> > +            break;
> > +
> >          default:
> >              error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> >                      header_com, header_len);
> > @@ -860,6 +908,28 @@ static void *source_return_path_thread(void *opaque)
> >              trace_source_return_path_thread_pong(tmp32);
> >              break;
> >  
> > +        case MIG_RP_CMD_REQ_PAGES:
> > +            start = be64_to_cpup((uint64_t *)buf);
> > +            len = be64_to_cpup(((uint64_t *)buf)+1);
> > +            tmpstr = NULL;
> > +            if (len & 1) {
> > +                len -= 1; /* Remove the flag */
> > +                /* Now we expect an idstr */
> > +                tmp32 = buf[16]; /* Length of the following idstr */
> > +                tmpstr = (char *)&buf[17];
> > +                buf[17+tmp32] = '\0';
> > +                expected_len = 16+1+tmp32;
> > +            } else {
> > +                expected_len = 16;
> 
> Ah.. so expected_len is changed here.  But then what was the point of
> setting it in the earlier switch?

For the upper bound on the qemu_get_buffer.

Dave

> 
> > +            }
> > +            if (header_len != expected_len) {
> > +                error_report("RP: Req_Page with length %d expecting %d",
> > +                        header_len, expected_len);
> > +                source_return_path_bad(ms);
> > +            }
> > +            migrate_handle_rp_req_pages(ms, tmpstr, start, len);
> > +            break;
> > +
> >          default:
> >              /* This shouldn't happen because we should catch this above */
> >              trace_source_return_path_bad_header_com();
> > diff --git a/trace-events b/trace-events
> > index bcbdef8..9bedee4 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1404,6 +1404,7 @@ migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> >  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> >  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> > +migrate_handle_rp_req_pages(size_t start, size_t len) "at %zx for len %zx"
> >  migration_thread_after_loop(void) ""
> >  migration_thread_file_err(void) ""
> >  migration_thread_setup_complete(void) ""
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-03-11  1:54       ` David Gibson
@ 2015-03-25 18:47         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-25 18:47 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Tue, Mar 10, 2015 at 02:34:03PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:34PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Add migrate_send_rp_message to send a message from destination to source along the return path.
> > > >   (It uses a mutex to let it be called from multiple threads)
> > > > Add migrate_send_rp_shut to send a 'shut' message to indicate
> > > >   the destination is finished with the RP.
> > > > Add migrate_send_rp_ack to send a 'PONG' message in response to a PING
> > > >   Use it in the CMD_PING handler
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  include/migration/migration.h | 17 ++++++++++++++++
> > > >  migration/migration.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
> > > >  savevm.c                      |  2 +-
> > > >  trace-events                  |  1 +
> > > >  4 files changed, 64 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > index c514dd4..6775747 100644
> > > > --- a/include/migration/migration.h
> > > > +++ b/include/migration/migration.h
> > > > @@ -41,6 +41,13 @@ struct MigrationParams {
> > > >      bool shared;
> > > >  };
> > > >  
> > > > +/* Commands sent on the return path from destination to source*/
> > > > +enum mig_rpcomm_cmd {
> > > 
> > > "command" doesn't seem like quite the right description for these rp
> > > messages.
> > 
> > Would you prefer 'message' ?
> 
> Perhaps "message type" to distinguish from the the blob including both
> tag and data.

OK, done:

/* Messages sent on the return path from destination to source */
enum mig_rp_message_type {
    MIG_RP_MSG_INVALID = 0,  /* Must be 0 */

<snip>

> > > > +/*
> > > > + * Send a 'PONG' message on the return channel with the given value
> > > > + * (normally in response to a 'PING')
> > > > + */
> > > > +void migrate_send_rp_pong(MigrationIncomingState *mis,
> > > > +                          uint32_t value)
> > > > +{
> > > > +    uint32_t buf;
> > > > +
> > > > +    buf = cpu_to_be32(value);
> > > > +    migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
> > > 
> > > It occurs to me that you could define PONG as returning the whole
> > > buffer that PING sends, instead of just 4-bytes.  Might allow for some
> > > more testing of variable sized messages.
> > 
> > Yes; although it would complicate things a lot if I made it fully generic
> > because I'd have to worry about allocating a buffer etc and I'm not
> > making vast use of the 4 bytes I've already got.
> 
> Couldn't migrate_send_rp_pong just take a buf pointer and length, then
> you can point that directly at the buffer in the ping message you've
> received.

The buffer is a few levels down at that point, so it's non-trivial
to do that, whereas it's currently very simple to just pass
that fixed-length value; and it's only a bit of debug.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command
  2015-03-25 18:16     ` Dr. David Alan Gilbert
@ 2015-03-26  1:28       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-26  1:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 8250 bytes --]

On Wed, Mar 25, 2015 at 06:16:45PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:55PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Add MIG_RP_CMD_REQ_PAGES command on Return path for the postcopy
> > > destination to request a page from the source.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/migration.h |  4 +++
> > >  migration/migration.c         | 70 +++++++++++++++++++++++++++++++++++++++++++
> > >  trace-events                  |  1 +
> > >  3 files changed, 75 insertions(+)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index 2c607e7..2c15d63 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -46,6 +46,8 @@ enum mig_rpcomm_cmd {
> > >      MIG_RP_CMD_INVALID = 0,  /* Must be 0 */
> > >      MIG_RP_CMD_SHUT,         /* sibling will not send any more RP messages */
> > >      MIG_RP_CMD_PONG,         /* Response to a PING; data (seq: be32 ) */
> > > +
> > > +    MIG_RP_CMD_REQ_PAGES,    /* data (start: be64, len: be64) */
> > >  };
> > >  
> > >  /* Postcopy page-map-incoming - data about each page on the inbound side */
> > > @@ -253,6 +255,8 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
> > >                            uint32_t value);
> > >  void migrate_send_rp_pong(MigrationIncomingState *mis,
> > >                            uint32_t value);
> > > +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* rbname,
> > > +                              ram_addr_t start, ram_addr_t len);
> > >  
> > >  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
> > >  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index bd066f6..2e9d0dd 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -138,6 +138,36 @@ void migrate_send_rp_pong(MigrationIncomingState *mis,
> > >      migrate_send_rp_message(mis, MIG_RP_CMD_PONG, 4, (uint8_t *)&buf);
> > >  }
> > >  
> > > +/* Request a range of pages from the source VM at the given
> > > + * start address.
> > > + *   rbname: Name of the RAMBlock to request the page in, if NULL it's the same
> > > + *           as the last request (a name must have been given previously)
> > > + *   Start: Address offset within the RB
> > > + *   Len: Length in bytes required - must be a multiple of pagesize
> > > + */
> > > +void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char *rbname,
> > > +                               ram_addr_t start, ram_addr_t len)
> > > +{
> > > +    uint8_t bufc[16+1+255]; /* start (8 byte), len (8 byte), rbname upto 256 */
> > > +    uint64_t *buf64 = (uint64_t *)bufc;
> > > +    size_t msglen = 16; /* start + len */
> > > +
> > > +    assert(!(len & 1));
> > > +    if (rbname) {
> > > +        int rbname_len = strlen(rbname);
> > > +        assert(rbname_len < 256);
> > > +
> > > +        len |= 1; /* Flag to say we've got a name */
> > > +        bufc[msglen++] = rbname_len;
> > > +        memcpy(bufc + msglen, rbname, rbname_len);
> > > +        msglen += rbname_len;
> > > +    }
> > > +
> > > +    buf64[0] = cpu_to_be64((uint64_t)start);
> > > +    buf64[1] = cpu_to_be64((uint64_t)len);
> > > +    migrate_send_rp_message(mis, MIG_RP_CMD_REQ_PAGES, msglen, bufc);
> > 
> > So.. what's the reason we actually need ramblock names on the wire,
> > rather than working purely from GPAs?
> > 
> > It occurs to me that referencing ramblock names from the wire protocol
> > exposes something that's kind of an internal detail, and may limit our
> > options for reworking the memory subsystem in future.
> 
> RAMBlock names are already exposed on the wire in precopy migration in
> the forward direction anyway (see save_page_header), however there
> are a few reasons:
>   1) There's no guarantee that the page you are transmitting is currently
>      mapped into the guest. The ACPI tables are never mapped.
>   2) Aliases in GPA are allowed.
> 
> The only thing that's unique is a reference to the RAMBlock and an offset
> within it.
> (and yes, it does break when we change RAMBlock names but that's normally
> accidental - it did happen sometime around QEMU 1.6 when the PCI strings
> were accidentally changed with a knock on effect of renaming RAMBlocks).

Ah, right, good point.


> > > +}
> > > +
> > >  void qemu_start_incoming_migration(const char *uri, Error **errp)
> > >  {
> > >      const char *p;
> > > @@ -789,6 +819,17 @@ static void source_return_path_bad(MigrationState *s)
> > >  }
> > >  
> > >  /*
> > > + * Process a request for pages received on the return path,
> > > + * We're allowed to send more than requested (e.g. to round to our page size)
> > > + * and we don't need to send pages that have already been sent.
> > > + */
> > > +static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> > > +                                       ram_addr_t start, ram_addr_t len)
> > > +{
> > > +    trace_migrate_handle_rp_req_pages(start, len);
> > > +}
> > > +
> > > +/*
> > >   * Handles messages sent on the return path towards the source VM
> > >   *
> > >   */
> > > @@ -800,6 +841,8 @@ static void *source_return_path_thread(void *opaque)
> > >      const int max_len = 512;
> > >      uint8_t buf[max_len];
> > >      uint32_t tmp32;
> > > +    ram_addr_t start, len;
> > > +    char *tmpstr;
> > >      int res;
> > >  
> > >      trace_source_return_path_thread_entry();
> > > @@ -815,6 +858,11 @@ static void *source_return_path_thread(void *opaque)
> > >              expected_len = 4;
> > >              break;
> > >  
> > > +        case MIG_RP_CMD_REQ_PAGES:
> > > +            /* 16 byte start/len _possibly_ plus an id str */
> > > +            expected_len = 16 + 256;
> > 
> > Isn't that the maximum length, rather than the minimum or typical length?
> 
> I'm just trying to be fairly paranoid prior to reading the header.
> If at this point we're working our way through garbage received off a misaligned
> stream then if we fail to spot it in this switch then we end up doing
> a qemu_get_buffer on that bad length.  Checking it just for a maximum
> would make it safe against something malicious, but if the destination
> hadn't sent that much data then we'd just block here and never get around
> to reporting an error.  This way, for most command types we're
> being pretty careful.
> 
> 
> > > +            break;
> > > +
> > >          default:
> > >              error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> > >                      header_com, header_len);
> > > @@ -860,6 +908,28 @@ static void *source_return_path_thread(void *opaque)
> > >              trace_source_return_path_thread_pong(tmp32);
> > >              break;
> > >  
> > > +        case MIG_RP_CMD_REQ_PAGES:
> > > +            start = be64_to_cpup((uint64_t *)buf);
> > > +            len = be64_to_cpup(((uint64_t *)buf)+1);
> > > +            tmpstr = NULL;
> > > +            if (len & 1) {
> > > +                len -= 1; /* Remove the flag */
> > > +                /* Now we expect an idstr */
> > > +                tmp32 = buf[16]; /* Length of the following idstr */
> > > +                tmpstr = (char *)&buf[17];
> > > +                buf[17+tmp32] = '\0';
> > > +                expected_len = 16+1+tmp32;
> > > +            } else {
> > > +                expected_len = 16;
> > 
> > Ah.. so expected_len is changed here.  But then what was the point of
> > setting it in the earlier switch?
> 
> For the upper bound on the qemu_get_buffer.

Ok, I think I'd forgotten exactly how the checks worked from the
earlier patch.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request
  2015-03-25 17:37     ` Dr. David Alan Gilbert
@ 2015-03-26  1:31       ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-26  1:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 3146 bytes --]

On Wed, Mar 25, 2015 at 05:37:34PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:56PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > On receiving MIG_RPCOMM_REQ_PAGES look up the address and
> > > queue the page.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  arch_init.c                   | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >  include/exec/cpu-all.h        |  2 --
> > >  include/migration/migration.h | 21 +++++++++++++++++
> > >  include/qemu/typedefs.h       |  1 +
> > >  migration/migration.c         | 33 +++++++++++++++++++++++++-
> > >  trace-events                  |  3 ++-
> > >  6 files changed, 111 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch_init.c b/arch_init.c
> > > index d2c4457..9d8fc6b 100644
> > > --- a/arch_init.c
> > > +++ b/arch_init.c
> 
> <snip>
> 
> > >  static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> > >                                         ram_addr_t start, ram_addr_t len)
> > >  {
> > > -    trace_migrate_handle_rp_req_pages(start, len);
> > > +    trace_migrate_handle_rp_req_pages(rbname, start, len);
> > > +
> > > +    /* Round everything up to our host page size */
> > > +    long our_host_ps = getpagesize();
> > > +    if (start & (our_host_ps-1)) {
> > > +        long roundings = start & (our_host_ps-1);
> > > +        start -= roundings;
> > > +        len += roundings;
> > > +    }
> > > +    if (len & (our_host_ps-1)) {
> > > +        long roundings = len & (our_host_ps-1);
> > > +        len -= roundings;
> > > +        len += our_host_ps;
> > > +    }
> > 
> > Why is it necessary to round out to host page size on the source?  I
> > understand why the host page size is relevant on the destination, due
> > to the userfaultfd and atomic populate constraints, but not on the source.
> 
> In principle the request you get from the destination should already
> be nicely aligned; but of course you can't actually trust it, so you
> have to at least test for alignment.
> 
> Since the code has to send whole host pages to keep the
> destination happy, it expects the requests that come out of the queue
> to be host page aligned.

I don't follow.  It sounds like you'll only send non-aligned things if
the destination (incorrectly) requests them.  But in that case the
only thing that the destination will mess up is itself, so where's the
requirement to do anything on the source side?

> At the moment we're only supporting matching page sizes, if we wanted
> to support mismatches then it probably needs to round to the size of
> destination host page sizes.

And can't that effectively be done by just answering the requests
exactly as the destination makes them?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-25 16:40                   ` Dr. David Alan Gilbert
@ 2015-03-26  1:35                     ` David Gibson
  2015-03-26 11:44                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-26  1:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 5434 bytes --]

On Wed, Mar 25, 2015 at 04:40:11PM +0000, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > > 
> > > > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > > > non-postcopiable counts.
> > > > > > > > > > 
> > > > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > > > 
> > > > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > > > in this patch - is it still necessary with the change to
> > > > > > > > > save_live_pending?
> > > > > > > > 
> > > > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > > > to decide which devices must be completed at that point.
> > > > > > > 
> > > > > > > Couldn't they check for non-zero postcopiable state from
> > > > > > > save_live_pending instead?
> > > > > > 
> > > > > > That would be a bit weird.
> > > > > > 
> > > > > > At the moment for each device we call the:
> > > > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > > > 
> > > > > >    0...multiple times we call:
> > > > > >        save_live_pending
> > > > > >        save_live_iterate
> > > > > > 
> > > > > >    and then we always call
> > > > > >        save_live_complete
> > > > > > 
> > > > > > 
> > > > > > To my mind we have to call save_live_complete for any device
> > > > > > that we've called save_live_setup on (maybe it allocated something
> > > > > > in _setup that it clears up in _complete).
> > > > > > 
> > > > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > > > call save_live_complete.
> > > > > 
> > > > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > > > transition point you call save_live_complete for everything that
> > > > > reports 0 post-copiable state.
> > > > > 
> > > > > 
> > > > > Then again, a different approach would be to split the
> > > > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > > > state has both completion hooks called (unless NULL).  That might also
> > > > > address my concerns about the no longer entirely accurate
> > > > > save_live_complete function name.
> > > > 
> > > > OK, that one I prefer.  Are you OK with:
> > > >     qemu_savevm_state_complete_precopy
> > > >        calls -> save_live_complete_precopy
> > > > 
> > > >     qemu_savevm_state_complete_postcopy
> > > >        calls -> save_live_complete_postcopy
> > > > 
> > > > ?
> > > 
> > > Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> > > and complete_postcopy hooks should always be called.  For a
> > > non-postcopy migration, the postcopy hooks would just be called
> > > immediately after the precopy hooks.
> > 
> > OK, I've made the change as described in my last mail; but I haven't called
> > the complete_postcopy hook in the precopy case.  If it was as simple as making
> > all devices use one or the other then it would work, however there are
> > existing (precopy) assumptions about ordering of device state on the wire that
> > I want to be careful not to alter; for example RAM must come first is the one
> > I know.
> 
> Actually, I spoke too soon; testing this found a bad breakage.
> 
> the functions in savevm.c add the per-section headers, and then call the _complete
> methods on the devices.  Those _complete methods can't elect to do nothing, because
> a header has already been planted.

Hrm.. couldn't you move the test for presence of the hook earlier so
you don't send the header if the hook is NULL?

> I've ended up with something between the two;  we still have a complete_precopy and
> complete_postcopy method on the devices; if the complete_postcopy method exists and
> we're in postcopy mode, the complete_precopy method isn't called at all.
> A device could decide to do something different in complete_postcopy from complete_precopy
> but it must do something to complete the section.
> Effectively the presence of the complete_postcopy is now doing what
> can_postcopy() used to do.

Hmm.. but it means there's no per-device hook for the precopy to
postcopy transition point.  I'm not sure if that might matter.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-25 15:00                 ` Dr. David Alan Gilbert
  2015-03-25 16:40                   ` Dr. David Alan Gilbert
@ 2015-03-26  1:35                   ` David Gibson
  1 sibling, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-26  1:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

[-- Attachment #1: Type: text/plain, Size: 4245 bytes --]

On Wed, Mar 25, 2015 at 03:00:29PM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > 
> > > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > > non-postcopiable counts.
> > > > > > > > > 
> > > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > > 
> > > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > > in this patch - is it still necessary with the change to
> > > > > > > > save_live_pending?
> > > > > > > 
> > > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > > to decide which devices must be completed at that point.
> > > > > > 
> > > > > > Couldn't they check for non-zero postcopiable state from
> > > > > > save_live_pending instead?
> > > > > 
> > > > > That would be a bit weird.
> > > > > 
> > > > > At the moment for each device we call the:
> > > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > > 
> > > > >    0...multiple times we call:
> > > > >        save_live_pending
> > > > >        save_live_iterate
> > > > > 
> > > > >    and then we always call
> > > > >        save_live_complete
> > > > > 
> > > > > 
> > > > > To my mind we have to call save_live_complete for any device
> > > > > that we've called save_live_setup on (maybe it allocated something
> > > > > in _setup that it clears up in _complete).
> > > > > 
> > > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > > call save_live_complete.
> > > > 
> > > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > > transition point you call save_live_complete for everything that
> > > > reports 0 post-copiable state.
> > > > 
> > > > 
> > > > Then again, a different approach would be to split the
> > > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > > state has both completion hooks called (unless NULL).  That might also
> > > > address my concerns about the no longer entirely accurate
> > > > save_live_complete function name.
> > > 
> > > OK, that one I prefer.  Are you OK with:
> > >     qemu_savevm_state_complete_precopy
> > >        calls -> save_live_complete_precopy
> > > 
> > >     qemu_savevm_state_complete_postcopy
> > >        calls -> save_live_complete_postcopy
> > > 
> > > ?
> > 
> > Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> > and complete_postcopy hooks should always be called.  For a
> > non-postcopy migration, the postcopy hooks would just be called
> > immediately after the precopy hooks.
> 
> OK, I've made the change as described in my last mail; but I haven't called
> the complete_postcopy hook in the precopy case.  If it was as simple as making
> all devices use one or the other then it would work, however there are
> existing (precopy) assumptions about ordering of device state on the wire that
> I want to be careful not to alter; for example RAM must come first is the one
> I know.

It's not obvious to me why that matters to the hook scheme.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-03-23  4:20   ` David Gibson
@ 2015-03-26 11:05     ` Dr. David Alan Gilbert
  2015-03-30  8:31       ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-26 11:05 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:53PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Rework the migration thread to setup and start postcopy.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |   3 +
> >  migration/migration.c         | 161 ++++++++++++++++++++++++++++++++++++++++--
> >  trace-events                  |   4 ++
> >  3 files changed, 164 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 821d561..2c607e7 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -131,6 +131,9 @@ struct MigrationState
> >      /* Flag set once the migration has been asked to enter postcopy */
> >      bool start_postcopy;
> >  
> > +    /* Flag set once the migration thread is running (and needs joining) */
> > +    bool started_migration_thread;
> > +
> >      /* bitmap of pages that have been sent at least once
> >       * only maintained and used in postcopy at the moment
> >       * where it's used to send the dirtymap at the start
> > diff --git a/migration/migration.c b/migration/migration.c
> > index b1ad7b1..6bf9c8d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -468,7 +468,10 @@ static void migrate_fd_cleanup(void *opaque)
> >      if (s->file) {
> >          trace_migrate_fd_cleanup();
> >          qemu_mutex_unlock_iothread();
> > -        qemu_thread_join(&s->thread);
> > +        if (s->started_migration_thread) {
> > +            qemu_thread_join(&s->thread);
> > +            s->started_migration_thread = false;
> > +        }
> >          qemu_mutex_lock_iothread();
> >  
> >          qemu_fclose(s->file);
> > @@ -874,7 +877,6 @@ out:
> >      return NULL;
> >  }
> >  
> > -__attribute__ (( unused )) /* Until later in patch series */
> >  static int open_outgoing_return_path(MigrationState *ms)
> >  {
> >  
> > @@ -911,23 +913,141 @@ static void await_outgoing_return_path_close(MigrationState *ms)
> >  }
> >  
> >  /*
> > + * Switch from normal iteration to postcopy
> > + * Returns non-0 on error
> > + */
> > +static int postcopy_start(MigrationState *ms, bool *old_vm_running)
> > +{
> > +    int ret;
> > +    const QEMUSizedBuffer *qsb;
> > +    int64_t time_at_stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +    migrate_set_state(ms, MIG_STATE_ACTIVE, MIG_STATE_POSTCOPY_ACTIVE);
> > +
> > +    trace_postcopy_start();
> > +    qemu_mutex_lock_iothread();
> > +    trace_postcopy_start_set_run();
> > +
> > +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> > +    *old_vm_running = runstate_is_running();
> 
> I think that needs some explanation.  Why are you doing a wakeup on
> the source host?

This matches the existing code in migration_thread for the end of precopy;
Paolo's explanation of what it does is here:
https://lists.gnu.org/archive/html/qemu-devel/2014-08/msg04880.html

> > +    ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> > +
> > +    if (ret < 0) {
> > +        goto fail;
> > +    }
> > +
> > +    /*
> > +     * in Finish migrate and with the io-lock held everything should
> > +     * be quiet, but we've potentially still got dirty pages and we
> > +     * need to tell the destination to throw any pages it's already received
> > +     * that are dirty
> > +     */
> > +    if (ram_postcopy_send_discard_bitmap(ms)) {
> > +        error_report("postcopy send discard bitmap failed");
> > +        goto fail;
> > +    }
> > +
> > +    /*
> > +     * send rest of state - note things that are doing postcopy
> > +     * will notice we're in MIG_STATE_POSTCOPY_ACTIVE and not actually
> > +     * wrap their state up here
> > +     */
> > +    qemu_file_set_rate_limit(ms->file, INT64_MAX);
> > +    /* Ping just for debugging, helps line traces up */
> > +    qemu_savevm_send_ping(ms->file, 2);
> > +
> > +    /*
> > +     * We need to leave the fd free for page transfers during the
> > +     * loading of the device state, so wrap all the remaining
> > +     * commands and state into a package that gets sent in one go
> > +     */
> > +    QEMUFile *fb = qemu_bufopen("w", NULL);
> > +    if (!fb) {
> > +        error_report("Failed to create buffered file");
> > +        goto fail;
> > +    }
> > +
> > +    qemu_savevm_state_complete(fb);
> > +    qemu_savevm_send_ping(fb, 3);
> > +
> > +    qemu_savevm_send_postcopy_run(fb);
> > +
> > +    /* <><> end of stuff going into the package */
> > +    qsb = qemu_buf_get(fb);
> > +
> > +    /* Now send that blob */
> > +    if (qsb_get_length(qsb) > MAX_VM_CMD_PACKAGED_SIZE) {
> > +        error_report("postcopy_start: Unreasonably large packaged state: %lu",
> > +                     (unsigned long)(qsb_get_length(qsb)));
> > +        goto fail_closefb;
> > +    }
> > +    qemu_savevm_send_packaged(ms->file, qsb);
> > +    qemu_fclose(fb);
> > +    ms->downtime =  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - time_at_stop;
> > +
> > +    qemu_mutex_unlock_iothread();
> > +
> > +    /*
> > +     * Although this ping is just for debug, it could potentially be
> > +     * used for getting a better measurement of downtime at the source.
> > +     */
> > +    qemu_savevm_send_ping(ms->file, 4);
> > +
> > +    ret = qemu_file_get_error(ms->file);
> > +    if (ret) {
> > +        error_report("postcopy_start: Migration stream errored");
> > +        migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
> > +    }
> > +
> > +    return ret;
> > +
> > +fail_closefb:
> > +    qemu_fclose(fb);
> > +fail:
> > +    migrate_set_state(ms, MIG_STATE_POSTCOPY_ACTIVE, MIG_STATE_ERROR);
> > +    qemu_mutex_unlock_iothread();
> > +    return -1;
> > +}
> > +
> > +/*
> >   * Master migration thread on the source VM.
> >   * It drives the migration and pumps the data down the outgoing channel.
> >   */
> >  static void *migration_thread(void *opaque)
> >  {
> >      MigrationState *s = opaque;
> > +    /* Used by the bandwidth calcs, updated later */
> >      int64_t initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> >      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> >      int64_t initial_bytes = 0;
> >      int64_t max_size = 0;
> >      int64_t start_time = initial_time;
> >      bool old_vm_running = false;
> > +    bool entered_postcopy = false;
> > +    /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */
> > +    enum MigrationPhase current_active_type = MIG_STATE_ACTIVE;
> >  
> >      qemu_savevm_state_header(s->file);
> > +
> > +    if (migrate_postcopy_ram()) {
> > +        /* Now tell the dest that it should open its end so it can reply */
> > +        qemu_savevm_send_open_return_path(s->file);
> > +
> > +        /* And do a ping that will make stuff easier to debug */
> > +        qemu_savevm_send_ping(s->file, 1);
> > +
> > +        /*
> > +         * Tell the destination that we *might* want to do postcopy later;
> > +         * if the other end can't do postcopy it should fail now, nice and
> > +         * early.
> > +         */
> > +        qemu_savevm_send_postcopy_advise(s->file);
> > +    }
> > +
> >      qemu_savevm_state_begin(s->file, &s->params);
> >  
> >      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
> > +    current_active_type = MIG_STATE_ACTIVE;
> >      migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
> >  
> >      trace_migration_thread_setup_complete();
> > @@ -946,6 +1066,22 @@ static void *migration_thread(void *opaque)
> >              trace_migrate_pending(pending_size, max_size,
> >                                    pend_post, pend_nonpost);
> >              if (pending_size && pending_size >= max_size) {
> > +                /* Still a significant amount to transfer */
> > +
> > +                current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > +                if (migrate_postcopy_ram() &&
> > +                    s->state != MIG_STATE_POSTCOPY_ACTIVE &&
> > +                    pend_nonpost <= max_size &&
> > +                    atomic_read(&s->start_postcopy)) {
> > +
> > +                    if (!postcopy_start(s, &old_vm_running)) {
> > +                        current_active_type = MIG_STATE_POSTCOPY_ACTIVE;
> > +                        entered_postcopy = true;
> 
> Do you need entered_postcopy, or could you just use the existing
> MIG_STATE variable?

I need the separate flag, because it is used at the end of migration
(when the existing state is MIG_STATE_COMPLETED) to know that
there has been a postcopy stage, and to stop the recalculation
of the 'downtime', which was previously incorrect. See below.


> > +                    }
> > +
> > +                    continue;
> > +                }
> > +                /* Just another iteration step */
> >                  qemu_savevm_state_iterate(s->file);
> >              } else {
> >                  int ret;
> > @@ -975,7 +1111,8 @@ static void *migration_thread(void *opaque)
> >          }
> >  
> >          if (qemu_file_get_error(s->file)) {
> > -            migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR);
> > +            migrate_set_state(s, current_active_type, MIG_STATE_ERROR);
> > +            trace_migration_thread_file_err();
> >              break;
> >          }
> >          current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > @@ -1006,12 +1143,15 @@ static void *migration_thread(void *opaque)
> >          }
> >      }
> >  
> > +    trace_migration_thread_after_loop();
> >      qemu_mutex_lock_iothread();
> >      if (s->state == MIG_STATE_COMPLETED) {
> >          int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> >          uint64_t transferred_bytes = qemu_ftell(s->file);
> >          s->total_time = end_time - s->total_time;
> > -        s->downtime = end_time - start_time;
> > +        if (!entered_postcopy) {
> > +            s->downtime = end_time - start_time;
> > +        }

Here's the use of entered_postcopy, and you can see that s->state
is always MIG_STATE_COMPLETED here.

Dave

> >          if (s->total_time) {
> >              s->mbps = (((double) transferred_bytes * 8.0) /
> >                         ((double) s->total_time)) / 1000;
> > @@ -1043,8 +1183,21 @@ void migrate_fd_connect(MigrationState *s)
> >      /* Notify before starting migration thread */
> >      notifier_list_notify(&migration_state_notifiers, s);
> >  
> > +    /* Open the return path; currently for postcopy but other things might
> > +     * also want it.
> > +     */
> > +    if (migrate_postcopy_ram()) {
> > +        if (open_outgoing_return_path(s)) {
> > +            error_report("Unable to open return-path for postcopy");
> > +            migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ERROR);
> > +            migrate_fd_cleanup(s);
> > +            return;
> > +        }
> > +    }
> > +
> >      qemu_thread_create(&s->thread, "migration", migration_thread, s,
> >                         QEMU_THREAD_JOINABLE);
> > +    s->started_migration_thread = true;
> >  }
> >  
> >  PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> > diff --git a/trace-events b/trace-events
> > index 59dea4c..ed8bbe2 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1404,9 +1404,13 @@ migrate_fd_error(void) ""
> >  migrate_fd_cancel(void) ""
> >  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> >  migrate_send_rp_message(int cmd, uint16_t len) "cmd=%d, len=%d"
> > +migration_thread_after_loop(void) ""
> > +migration_thread_file_err(void) ""
> >  migration_thread_setup_complete(void) ""
> >  open_outgoing_return_path(void) ""
> >  open_outgoing_return_path_continue(void) ""
> > +postcopy_start(void) ""
> > +postcopy_start_set_run(void) ""
> >  source_return_path_thread_bad_end(void) ""
> >  source_return_path_bad_header_com(void) ""
> >  source_return_path_thread_end(void) ""
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-26  1:35                     ` David Gibson
@ 2015-03-26 11:44                       ` Dr. David Alan Gilbert
  2015-03-27  3:56                         ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-26 11:44 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Mar 25, 2015 at 04:40:11PM +0000, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > > > 
> > > > > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > > > > non-postcopiable counts.
> > > > > > > > > > > 
> > > > > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > > > > 
> > > > > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > > > > in this patch - is it still necessary with the change to
> > > > > > > > > > save_live_pending?
> > > > > > > > > 
> > > > > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > > > > to decide which devices must be completed at that point.
> > > > > > > > 
> > > > > > > > Couldn't they check for non-zero postcopiable state from
> > > > > > > > save_live_pending instead?
> > > > > > > 
> > > > > > > That would be a bit weird.
> > > > > > > 
> > > > > > > At the moment for each device we call the:
> > > > > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > > > > 
> > > > > > >    0...multiple times we call:
> > > > > > >        save_live_pending
> > > > > > >        save_live_iterate
> > > > > > > 
> > > > > > >    and then we always call
> > > > > > >        save_live_complete
> > > > > > > 
> > > > > > > 
> > > > > > > To my mind we have to call save_live_complete for any device
> > > > > > > that we've called save_live_setup on (maybe it allocated something
> > > > > > > in _setup that it clears up in _complete).
> > > > > > > 
> > > > > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > > > > call save_live_complete.
> > > > > > 
> > > > > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > > > > transition point you call save_live_complete for everything that
> > > > > > reports 0 post-copiable state.
> > > > > > 
> > > > > > 
> > > > > > Then again, a different approach would be to split the
> > > > > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > > > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > > > > state has both completion hooks called (unless NULL).  That might also
> > > > > > address my concerns about the no longer entirely accurate
> > > > > > save_live_complete function name.
> > > > > 
> > > > > OK, that one I prefer.  Are you OK with:
> > > > >     qemu_savevm_state_complete_precopy
> > > > >        calls -> save_live_complete_precopy
> > > > > 
> > > > >     qemu_savevm_state_complete_postcopy
> > > > >        calls -> save_live_complete_postcopy
> > > > > 
> > > > > ?
> > > > 
> > > > Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> > > > and complete_postcopy hooks should always be called.  For a
> > > > non-postcopy migration, the postcopy hooks would just be called
> > > > immediately after the precopy hooks.
> > > 
> > > OK, I've made the change as described in my last mail; but I haven't called
> > > the complete_postcopy hook in the precopy case.  If it was as simple as making
> > > all devices use one or the other then it would work, however there are
> > > existing (precopy) assumptions about ordering of device state on the wire that
> > > I want to be careful not to alter; for example RAM must come first is the one
> > > I know.
> > 
> > Actually, I spoke too soon; testing this found a bad breakage.
> > 
> > the functions in savevm.c add the per-section headers, and then call the _complete
> > methods on the devices.  Those _complete methods can't elect to do nothing, because
> > a header has already been planted.
> 
> Hrm.. couldn't you move the test for presence of the hook earlier so
> you don't send the header if the hook is NULL?

There are two tests that you have to make:
      a) in qemu_savevm_state_complete_precopy do you call save_live_complete_precopy
      b) in qemu_savevm_state_complete_postcopy do you call save_live_complete_postcopy

The obvious case is if either hook is NULL you don't call it.
(a) is the harder case: if we're doing postcopy then we don't want to call the
   save_live_complete_precopy method on a device which isn't expecting to complete until
   postcopy.
   The code in qemu_savevm_state_complete_precopy checks for the presence of the *postcopy*
   hook, and doesn't emit the header or call the precopy completion if the postcopy hook
   is present and we're in postcopy.
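
Roughly, that check ends up as something like the sketch below (the in_postcopy
parameter is a placeholder for however the postcopy-active state is tested, and
the hook names follow the split discussed above; this is not the exact code):

    static void qemu_savevm_state_complete_precopy(QEMUFile *f, bool in_postcopy)
    {
        SaveStateEntry *se;

        QTAILQ_FOREACH(se, &savevm_handlers, entry) {
            if (!se->ops || !se->ops->save_live_complete_precopy) {
                continue;
            }
            /* Devices with a postcopy completion hook finish later, in
             * qemu_savevm_state_complete_postcopy; skip them here so no
             * section header gets planted for them. */
            if (in_postcopy && se->ops->save_live_complete_postcopy) {
                continue;
            }
            /* Only now do we emit the per-section header... */
            qemu_put_byte(f, QEMU_VM_SECTION_END);
            qemu_put_be32(f, se->section_id);
            /* ...and let the device finish its precopy state. */
            se->ops->save_live_complete_precopy(f, se->opaque);
        }
    }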

> > I've ended up with something between the two;  we still have a complete_precopy and
> > complete_postcopy method on the devices; if the complete_postcopy method exists and
> > we're in postcopy mode, the complete_precopy method isn't called at all.
> > A device could decide to do something different in complete_postcopy from complete_precopy
> > but it must do something to complete the section.
> > Effectively the presence of the complete_postcopy is now doing what
> > can_postcopy() used to do.
> 
> Hmm.. but it means there's no per-device hook for the precopy to
> postcopy transition point.  I'm not sure if that might matter.

This is true, but if we needed a generic hook for that (which might be useful)
it probably shouldn't be 'complete'.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests
  2015-03-24  5:38   ` David Gibson
@ 2015-03-26 11:59     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-26 11:59 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:52:03PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > userfaultfd is a Linux syscall that gives an fd that receives a stream
> > of notifications of accesses to pages registered with it and allows
> > the program to acknowledge those stalls and tell the accessing
> > thread to carry on.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |   4 +
> >  migration/postcopy-ram.c      | 217 ++++++++++++++++++++++++++++++++++++++++--
> >  trace-events                  |  12 +++
> >  3 files changed, 223 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index 139bb1b..cec064f 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -86,11 +86,15 @@ struct MigrationIncomingState {
> >  
> >      PostcopyState postcopy_state;
> >  
> > +    bool           have_fault_thread;
> >      QemuThread     fault_thread;
> >      QemuSemaphore  fault_thread_sem;
> >  
> >      /* For the kernel to send us notifications */
> >      int            userfault_fd;
> > +    /* To tell the fault_thread to quit */
> > +    int            userfault_quit_fd;
> > +
> >      QEMUFile *return_path;
> >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> >      PostcopyPMI    postcopy_pmi;
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 86fa5a0..abc039e 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -47,6 +47,8 @@ struct PostcopyDiscardState {
> >   */
> >  #if defined(__linux__)
> >  
> > +#include <poll.h>
> > +#include <sys/eventfd.h>
> >  #include <sys/mman.h>
> >  #include <sys/ioctl.h>
> >  #include <sys/types.h>
> > @@ -264,7 +266,7 @@ void postcopy_pmi_dump(MigrationIncomingState *mis)
> >  void postcopy_hook_early_receive(MigrationIncomingState *mis,
> >                                   size_t bitmap_index)
> >  {
> > -    if (mis->postcopy_state == POSTCOPY_INCOMING_ADVISE) {
> > +    if (postcopy_state_get(mis) == POSTCOPY_INCOMING_ADVISE) {
> 
> It kind of looks like that's a fix which should be folded into an
> earlier patch.

Thanks; gone.

> 
> >          /*
> >           * If we're in precopy-advise mode we need to track received pages even
> >           * though we don't need to place pages atomically yet.
> > @@ -489,15 +491,40 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis, size_t ram_pages)
> >   */
> >  int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >  {
> > -    /* TODO: Join the fault thread once we're sure it will exit */
> > -    if (qemu_ram_foreach_block(cleanup_area, mis)) {
> > -        return -1;
> > +    trace_postcopy_ram_incoming_cleanup_entry();
> > +
> > +    if (mis->have_fault_thread) {
> > +        uint64_t tmp64;
> > +
> > +        if (qemu_ram_foreach_block(cleanup_area, mis)) {
> > +            return -1;
> > +        }
> > +        /*
> > +         * Tell the fault_thread to exit, it's an eventfd that should
> > +         * currently be at 0, we're going to inc it to 1
> > +         */
> > +        tmp64 = 1;
> > +        if (write(mis->userfault_quit_fd, &tmp64, 8) == 8) {
> > +            trace_postcopy_ram_incoming_cleanup_join();
> > +            qemu_thread_join(&mis->fault_thread);
> > +        } else {
> > +            /* Not much we can do here, but may as well report it */
> > +            perror("incing userfault_quit_fd");
> > +        }
> > +        trace_postcopy_ram_incoming_cleanup_closeuf();
> > +        close(mis->userfault_fd);
> > +        close(mis->userfault_quit_fd);
> > +        mis->have_fault_thread = false;
> >      }
> >  
> > +    postcopy_state_set(mis, POSTCOPY_INCOMING_END);
> > +    migrate_send_rp_shut(mis, qemu_file_get_error(mis->file) != 0);
> > +
> >      if (mis->postcopy_tmp_page) {
> >          munmap(mis->postcopy_tmp_page, getpagesize());
> >          mis->postcopy_tmp_page = NULL;
> >      }
> > +    trace_postcopy_ram_incoming_cleanup_exit();
> >      return 0;
> >  }
> >  
> > @@ -531,36 +558,206 @@ static int ram_block_enable_notify(const char *block_name, void *host_addr,
> >  }
> >  
> >  /*
> > + * Tell the kernel that we've now got some memory it previously asked for.
> > + */
> > +static int ack_userfault(MigrationIncomingState *mis, void *start, size_t len)
> > +{
> > +    struct uffdio_range range_struct;
> > +
> > +    range_struct.start = (uint64_t)(uintptr_t)start;
> > +    range_struct.len = (uint64_t)len;
> > +
> > +    errno = 0;
> > +    if (ioctl(mis->userfault_fd, UFFDIO_WAKE, &range_struct)) {
> > +        int e = errno;
> > +
> > +        if (e == ENOENT) {
> > +            /* Kernel said it wasn't waiting - one case where this can
> > +             * happen is where two threads triggered the userfault
> > +             * and we receive the page and ack it just after we received
> > +             * the 2nd request and that ends up deciding it should ack it
> > +             * We could optimise it out, but it's rare.
> > +             */
> > +            /*fprintf(stderr, "ack_userfault: %p/%zx ENOENT\n", start, len); */
> > +            return 0;
> > +        }
> > +        error_report("postcopy_ram: Failed to notify kernel for %p/%zx (%d)",
> > +                     start, len, e);
> > +        return -e;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/*
> >   * Handle faults detected by the USERFAULT markings
> >   */
> >  static void *postcopy_ram_fault_thread(void *opaque)
> >  {
> >      MigrationIncomingState *mis = (MigrationIncomingState *)opaque;
> > -
> > -    fprintf(stderr, "postcopy_ram_fault_thread\n");
> > -    /* TODO: In later patch */
> > +    uint64_t hostaddr; /* The kernel always gives us 64 bit, not a pointer */
> > +    int ret;
> > +    size_t hostpagesize = getpagesize();
> > +    RAMBlock *rb = NULL;
> > +    RAMBlock *last_rb = NULL; /* last RAMBlock we sent part of */
> > +    uint8_t *local_tmp_page;
> > +
> > +    trace_postcopy_ram_fault_thread_entry();
> >      qemu_sem_post(&mis->fault_thread_sem);
> > -    while (1) {
> > -        /* TODO: In later patch */
> > +
> > +    local_tmp_page = mmap(NULL, getpagesize(),
> > +                          PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS,
> > +                          -1, 0);
> > +    if (!local_tmp_page) {
> > +        perror("mapping local tmp page");
> > +        return NULL;
> >      }
> > +    if (madvise(local_tmp_page, getpagesize(), MADV_DONTFORK)) {
> > +        munmap(local_tmp_page, getpagesize());
> > +        perror("postcpy local page DONTFORK");
> > +        return NULL;
> > +    }
> > +
> > +    while (true) {
> > +        PostcopyPMIState old_state, tmp_state;
> > +        ram_addr_t rb_offset;
> > +        ram_addr_t in_raspace;
> > +        unsigned long bitmap_index;
> > +        struct pollfd pfd[2];
> > +
> > +        /*
> > +         * We're mainly waiting for the kernel to give us a faulting HVA,
> > +         * however we can be told to quit via userfault_quit_fd which is
> > +         * an eventfd
> > +         */
> > +        pfd[0].fd = mis->userfault_fd;
> > +        pfd[0].events = POLLIN;
> > +        pfd[0].revents = 0;
> > +        pfd[1].fd = mis->userfault_quit_fd;
> > +        pfd[1].events = POLLIN; /* Waiting for eventfd to go positive */
> > +        pfd[1].revents = 0;
> > +
> > +        if (poll(pfd, 2, -1 /* Wait forever */) == -1) {
> > +            perror("userfault poll");
> > +            break;
> > +        }
> > +
> > +        if (pfd[1].revents) {
> > +            trace_postcopy_ram_fault_thread_quit();
> > +            break;
> > +        }
> > +
> > +        ret = read(mis->userfault_fd, &hostaddr, sizeof(hostaddr));
> > +        if (ret != sizeof(hostaddr)) {
> > +            if (ret < 0) {
> > +                perror("Failed to read full userfault hostaddr");
> > +                break;
> > +            } else {
> > +                error_report("%s: Read %d bytes from userfaultfd expected %zd",
> > +                             __func__, ret, sizeof(hostaddr));
> > +                break; /* Lost alignment, don't know what we'd read next */
> > +            }
> > +        }
> > +
> > +        rb = qemu_ram_block_from_host((void *)(uintptr_t)hostaddr, true,
> > +                                      &in_raspace, &rb_offset, &bitmap_index);
> > +        if (!rb) {
> > +            error_report("postcopy_ram_fault_thread: Fault outside guest: %"
> > +                         PRIx64, hostaddr);
> > +            break;
> > +        }
> >  
> > +        trace_postcopy_ram_fault_thread_request(hostaddr, bitmap_index,
> > +                                                qemu_ram_get_idstr(rb),
> > +                                                rb_offset);
> > +
> > +        tmp_state = postcopy_pmi_get_state(mis, bitmap_index);
> > +        do {
> > +            old_state = tmp_state;
> > +
> > +            switch (old_state) {
> > +            case POSTCOPY_PMI_REQUESTED:
> > +                /* Do nothing - it's already requested */
> > +                break;
> > +
> > +            case POSTCOPY_PMI_RECEIVED:
> > +                /* Already arrived - no state change, just kick the kernel */
> > +                trace_postcopy_ram_fault_thread_notify_pre(hostaddr);
> > +                if (ack_userfault(mis,
> > +                                  (void *)((uintptr_t)hostaddr
> > +                                           & ~(hostpagesize - 1)),
> > +                                  hostpagesize)) {
> > +                    assert(0);
> > +                }
> > +                break;
> > +
> > +            case POSTCOPY_PMI_MISSING:
> > +                tmp_state = postcopy_pmi_change_state(mis, bitmap_index,
> > +                                           old_state, POSTCOPY_PMI_REQUESTED);
> > +                if (tmp_state == POSTCOPY_PMI_MISSING) {
> > +                    /*
> > +                     * Send the request to the source - we want to request one
> > +                     * of our host page sizes (which is >= TPS)
> > +                     */
> > +                    if (rb != last_rb) {
> > +                        last_rb = rb;
> > +                        migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
> > +                                                 rb_offset, hostpagesize);
> > +                    } else {
> > +                        /* Save some space */
> > +                        migrate_send_rp_req_pages(mis, NULL,
> > +                                                 rb_offset, hostpagesize);
> > +                    }
> > +                } /* else it just arrived from the source and the kernel will
> > +                     be kicked during the receive */
> > +                break;
> > +           }
> > +        } while (tmp_state != old_state);
> 
> Again, I think using separate requested/received bits rather than
> treating them as a single state could avoid this clunky loop.

As mentioned; the PMI is simplified out so this all disappears.

> > +    }
> > +    munmap(local_tmp_page, getpagesize());
> > +    trace_postcopy_ram_fault_thread_exit();
> >      return NULL;
> >  }
> >  
> >  int postcopy_ram_enable_notify(MigrationIncomingState *mis)
> >  {
> > -    /* Create the fault handler thread and wait for it to be ready */
> > +    /* Open the fd for the kernel to give us userfaults */
> > +    mis->userfault_fd = syscall(__NR_userfaultfd, O_CLOEXEC);
> 
> I think it would be good to declare your own userfaultfd() wrappers
> around syscall().  That way it will be easier to clean up once libc
> knows about them.

OK, I'll look at that; this might depend on whether I find a better way of
dealing with a new syscall definition, as mentioned in the other patch.
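For what it's worth, such a wrapper would presumably be just a thin shim over
syscall(); a minimal sketch, assuming __NR_userfaultfd comes from the kernel
headers:

    #include <errno.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Thin wrapper; goes away once libc provides a real userfaultfd() */
    static int userfaultfd(int flags)
    {
    #ifdef __NR_userfaultfd
        return syscall(__NR_userfaultfd, flags);
    #else
        errno = ENOSYS;
        return -1;
    #endif
    }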

Dave

> > +    if (mis->userfault_fd == -1) {
> > +        perror("Failed to open userfault fd");
> > +        return -1;
> > +    }
> > +
> > +    /*
> > +     * Although the host check already tested the API, we need to
> > +     * do the check again as an ABI handshake on the new fd.
> > +     */
> > +    if (!ufd_version_check(mis->userfault_fd)) {
> > +        return -1;
> > +    }
> > +
> > +    /* Now an eventfd we use to tell the fault-thread to quit */
> > +    mis->userfault_quit_fd = eventfd(0, EFD_CLOEXEC);
> > +    if (mis->userfault_quit_fd == -1) {
> > +        perror("Opening userfault_quit_fd");
> > +        close(mis->userfault_fd);
> > +        return -1;
> > +    }
> > +
> >      qemu_sem_init(&mis->fault_thread_sem, 0);
> >      qemu_thread_create(&mis->fault_thread, "postcopy/fault",
> >                         postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINABLE);
> >      qemu_sem_wait(&mis->fault_thread_sem);
> >      qemu_sem_destroy(&mis->fault_thread_sem);
> > +    mis->have_fault_thread = true;
> >  
> >      /* Mark so that we get notified of accesses to unwritten areas */
> >      if (qemu_ram_foreach_block(ram_block_enable_notify, mis)) {
> >          return -1;
> >      }
> >  
> > +    trace_postcopy_ram_enable_notify();
> > +
> >      return 0;
> >  }
> >  
> > diff --git a/trace-events b/trace-events
> > index 16a91d9..d955a28 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1498,6 +1498,18 @@ postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s ma
> >  postcopy_cleanup_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
> >  postcopy_init_area(const char *ramblock, void *host_addr, size_t offset, size_t length) "%s: %p offset=%zx length=%zx"
> >  postcopy_place_page(unsigned long offset, void *host_addr, bool all_zero, int old_state) "offset=%lx host=%p all_zero=%d old_state=%d"
> > +postcopy_ram_enable_notify(void) ""
> > +postcopy_ram_fault_thread_entry(void) ""
> > +postcopy_ram_fault_thread_exit(void) ""
> > +postcopy_ram_fault_thread_quit(void) ""
> > +postcopy_ram_fault_thread_request(uint64_t hostaddr, unsigned long index, const char *ramblock, size_t offset) "Request for HVA=%" PRIx64 " index=%lx rb=%s offset=%zx"
> > +postcopy_ram_fault_thread_notify_pre(uint64_t hostaddr) "%" PRIx64
> > +postcopy_ram_fault_thread_notify_zero(void *hostaddr) "%p"
> > +postcopy_ram_fault_thread_notify_zero_ack(void *hostaddr, unsigned long bitmap_index) "%p %lx"
> > +postcopy_ram_incoming_cleanup_closeuf(void) ""
> > +postcopy_ram_incoming_cleanup_entry(void) ""
> > +postcopy_ram_incoming_cleanup_exit(void) ""
> > +postcopy_ram_incoming_cleanup_join(void) ""
> >  
> >  # kvm-all.c
> >  kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-12  9:30   ` David Gibson
@ 2015-03-26 16:33     ` Dr. David Alan Gilbert
  2015-03-27  4:13       ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-26 16:33 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

(Only replying to some of the items in this mail - the others I'll get
to another time).

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:40PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The state of the postcopy process is managed via a series of messages;
> >    * Add wrappers and handlers for sending/receiving these messages
> >    * Add state variable that track the current state of postcopy
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/migration/migration.h |  15 ++
> >  include/sysemu/sysemu.h       |  23 +++
> >  migration/migration.c         |  13 ++
> >  savevm.c                      | 325 ++++++++++++++++++++++++++++++++++++++++++
> >  trace-events                  |  11 ++
> >  5 files changed, 387 insertions(+)
> > 
> > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > index f94af5b..81cd1f2 100644
> > --- a/include/migration/migration.h
> > +++ b/include/migration/migration.h
> > @@ -52,6 +52,14 @@ typedef struct MigrationState MigrationState;
> >  
> >  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> >  
> > +typedef enum {
> > +    POSTCOPY_INCOMING_NONE = 0,  /* Initial state - no postcopy */
> > +    POSTCOPY_INCOMING_ADVISE,
> > +    POSTCOPY_INCOMING_LISTENING,
> > +    POSTCOPY_INCOMING_RUNNING,
> > +    POSTCOPY_INCOMING_END
> > +} PostcopyState;
> > +
> >  /* State for the incoming migration */
> >  struct MigrationIncomingState {
> >      QEMUFile *file;
> > @@ -59,6 +67,8 @@ struct MigrationIncomingState {
> >      /* See savevm.c */
> >      LoadStateEntry_Head loadvm_handlers;
> >  
> > +    PostcopyState postcopy_state;
> > +
> >      QEMUFile *return_path;
> >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> >  };
> > @@ -219,4 +229,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
> >                               ram_addr_t offset, size_t size,
> >                               int *bytes_sent);
> >  
> > +PostcopyState postcopy_state_get(MigrationIncomingState *mis);
> > +
> > +/* Set the state and return the old state */
> > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > +                                 PostcopyState new_state);
> >  #endif
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 8da879f..d6a6d51 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -87,6 +87,18 @@ enum qemu_vm_cmd {
> >      MIG_CMD_INVALID = 0,       /* Must be 0 */
> >      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
> >      MIG_CMD_PING,              /* Request a PONG on the RP */
> > +
> > +    MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
> > +                                      warn we might want to do PC */
> > +    MIG_CMD_POSTCOPY_LISTEN,       /* Start listening for incoming
> > +                                      pages as it's running. */
> > +    MIG_CMD_POSTCOPY_RUN,          /* Start execution */
> > +    MIG_CMD_POSTCOPY_END,          /* Postcopy is finished. */
> > +
> > +    MIG_CMD_POSTCOPY_RAM_DISCARD,  /* A list of pages to discard that
> > +                                      were previously sent during
> > +                                      precopy but are dirty. */
> > +
> >  };
> >  
> >  bool qemu_savevm_state_blocked(Error **errp);
> > @@ -101,6 +113,17 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
> >                                uint16_t len, uint8_t *data);
> >  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> >  void qemu_savevm_send_open_return_path(QEMUFile *f);
> > +void qemu_savevm_send_postcopy_advise(QEMUFile *f);
> > +void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> > +void qemu_savevm_send_postcopy_run(QEMUFile *f);
> > +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status);
> > +
> > +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> > +                                           uint16_t len, uint8_t offset,
> > +                                           uint64_t *addrlist,
> > +                                           uint32_t *masklist);
> > +
> > +
> >  int qemu_loadvm_state(QEMUFile *f);
> >  
> >  /* SLIRP */
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 434864a..957115a 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -971,3 +971,16 @@ void migrate_fd_connect(MigrationState *s)
> >      qemu_thread_create(&s->thread, "migration", migration_thread, s,
> >                         QEMU_THREAD_JOINABLE);
> >  }
> > +
> > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> > +{
> > +    return atomic_fetch_add(&mis->postcopy_state, 0);
> > +}
> > +
> > +/* Set the state and return the old state */
> > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > +                                 PostcopyState new_state)
> > +{
> > +    return atomic_xchg(&mis->postcopy_state, new_state);
> 
> Is there anything explaining what the overall atomicity requirements
> are for this state variable?  It's a bit hard to tell if an atomic
> xchg is necessary or sufficient without a description of what the
> overall concurrency scheme is with regards to this variable.

Can you tell me how to define the requirements?
It's a state variable tested and changed by at least two threads and
it's got to go through a correct sequence of states.
So generally you're doing an 'I expect to be in .... now change to ....',
so the exchange works well for that.
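
A sketch of that pattern (postcopy_expect_and_set() is a hypothetical helper
for illustration, not something in the series):

    static int postcopy_expect_and_set(MigrationIncomingState *mis,
                                       PostcopyState expected,
                                       PostcopyState next)
    {
        /* The atomic exchange both makes the transition and returns the
         * previous state, so the caller can verify where it came from. */
        PostcopyState old = postcopy_state_set(mis, next);

        if (old != expected) {
            error_report("postcopy: in state %d, expected %d", old, expected);
            return -1;
        }
        return 0;
    }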

> > + *  n x
> > + *      be64   Page addresses for start of an invalidation range
> > + *      be32   mask of 32 pages, '1' to discard'
> 
> Is the extra compactness from this semi-sparse bitmap encoding
> actually worth it?  A simple list of page addresses, or address ranges
> to discard would be substantially simpler to get one's head around,
> and also seems like it might be more robust against future
> implementation changes as a wire format.

As previously discussed I really think it is; what I'm tending to
see when I've been looking at these in debug is something that's
sparse but tends to be blobby, with sets of pages discarded near
each other.  However you do this, you're going to have to walk this
bitmap and format out some sort of set of messages.
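
For illustration, walking a dirty bitmap into the (be64 address, be32 mask)
pairs the command carries would look roughly like this sketch (function and
parameter names are made up for the example; the caller must size the output
arrays for the worst case):

    static uint16_t build_discard_entries(const unsigned long *bitmap,
                                          unsigned long nr_pages,
                                          uint64_t page_size,
                                          uint64_t *addrlist,
                                          uint32_t *masklist)
    {
        unsigned long page = 0;
        uint16_t count = 0;

        while ((page = find_next_bit(bitmap, nr_pages, page)) < nr_pages) {
            /* Group pages 32 at a time, aligned to the group start */
            unsigned long base = page & ~31UL;
            uint32_t mask = 0;
            unsigned long i;

            for (i = base; i < base + 32 && i < nr_pages; i++) {
                if (test_bit(i, bitmap)) {
                    mask |= 1u << (i - base);
                }
            }
            addrlist[count] = (uint64_t)base * page_size;
            masklist[count] = mask;
            count++;
            page = base + 32;
        }
        return count;
    }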

> > + *  Hopefully this is pretty sparse so we don't get too many entries,
> > + *  and using the mask should deal with most pagesize differences
> > + *  just ending up as a single full mask
> > + *
> > + * The mask is always 32bits irrespective of the long size
> > + *
> > + *  name:  RAMBlock name that these entries are part of
> > + *  len: Number of page entries
> > + *  addrlist: 'len' addresses
> > + *  masklist: 'len' masks (corresponding to the addresses)
> > + */
> > +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> > +                                           uint16_t len, uint8_t offset,
> > +                                           uint64_t *addrlist,
> > +                                           uint32_t *masklist)
> > +{
> > +    uint8_t *buf;
> > +    uint16_t tmplen;
> > +    uint16_t t;
> > +
> > +    trace_qemu_savevm_send_postcopy_ram_discard();
> > +    buf = g_malloc0(len*12 + strlen(name) + 3);
> > +    buf[0] = 0; /* Version */
> > +    buf[1] = offset;
> > +    assert(strlen(name) < 256);
> > +    buf[2] = strlen(name);
> > +    memcpy(buf+3, name, strlen(name));
> > +    tmplen = 3+strlen(name);
> 
> Repeated calls to strlen() always seem icky to me, although I guess
> it's all gcc builtins here, so they are probably optimized out by
> CSE.

Yeh, those are now gone.  Thanks.

> > +    for (t = 0; t < len; t++) {
> > +        cpu_to_be64w((uint64_t *)(buf + tmplen), addrlist[t]);
> > +        tmplen += 8;
> > +        cpu_to_be32w((uint32_t *)(buf + tmplen), masklist[t]);
> > +        tmplen += 4;
> > +    }
> > +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RAM_DISCARD, tmplen, buf);
> > +    g_free(buf);
> > +}
> > +
> > +/* Get the destination into a state where it can receive postcopy data. */
> > +void qemu_savevm_send_postcopy_listen(QEMUFile *f)
> > +{
> > +    trace_savevm_send_postcopy_listen();
> > +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_LISTEN, 0, NULL);
> > +}
> > +
> > +/* Kick the destination into running */
> > +void qemu_savevm_send_postcopy_run(QEMUFile *f)
> > +{
> > +    trace_savevm_send_postcopy_run();
> > +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_RUN, 0, NULL);
> > +}
> 
> DISCARD will typically immediately precede LISTEN, won't it?  Is there
> a reason not to put the discard data into the LISTEN command?

Discard data can be quite large, so I potentially send multiple discard
commands.
(Also, as you can tell, I've generally got a preference for one message doing
one thing, and thus I have tried to keep them separate).
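
A sketch of what that chunking looks like on the send side
(MAX_DISCARDS_PER_COMMAND and the flush helper are illustrative, not the
series' actual code):

    /* Illustrative cap on entries per MIG_CMD_POSTCOPY_RAM_DISCARD */
    #define MAX_DISCARDS_PER_COMMAND 12

    static void discard_flush(QEMUFile *f, const char *name,
                              uint64_t *addrs, uint32_t *masks,
                              uint16_t *count)
    {
        if (*count) {
            /* offset argument left as 0 here purely for illustration */
            qemu_savevm_send_postcopy_ram_discard(f, name, *count, 0,
                                                  addrs, masks);
            *count = 0;
        }
    }

    /* Caller: while walking the bitmap, append to addrs/masks and call
     * discard_flush() whenever *count reaches MAX_DISCARDS_PER_COMMAND,
     * then once more at the end of the RAMBlock. */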

> > +
> > +/* End of postcopy - with a status byte; 0 is good, anything else is a fail */
> > +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status)
> > +{
> > +    trace_savevm_send_postcopy_end();
> > +    qemu_savevm_command_send(f, MIG_CMD_POSTCOPY_END, 1, &status);
> > +}
> 
> What's the distinction between the postcopy END command and the normal
> end of the migration stream?  Is there already a way to detect the end
> of stream normally?

OK, thanks I've killed that off.
The short answer is that the 'end of migration stream' is just a terminating
byte, and I was hoping to put something better on there with a status
from the source side; but that can be a general fix and doesn't
need to be specific to postcopy.

> 
> >  bool qemu_savevm_state_blocked(Error **errp)
> >  {
> >      SaveStateEntry *se;
> > @@ -961,6 +1046,212 @@ enum LoadVMExitCodes {
> >  
> >  static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> >  
> > +/* ------ incoming postcopy messages ------ */
> > +/* 'advise' arrives before any transfers just to tell us that a postcopy
> > + * *might* happen - it might be skipped if precopy transferred everything
> > + * quickly.
> > + */
> > +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> > +                                         uint64_t remote_hps,
> > +                                         uint64_t remote_tps)
> > +{
> > +    PostcopyState ps = postcopy_state_get(mis);
> > +    trace_loadvm_postcopy_handle_advise();
> > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
> > +        return -1;
> > +    }
> > +
> > +    if (remote_hps != getpagesize())  {
> > +        /*
> > +         * Some combinations of mismatch are probably possible but it gets
> > +         * a bit more complicated.  In particular we need to place whole
> > +         * host pages on the dest at once, and we need to ensure that we
> > +         * handle dirtying to make sure we never end up sending part of
> > +         * a hostpage on it's own.
> > +         */
> > +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
> > +                     (int)remote_hps, getpagesize());
> > +        return -1;
> > +    }
> > +
> > +    if (remote_tps != (1ul << qemu_target_page_bits())) {
> > +        /*
> > +         * Again, some differences could be dealt with, but for now keep it
> > +         * simple.
> > +         */
> > +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
> > +                     (int)remote_tps, 1 << qemu_target_page_bits());
> > +        return -1;
> > +    }
> > +
> > +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
> 
> Should you be checking the return value here to make sure it's still
> POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if you still have
> a race between the fetch at the top and the set here.
> 
> Or, in fact, should you be just doing an atomic exchange-then-check at
> the top, rather than checking at the top, then changing at the bottom.

There's no race at this point yet; going from None->advise we still only
have one thread.  The check at the top is a check against protocol
violations (e.g. getting two ADVISE messages or something like that).

> > +    return 0;
> > +}
> > +
> > +/* After postcopy we will be told to throw some pages away since they're
> > + * dirty and will have to be demand fetched.  Must happen before CPU is
> > + * started.
> > + * There can be 0..many of these messages, each encoding multiple pages.
> > + * Bits set in the message represent a page in the source VMs bitmap, but
> > + * since the guest/target page sizes can be different on s/d then we have
> > + * to convert.
> 
> Uh.. I thought the checks in the ADVISE processing eliminated that possibility.

Old message; I was originally trying to keep it more general, but ripped
the conversions out when it became just too messy for now.
Gone.


Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-26 11:44                       ` Dr. David Alan Gilbert
@ 2015-03-27  3:56                         ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-27  3:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Thu, Mar 26, 2015 at 11:44:36AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Mar 25, 2015 at 04:40:11PM +0000, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > On Tue, Mar 24, 2015 at 08:04:14PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > On Fri, Mar 20, 2015 at 12:37:59PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > > On Fri, Mar 13, 2015 at 10:19:54AM +0000, Dr. David Alan Gilbert wrote:
> > > > > > > > > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > > > > > > > > On Wed, Feb 25, 2015 at 04:51:43PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > > > > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > > > > > > > > 
> > > > > > > > > > > > Modify save_live_pending to return separate postcopiable and
> > > > > > > > > > > > non-postcopiable counts.
> > > > > > > > > > > > 
> > > > > > > > > > > > Add 'can_postcopy' to allow a device to state if it can postcopy
> > > > > > > > > > > 
> > > > > > > > > > > What's the purpose of the can_postcopy callback?  There are no callers
> > > > > > > > > > > in this patch - is it still necessary with the change to
> > > > > > > > > > > save_live_pending?
> > > > > > > > > > 
> > > > > > > > > > The patch 'qemu_savevm_state_complete: Postcopy changes' uses
> > > > > > > > > > it in qemu_savevm_state_postcopy_complete and qemu_savevm_state_complete
> > > > > > > > > > to decide which devices must be completed at that point.
> > > > > > > > > 
> > > > > > > > > Couldn't they check for non-zero postcopiable state from
> > > > > > > > > save_live_pending instead?
> > > > > > > > 
> > > > > > > > That would be a bit weird.
> > > > > > > > 
> > > > > > > > At the moment for each device we call the:
> > > > > > > >        save_live_setup method (from qemu_savevm_state_begin)
> > > > > > > > 
> > > > > > > >    0...multiple times we call:
> > > > > > > >        save_live_pending
> > > > > > > >        save_live_iterate
> > > > > > > > 
> > > > > > > >    and then we always call
> > > > > > > >        save_live_complete
> > > > > > > > 
> > > > > > > > 
> > > > > > > > To my mind we have to call save_live_complete for any device
> > > > > > > > that we've called save_live_setup on (maybe it allocated something
> > > > > > > > in _setup that it clears up in _complete).
> > > > > > > > 
> > > > > > > > save_live_pending could perfectly well return 0 remaining at the end of
> > > > > > > > the migrate for our device, and thus if we used that then we wouldn't
> > > > > > > > call save_live_complete.
> > > > > > > 
> > > > > > > Um.. I don't follow.  I was suggesting that at the precopy->postcopy
> > > > > > > transition point you call save_live_complete for everything that
> > > > > > > reports 0 post-copiable state.
> > > > > > > 
> > > > > > > 
> > > > > > > Then again, a different approach would be to split the
> > > > > > > save_live_complete hook into (possibly NULL) "complete precopy" and
> > > > > > > "complete postcopy" hooks.  The core would ensure that every chunk of
> > > > > > > state has both completion hooks called (unless NULL).  That might also
> > > > > > > address my concerns about the no longer entirely accurate
> > > > > > > save_live_complete function name.
> > > > > > 
> > > > > > OK, that one I prefer.  Are you OK with:
> > > > > >     qemu_savevm_state_complete_precopy
> > > > > >        calls -> save_live_complete_precopy
> > > > > > 
> > > > > >     qemu_savevm_state_complete_postcopy
> > > > > >        calls -> save_live_complete_postcopy
> > > > > > 
> > > > > > ?
> > > > > 
> > > > > Sounds ok to me.  Fwiw, I was thinking that both the complete_precopy
> > > > > and complete_postcopy hooks should always be called.  For a
> > > > > non-postcopy migration, the postcopy hooks would just be called
> > > > > immediately after the precopy hooks.
> > > > 
> > > > OK, I've made the change as described in my last mail; but I haven't called
> > > > the complete_postcopy hook in the precopy case.  If it was as simple as making
> > > > all devices use one or the other then it would work, however there are
> > > > existing (precopy) assumptions about ordering of device state on the wire that
> > > > I want to be careful not to alter; for example RAM must come first is the one
> > > > I know.
> > > 
> > > Actually, I spoke too soon; testing this found a bad breakage.
> > > 
> > > the functions in savevm.c add the per-section headers, and then call the _complete
> > > methods on the devices.  Those _complete methods can't elect to do nothing, because
> > > a header has already been planted.
> > 
> > Hrm.. couldn't you move the test for presence of the hook earlier so
> > you don't sent the header if the hook is NULL?
> 
> There's two tests that you have to make:
>       a) in qemu_savevm_state_complete_precopy do you call save_live_complete_precopy
>       b) in qemu_savevm_state_complete_postcopy do you call save_live_complete_postcopy
> 
> The obvious case is if either hook is NULL you don't call it.
> (a) is the harder cases, if we're doing postcopy then we don't want to call the
>    save_live_complete_precopy method on a device which isn't expecting to complete until
>    postcopy.

Uh.. no, I'm expecting the complete_precopy hook to be called every
time if it's non-NULL.  If it's a postcopy item that doesn't expect to
finish at the end of precopy, it should put its code in the
complete_postcopy hook instead.

That should be fine for the non-postcopy case too, because the
complete_postcopy hook will be called momentarily.

>    The code in qemu_savevm_state_complete_precopy checks for the presence of the *postcopy*
>    hook, and doesn't emit the header or call the precopy commit if the postcopy hook
>    is present and we're in postcopy.
> 
> > > I've ended up with something between the two;  we still have a complete_precopy and
> > > complete_postcopy method on the devices; if the complete_postcopy method exists and
> > > we're in postcopy mode, the complete_precopy method isn't called at all.
> > > A device could decide to do something different in complete_postcopy from complete_precopy
> > > but it must do something to complete the section.
> > > Effectively the presence of the complete_postcopy is now doing what
> > > can_postcopy() used to do.
> > 
> > Hmm.. but it means there's no per-device hook for the precopy to
> > postcopy transition point.  I'm not sure if that might matter.
> 
> This is true, but if we needed a generic hook for that (which might be useful)
> it probably shouldn't be 'complete'.

Well, I'm thinking of these as "state (complete precopy)" rather than
"(state complete) precopy".  But yeah, a different name might be less
ambiguous.
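
For readers following the thread, a rough sketch of the check Dave
describes above (my own illustration; the helper and field names are
approximate, not the series code):

    /* Called for each SaveStateEntry at the end of the precopy phase. */
    static void complete_precopy_one(QEMUFile *f, SaveStateEntry *se,
                                     bool in_postcopy)
    {
        if (!se->ops || !se->ops->save_live_complete_precopy) {
            return;             /* nothing for this device to do here */
        }
        if (in_postcopy && se->ops->save_live_complete_postcopy) {
            /* This device finishes in the postcopy stage instead, so
             * don't plant a section header it would then have to fill. */
            return;
        }
        save_section_header(f, se);   /* hypothetical helper */
        se->ops->save_live_complete_precopy(f, se->opaque);
    }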

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-26 16:33     ` Dr. David Alan Gilbert
@ 2015-03-27  4:13       ` David Gibson
  2015-03-27 10:48         ` Dr. David Alan Gilbert
  2015-03-28 15:58         ` Paolo Bonzini
  0 siblings, 2 replies; 181+ messages in thread
From: David Gibson @ 2015-03-27  4:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Thu, Mar 26, 2015 at 04:33:28PM +0000, Dr. David Alan Gilbert wrote:
> (Only replying to some of the items in this mail - the others I'll get
> to another time).
> 
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Wed, Feb 25, 2015 at 04:51:40PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The state of the postcopy process is managed via a series of messages;
> > >    * Add wrappers and handlers for sending/receiving these messages
> > >    * Add state variable that track the current state of postcopy
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  include/migration/migration.h |  15 ++
> > >  include/sysemu/sysemu.h       |  23 +++
> > >  migration/migration.c         |  13 ++
> > >  savevm.c                      | 325 ++++++++++++++++++++++++++++++++++++++++++
> > >  trace-events                  |  11 ++
> > >  5 files changed, 387 insertions(+)
> > > 
> > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > index f94af5b..81cd1f2 100644
> > > --- a/include/migration/migration.h
> > > +++ b/include/migration/migration.h
> > > @@ -52,6 +52,14 @@ typedef struct MigrationState MigrationState;
> > >  
> > >  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> > >  
> > > +typedef enum {
> > > +    POSTCOPY_INCOMING_NONE = 0,  /* Initial state - no postcopy */
> > > +    POSTCOPY_INCOMING_ADVISE,
> > > +    POSTCOPY_INCOMING_LISTENING,
> > > +    POSTCOPY_INCOMING_RUNNING,
> > > +    POSTCOPY_INCOMING_END
> > > +} PostcopyState;
> > > +
> > >  /* State for the incoming migration */
> > >  struct MigrationIncomingState {
> > >      QEMUFile *file;
> > > @@ -59,6 +67,8 @@ struct MigrationIncomingState {
> > >      /* See savevm.c */
> > >      LoadStateEntry_Head loadvm_handlers;
> > >  
> > > +    PostcopyState postcopy_state;
> > > +
> > >      QEMUFile *return_path;
> > >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> > >  };
> > > @@ -219,4 +229,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
> > >                               ram_addr_t offset, size_t size,
> > >                               int *bytes_sent);
> > >  
> > > +PostcopyState postcopy_state_get(MigrationIncomingState *mis);
> > > +
> > > +/* Set the state and return the old state */
> > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > > +                                 PostcopyState new_state);
> > >  #endif
> > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > index 8da879f..d6a6d51 100644
> > > --- a/include/sysemu/sysemu.h
> > > +++ b/include/sysemu/sysemu.h
> > > @@ -87,6 +87,18 @@ enum qemu_vm_cmd {
> > >      MIG_CMD_INVALID = 0,       /* Must be 0 */
> > >      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
> > >      MIG_CMD_PING,              /* Request a PONG on the RP */
> > > +
> > > +    MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
> > > +                                      warn we might want to do PC */
> > > +    MIG_CMD_POSTCOPY_LISTEN,       /* Start listening for incoming
> > > +                                      pages as it's running. */
> > > +    MIG_CMD_POSTCOPY_RUN,          /* Start execution */
> > > +    MIG_CMD_POSTCOPY_END,          /* Postcopy is finished. */
> > > +
> > > +    MIG_CMD_POSTCOPY_RAM_DISCARD,  /* A list of pages to discard that
> > > +                                      were previously sent during
> > > +                                      precopy but are dirty. */
> > > +
> > >  };
> > >  
> > >  bool qemu_savevm_state_blocked(Error **errp);
> > > @@ -101,6 +113,17 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
> > >                                uint16_t len, uint8_t *data);
> > >  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> > >  void qemu_savevm_send_open_return_path(QEMUFile *f);
> > > +void qemu_savevm_send_postcopy_advise(QEMUFile *f);
> > > +void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> > > +void qemu_savevm_send_postcopy_run(QEMUFile *f);
> > > +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status);
> > > +
> > > +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> > > +                                           uint16_t len, uint8_t offset,
> > > +                                           uint64_t *addrlist,
> > > +                                           uint32_t *masklist);
> > > +
> > > +
> > >  int qemu_loadvm_state(QEMUFile *f);
> > >  
> > >  /* SLIRP */
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 434864a..957115a 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -971,3 +971,16 @@ void migrate_fd_connect(MigrationState *s)
> > >      qemu_thread_create(&s->thread, "migration", migration_thread, s,
> > >                         QEMU_THREAD_JOINABLE);
> > >  }
> > > +
> > > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> > > +{
> > > +    return atomic_fetch_add(&mis->postcopy_state, 0);
> > > +}
> > > +
> > > +/* Set the state and return the old state */
> > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > > +                                 PostcopyState new_state)
> > > +{
> > > +    return atomic_xchg(&mis->postcopy_state, new_state);
> > 
> > Is there anything explaining what the overall atomicity requirements
> > are for this state variable?  It's a bit hard to tell if an atomic
> > xchg is necessary or sufficient without a description of what the
> > overall concurrency scheme is with regards to this variable.
> 
> Can you tell me how to define the requirements?

Well, that is always the tricky question.

> It's a state variable tested and changed by at least two threads and
> it's got to go through a correct sequence of states.
> So generally you're doing a 'I expect to be in .... now change to ....'
> so the exchange works well for that.

So, in this case, it seems what might make sense as the basic
atomicity option here is a cmpxchg.  Your state_set function then takes
old_state and new_state: it atomically updates the state from old_state
to new_state, returning an error if somebody else changed the state away
from old_state in the meantime.

Then you can have a blurb next to the state_set helper explaining why
you need this atomic op for updates to the state variable from
multiple threads.
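
As a concrete sketch of that suggestion (illustrative names, not the
patch's code), the helper could become something like:

    /* Transition only if the state is still 'old_state'; otherwise
     * report the conflict back to the caller. */
    static int postcopy_state_transition(MigrationIncomingState *mis,
                                         PostcopyState old_state,
                                         PostcopyState new_state)
    {
        PostcopyState seen = atomic_cmpxchg(&mis->postcopy_state,
                                            old_state, new_state);

        if (seen != old_state) {
            error_report("postcopy state: expected %d but found %d",
                         old_state, seen);
            return -1;
        }
        return 0;
    }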

> > > + *  n x
> > > + *      be64   Page addresses for start of an invalidation range
> > > + *      be32   mask of 32 pages, '1' to discard'
> > 
> > Is the extra compactness from this semi-sparse bitmap encoding
> > actually worth it?  A simple list of page addresses, or address ranges
> > to discard would be substantially simpler to get one's head around,
> > and also seems like it might be more robust against future
> > implementation changes as a wire format.
> 
> As previously discussed I really think it is;  what I'm tending to
> see when I've been looking at these in debug is something that's
> sparse but tends to be blobby with sets of pages discarded near
> by.  However you do this you're going to have to walk this
> bitmap and format out some sort of set of messages.

I'd really need to see some numbers to be convinced.  The semi-sparse
bitmap introduces a huge amount of code and complexity at both ends.

Remember, using ranges, you can still coalesce adjacent discarded
pages.  Even using some compact differential encodings of the range
endpoints might work out simpler than the bitmap fragments.

Plus, ranges handle the host versus target page size distinctions very
naturally.
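
To make the comparison concrete, here is a rough sketch of the range
encoding being argued for (entirely my own illustration, not code from
either side): walk the discard bitmap and coalesce each run of set bits
into a (start, length) pair on the wire:

    static void send_discard_ranges(QEMUFile *f, const unsigned long *bitmap,
                                    unsigned long npages, size_t page_size)
    {
        unsigned long run_start = find_first_bit(bitmap, npages);

        while (run_start < npages) {
            unsigned long run_end = find_next_zero_bit(bitmap, npages,
                                                       run_start);

            qemu_put_be64(f, (uint64_t)run_start * page_size); /* start  */
            qemu_put_be64(f, run_end - run_start);             /* #pages */
            run_start = find_next_bit(bitmap, npages, run_end);
        }
    }

Whether that ends up smaller than the be64 + be32-mask records depends on
how "blobby" the discard pattern really is, which is exactly the data
being asked for here.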

[snip]
> > DISCARD will typically immediately precede LISTEN, won't it?  Is there
> > a reason not to put the discard data into the LISTEN command?
> 
> Discard data can be quite large, so I potentially send multiple discard
> commands.
> (Also as you can tell generally I've got a preference for one message does one
> thing, and thus I have tried to keep them separate).

So, I like the idea of one message per action in principle - but only
if that action really is well-defined without reference to what
operations come before and after it.  If there are hidden dependencies
about what actions have to come in what order, I'd rather bake that
into the command structure.

> > > +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> > > +                                         uint64_t remote_hps,
> > > +                                         uint64_t remote_tps)
> > > +{
> > > +    PostcopyState ps = postcopy_state_get(mis);
> > > +    trace_loadvm_postcopy_handle_advise();
> > > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > > +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
> > > +        return -1;
> > > +    }
> > > +
> > > +    if (remote_hps != getpagesize())  {
> > > +        /*
> > > +         * Some combinations of mismatch are probably possible but it gets
> > > +         * a bit more complicated.  In particular we need to place whole
> > > +         * host pages on the dest at once, and we need to ensure that we
> > > +         * handle dirtying to make sure we never end up sending part of
> > > +         * a hostpage on it's own.
> > > +         */
> > > +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
> > > +                     (int)remote_hps, getpagesize());
> > > +        return -1;
> > > +    }
> > > +
> > > +    if (remote_tps != (1ul << qemu_target_page_bits())) {
> > > +        /*
> > > +         * Again, some differences could be dealt with, but for now keep it
> > > +         * simple.
> > > +         */
> > > +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
> > > +                     (int)remote_tps, 1 << qemu_target_page_bits());
> > > +        return -1;
> > > +    }
> > > +
> > > +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
> > 
> > Should you be checking the return value here to make sure it's still
> > POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if you still have
> > a race between the fetch at the top and the set here.
> > 
> > Or, in fact, should you be just doing an atomic exchange-then-check at
> > the top, rather than checking at the top, then changing at the bottom.
> 
> There's no race at this point yet; going from None->advise we still only
> have one thread.  The check at the top is a check against protocol
> violations (e.g. getting two advise or something like that).

Yeah.. and having to have intimate knowledge of the thread structure
at each point in order to analyze the atomic operation correctness is
exactly what bothers me about it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-27  4:13       ` David Gibson
@ 2015-03-27 10:48         ` Dr. David Alan Gilbert
  2015-03-28 16:00           ` Paolo Bonzini
  2015-03-30  4:03           ` David Gibson
  2015-03-28 15:58         ` Paolo Bonzini
  1 sibling, 2 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-27 10:48 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Thu, Mar 26, 2015 at 04:33:28PM +0000, Dr. David Alan Gilbert wrote:
> > (Only replying to some of the items in this mail - the others I'll get
> > to another time).
> > 
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:40PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > The state of the postcopy process is managed via a series of messages;
> > > >    * Add wrappers and handlers for sending/receiving these messages
> > > >    * Add state variable that track the current state of postcopy
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  include/migration/migration.h |  15 ++
> > > >  include/sysemu/sysemu.h       |  23 +++
> > > >  migration/migration.c         |  13 ++
> > > >  savevm.c                      | 325 ++++++++++++++++++++++++++++++++++++++++++
> > > >  trace-events                  |  11 ++
> > > >  5 files changed, 387 insertions(+)
> > > > 
> > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > index f94af5b..81cd1f2 100644
> > > > --- a/include/migration/migration.h
> > > > +++ b/include/migration/migration.h
> > > > @@ -52,6 +52,14 @@ typedef struct MigrationState MigrationState;
> > > >  
> > > >  typedef QLIST_HEAD(, LoadStateEntry) LoadStateEntry_Head;
> > > >  
> > > > +typedef enum {
> > > > +    POSTCOPY_INCOMING_NONE = 0,  /* Initial state - no postcopy */
> > > > +    POSTCOPY_INCOMING_ADVISE,
> > > > +    POSTCOPY_INCOMING_LISTENING,
> > > > +    POSTCOPY_INCOMING_RUNNING,
> > > > +    POSTCOPY_INCOMING_END
> > > > +} PostcopyState;
> > > > +
> > > >  /* State for the incoming migration */
> > > >  struct MigrationIncomingState {
> > > >      QEMUFile *file;
> > > > @@ -59,6 +67,8 @@ struct MigrationIncomingState {
> > > >      /* See savevm.c */
> > > >      LoadStateEntry_Head loadvm_handlers;
> > > >  
> > > > +    PostcopyState postcopy_state;
> > > > +
> > > >      QEMUFile *return_path;
> > > >      QemuMutex      rp_mutex;    /* We send replies from multiple threads */
> > > >  };
> > > > @@ -219,4 +229,9 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
> > > >                               ram_addr_t offset, size_t size,
> > > >                               int *bytes_sent);
> > > >  
> > > > +PostcopyState postcopy_state_get(MigrationIncomingState *mis);
> > > > +
> > > > +/* Set the state and return the old state */
> > > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > > > +                                 PostcopyState new_state);
> > > >  #endif
> > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > index 8da879f..d6a6d51 100644
> > > > --- a/include/sysemu/sysemu.h
> > > > +++ b/include/sysemu/sysemu.h
> > > > @@ -87,6 +87,18 @@ enum qemu_vm_cmd {
> > > >      MIG_CMD_INVALID = 0,       /* Must be 0 */
> > > >      MIG_CMD_OPEN_RETURN_PATH,  /* Tell the dest to open the Return path */
> > > >      MIG_CMD_PING,              /* Request a PONG on the RP */
> > > > +
> > > > +    MIG_CMD_POSTCOPY_ADVISE = 20,  /* Prior to any page transfers, just
> > > > +                                      warn we might want to do PC */
> > > > +    MIG_CMD_POSTCOPY_LISTEN,       /* Start listening for incoming
> > > > +                                      pages as it's running. */
> > > > +    MIG_CMD_POSTCOPY_RUN,          /* Start execution */
> > > > +    MIG_CMD_POSTCOPY_END,          /* Postcopy is finished. */
> > > > +
> > > > +    MIG_CMD_POSTCOPY_RAM_DISCARD,  /* A list of pages to discard that
> > > > +                                      were previously sent during
> > > > +                                      precopy but are dirty. */
> > > > +
> > > >  };
> > > >  
> > > >  bool qemu_savevm_state_blocked(Error **errp);
> > > > @@ -101,6 +113,17 @@ void qemu_savevm_command_send(QEMUFile *f, enum qemu_vm_cmd command,
> > > >                                uint16_t len, uint8_t *data);
> > > >  void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
> > > >  void qemu_savevm_send_open_return_path(QEMUFile *f);
> > > > +void qemu_savevm_send_postcopy_advise(QEMUFile *f);
> > > > +void qemu_savevm_send_postcopy_listen(QEMUFile *f);
> > > > +void qemu_savevm_send_postcopy_run(QEMUFile *f);
> > > > +void qemu_savevm_send_postcopy_end(QEMUFile *f, uint8_t status);
> > > > +
> > > > +void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
> > > > +                                           uint16_t len, uint8_t offset,
> > > > +                                           uint64_t *addrlist,
> > > > +                                           uint32_t *masklist);
> > > > +
> > > > +
> > > >  int qemu_loadvm_state(QEMUFile *f);
> > > >  
> > > >  /* SLIRP */
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 434864a..957115a 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -971,3 +971,16 @@ void migrate_fd_connect(MigrationState *s)
> > > >      qemu_thread_create(&s->thread, "migration", migration_thread, s,
> > > >                         QEMU_THREAD_JOINABLE);
> > > >  }
> > > > +
> > > > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> > > > +{
> > > > +    return atomic_fetch_add(&mis->postcopy_state, 0);
> > > > +}
> > > > +
> > > > +/* Set the state and return the old state */
> > > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > > > +                                 PostcopyState new_state)
> > > > +{
> > > > +    return atomic_xchg(&mis->postcopy_state, new_state);
> > > 
> > > Is there anything explaining what the overall atomicity requirements
> > > are for this state variable?  It's a bit hard to tell if an atomic
> > > xchg is necessary or sufficient without a description of what the
> > > overall concurrency scheme is with regards to this variable.
> > 
> > Can you tell me how to define the requirements?
> 
> Well, that is always the tricky question.
> 
> > It's a state variable tested and changed by at least two threads and
> > it's got to go through a correct sequence of states.
> > So generally you're doing a 'I expect to be in .... now change to ....'
> > so the exchange works well for that.
> 
> So.. in this case.  It seems what might make sense as a basic
> atomicity option here is a cmpxchg.  So your state_set function takes
> old_state and new_state.  It will atomically update old_state to
> new_state, returning an error if somebody else changed the state away
> from old_state in the meantime.
> 
> Then you can have a blurb next to the state_set helper explaining why
> you need this atomic op for updates to the state variable from
> multiple threads.

I can document that next to state_set, and I'll look at whether it's 
possible to move the error check down to there.

> > > > + *  n x
> > > > + *      be64   Page addresses for start of an invalidation range
> > > > + *      be32   mask of 32 pages, '1' to discard'
> > > 
> > > Is the extra compactness from this semi-sparse bitmap encoding
> > > actually worth it?  A simple list of page addresses, or address ranges
> > > to discard would be substantially simpler to get one's head around,
> > > and also seems like it might be more robust against future
> > > implementation changes as a wire format.
> > 
> > As previously discussed I really think it is;  what I'm tending to
> > see when I've been looking at these in debug is something that's
> > sparse but tends to be blobby with sets of pages discarded near
> > by.  However you do this you're going to have to walk this
> > bitmap and format out some sort of set of messages.
> 
> I'd really need to see some numbers to be convinced.  The semi-sparse
> bitmap introduces a huge amount of code and complexity at both ends.
> 
> Remember, using ranges, you can still coalesce adjacent discarded
> pages.  Even using some compat differential encodings of the range
> endpoints might work out simpler than the bitmap fragments.
> 
> Plus, ranges handle the host versus target page size distinctions very
> naturally.

I guess I'm going to have to implement this and compare then; what format
do you want?

> [snip]
> > > DISCARD will typically immediately precede LISTEN, won't it?  Is there
> > > a reason not to put the discard data into the LISTEN command?
> > 
> > Discard data can be quite large, so I potentially send multiple discard
> > commands.
> > (Also as you can tell generally I've got a preference for one message does one
> > thing, and thus I have tried to keep them separate).
> 
> So, I like the idea of one message per action in principle - but only
> if that action really is well-defined without reference to what
> operations come before and after it.  If there are hidden dependencies
> about what actions have to come in what order, I'd rather bake that
> into the command structure.

In no way is it hidden; the commands match the state transitions.
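
For reference, the match being referred to reads roughly like this (my
own summary inferred from the quoted enum and command list, not text from
the patch):

    MIG_CMD_POSTCOPY_ADVISE       NONE      -> ADVISE
    MIG_CMD_POSTCOPY_RAM_DISCARD  ADVISE    -> ADVISE    (0..many, before LISTEN)
    MIG_CMD_POSTCOPY_LISTEN       ADVISE    -> LISTENING
    MIG_CMD_POSTCOPY_RUN          LISTENING -> RUNNING
    MIG_CMD_POSTCOPY_END          RUNNING   -> END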

> > > > +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> > > > +                                         uint64_t remote_hps,
> > > > +                                         uint64_t remote_tps)
> > > > +{
> > > > +    PostcopyState ps = postcopy_state_get(mis);
> > > > +    trace_loadvm_postcopy_handle_advise();
> > > > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > > > +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    if (remote_hps != getpagesize())  {
> > > > +        /*
> > > > +         * Some combinations of mismatch are probably possible but it gets
> > > > +         * a bit more complicated.  In particular we need to place whole
> > > > +         * host pages on the dest at once, and we need to ensure that we
> > > > +         * handle dirtying to make sure we never end up sending part of
> > > > +         * a hostpage on it's own.
> > > > +         */
> > > > +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
> > > > +                     (int)remote_hps, getpagesize());
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    if (remote_tps != (1ul << qemu_target_page_bits())) {
> > > > +        /*
> > > > +         * Again, some differences could be dealt with, but for now keep it
> > > > +         * simple.
> > > > +         */
> > > > +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
> > > > +                     (int)remote_tps, 1 << qemu_target_page_bits());
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
> > > 
> > > Should you be checking the return value here to make sure it's still
> > > POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if you still have
> > > a race between the fetch at the top and the set here.
> > > 
> > > Or, in fact, should you be just doing an atomic exchange-then-check at
> > > the top, rather than checking at the top, then changing at the bottom.
> > 
> > There's no race at this point yet; going from None->advise we still only
> > have one thread.  The check at the top is a check against protocol
> > violations (e.g. getting two advise or something like that).
> 
> Yeah.. and having to have intimate knowledge of the thread structure
> at each point in order to analyze the atomic operation correctness is
> exactly what bothers me about it.

No, you don't; by always using postcopy_state_set you don't need to
worry about the thread structure or the protocol to understand the
correctness of the atomic operation.  The only reason you got into that
comparison is because you worried that it might be overkill in this case;
by keeping it consistent, as it already is, you don't need to understand
the thread structure.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's Dr. David Alan Gilbert (git)
  2015-03-10  2:56   ` David Gibson
@ 2015-03-28 15:30   ` Paolo Bonzini
  2015-03-29  4:07     ` David Gibson
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 15:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, yanghy, david



On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> +            if (err != EAGAIN) {

if (err != EAGAIN && err != EWOULDBLOCK)

Paolo


* Re: [Qemu-devel] [PATCH v5 10/45] Return path: Control commands
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 10/45] Return path: Control commands Dr. David Alan Gilbert (git)
  2015-03-10  5:40   ` David Gibson
@ 2015-03-28 15:32   ` Paolo Bonzini
  2015-03-30 17:34     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 15:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, yanghy, david



On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> 
> Add two src->dest commands:
>    * OPEN_RETURN_PATH - To request that the destination open the return path
>    * SEND_PING - Request an acknowledge from the destination

It's just PING, not SEND_PING.

Paolo


* Re: [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source
  2015-03-10  5:47   ` David Gibson
  2015-03-10 14:34     ` Dr. David Alan Gilbert
@ 2015-03-28 15:34     ` Paolo Bonzini
  1 sibling, 0 replies; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 15:34 UTC (permalink / raw)
  To: David Gibson, Dr. David Alan Gilbert (git)
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 10/03/2015 06:47, David Gibson wrote:
>> > +void migrate_send_rp_message(MigrationIncomingState *mis,
>> > +                             enum mig_rpcomm_cmd cmd,
>> > +                             uint16_t len, uint8_t *data)
> Using (void *) for data would avoid casts in a bunch of the callers.

Could also use uint8_t in the callers and replace cpu_to_be32 with
stl_be_p.  This makes it harder to send stuff with the wrong endianness.
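
For illustration (the buffer, command and value here are made up, not
from the patch), a caller would then look like:

    uint8_t buf[4];

    /* Store straight into the buffer in big-endian order; there is no
     * host-order temporary left around to send by mistake. */
    stl_be_p(buf, value);
    migrate_send_rp_message(mis, cmd, sizeof(buf), buf);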

Paolo


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages Dr. David Alan Gilbert (git)
  2015-03-12  9:30   ` David Gibson
@ 2015-03-28 15:43   ` Paolo Bonzini
  2015-03-30 17:46     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 15:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel
  Cc: aarcange, yamahata, quintela, amit.shah, yanghy, david



On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> +{
> +    return atomic_fetch_add(&mis->postcopy_state, 0);
> +}
> +
> +/* Set the state and return the old state */
> +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> +                                 PostcopyState new_state)
> +{
> +    return atomic_xchg(&mis->postcopy_state, new_state);
> +}

Which are the (multiple) threads calling these functions?

Paolo


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-27  4:13       ` David Gibson
  2015-03-27 10:48         ` Dr. David Alan Gilbert
@ 2015-03-28 15:58         ` Paolo Bonzini
  1 sibling, 0 replies; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 15:58 UTC (permalink / raw)
  To: David Gibson, Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 27/03/2015 05:13, David Gibson wrote:
>> It's a state variable tested and changed by at least two threads
>> and it's got to go through a correct sequence of states. So
>> generally you're doing a 'I expect to be in .... now change to
>> ....' so the exchange works well for that.
> 
> So.. in this case.  It seems what might make sense as a basic 
> atomicity option here is a cmpxchg.  So your state_set function
> takes old_state and new_state.  It will atomically update old_state
> to new_state, returning an error if somebody else changed the state
> away from old_state in the meantime.
> 
> Then you can have a blurb next to the state_set helper explaining
> why you need this atomic op for updates to the state variable from 
> multiple threads.

I agree.

My humble suggestion is:

- document the transitions, see this existing example in thread-pool.c:

    /* Moving state out of THREAD_QUEUED is protected by lock.  After
     * that, only the worker thread can write to it.  Reads and writes
     * of state and ret are ordered with memory barriers.
     */
    enum ThreadState state;

The outgoing migration state fails to document this.  My fault.

- perhaps you can use a pattern similar to what is used for the
outgoing migration state.  There one thread does the work, and the
second (could be "the others", e.g. all VCPU threads) affects the first
just by triggering changes in the migration state.  Transitions in the
first thread use cmpxchg without a loop---if it fails it means the
second thread intervened, and the thread moves on to do whatever the
second thread requested.  Transitions in the second thread use cmpxchg
within a loop, in case there was a concurrent change.
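
A minimal sketch of those two patterns (illustrative only; the outgoing
migration code in migration.c is the real reference):

    /* Worker thread: a single cmpxchg, no loop.  If it fails, another
     * thread changed the state, and the worker follows that instead. */
    static bool worker_try_transition(int *state, int from, int to)
    {
        return atomic_cmpxchg(state, from, to) == from;
    }

    /* Other threads: retry until the write lands, in case the worker
     * changed the state concurrently. */
    static void request_transition(int *state, int to)
    {
        int old = atomic_read(state);

        while (atomic_cmpxchg(state, old, to) != old) {
            old = atomic_read(state);
        }
    }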

Paolo

>>>> + *  n x
>>>> + *      be64   Page addresses for start of an invalidation range
>>>> + *      be32   mask of 32 pages, '1' to discard'
>>> 
>>> Is the extra compactness from this semi-sparse bitmap encoding 
>>> actually worth it?  A simple list of page addresses, or address
>>> ranges to discard would be substantially simpler to get one's
>>> head around, and also seems like it might be more robust
>>> against future implementation changes as a wire format.
>> 
>> As previously discussed I really think it is;  what I'm tending
>> to see when I've been looking at these in debug is something
>> that's sparse but tends to be blobby with sets of pages discarded
>> near by.  However you do this you're going to have to walk this 
>> bitmap and format out some sort of set of messages.
> 
> I'd really need to see some numbers to be convinced.  The
> semi-sparse bitmap introduces a huge amount of code and complexity
> at both ends.
> 
> Remember, using ranges, you can still coalesce adjacent discarded 
> pages.  Even using some compat differential encodings of the range 
> endpoints might work out simpler than the bitmap fragments.
> 
> Plus, ranges handle the host versus target page size distinctions
> very naturally.
> 
> [snip]
>>> DISCARD will typically immediately precede LISTEN, won't it?
>>> Is there a reason not to put the discard data into the LISTEN
>>> command?
>> 
>> Discard data can be quite large, so I potentially send multiple
>> discard commands. (Also as you can tell generally I've got a
>> preference for one message does one thing, and thus I have tried
>> to keep them separate).
> 
> So, I like the idea of one message per action in principle - but
> only if that action really is well-defined without reference to
> what operations come before and after it.  If there are hidden
> dependencies about what actions have to come in what order, I'd
> rather bake that into the command structure.
> 
>>>> +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
>>>> +                                         uint64_t remote_hps,
>>>> +                                         uint64_t remote_tps)
>>>> +{
>>>> +    PostcopyState ps = postcopy_state_get(mis);
>>>> +    trace_loadvm_postcopy_handle_advise();
>>>> +    if (ps != POSTCOPY_INCOMING_NONE) {
>>>> +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (remote_hps != getpagesize())  {
>>>> +        /*
>>>> +         * Some combinations of mismatch are probably possible but it gets
>>>> +         * a bit more complicated.  In particular we need to place whole
>>>> +         * host pages on the dest at once, and we need to ensure that we
>>>> +         * handle dirtying to make sure we never end up sending part of
>>>> +         * a hostpage on it's own.
>>>> +         */
>>>> +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
>>>> +                     (int)remote_hps, getpagesize());
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (remote_tps != (1ul << qemu_target_page_bits())) {
>>>> +        /*
>>>> +         * Again, some differences could be dealt with, but for now keep it
>>>> +         * simple.
>>>> +         */
>>>> +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
>>>> +                     (int)remote_tps, 1 << qemu_target_page_bits());
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
>>> 
>>> Should you be checking the return value here to make sure it's
>>> still POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if
>>> you still have a race between the fetch at the top and the set
>>> here.
>>> 
>>> Or, in fact, should you be just doing an atomic
>>> exchange-then-check at the top, rather than checking at the
>>> top, then changing at the bottom.
>> 
>> There's no race at this point yet; going from None->advise we
>> still only have one thread.  The check at the top is a check
>> against protocol violations (e.g. getting two advise or something
>> like that).
> 
> Yeah.. and having to have intimate knowledge of the thread
> structure at each point in order to analyze the atomic operation
> correctness is exactly what bothers me about it.
> 


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-27 10:48         ` Dr. David Alan Gilbert
@ 2015-03-28 16:00           ` Paolo Bonzini
  2015-03-30  4:03           ` David Gibson
  1 sibling, 0 replies; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-28 16:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 27/03/2015 11:48, Dr. David Alan Gilbert wrote:
> > So, I like the idea of one message per action in principle - but only
> > if that action really is well-defined without reference to what
> > operations come before and after it.  If there are hidden dependencies
> > about what actions have to come in what order, I'd rather bake that
> > into the command structure.
> 
> In no way is it hidden; the commands match the state transitions.

So is there only one writer thread?  In which states can the other
threads read the postcopy state, and why is sequential consistency
important? That is, in threads other than the writer, how does reading
the state interact with other reads and writes?

Paolo


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-28 15:30   ` Paolo Bonzini
@ 2015-03-29  4:07     ` David Gibson
  2015-03-29  9:03       ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: David Gibson @ 2015-03-29  4:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, Dr. David Alan Gilbert (git),
	qemu-devel, amit.shah, yanghy


On Sat, Mar 28, 2015 at 04:30:06PM +0100, Paolo Bonzini wrote:
> 
> 
> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> > +            if (err != EAGAIN) {
> 
> if (err != EAGAIN && err != EWOULDBLOCK)

I assume that's for the benefit of non-Linux hosts?  On Linux EAGAIN
== EWOULDBLOCK.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-29  4:07     ` David Gibson
@ 2015-03-29  9:03       ` Paolo Bonzini
  2015-03-30 16:50         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-29  9:03 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel,
	Dr. David Alan Gilbert (git),
	amit.shah, yanghy



On 29/03/2015 06:07, David Gibson wrote:
> On Sat, Mar 28, 2015 at 04:30:06PM +0100, Paolo Bonzini wrote:
>> 
>> 
>> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
>>> +            if (err != EAGAIN) {
>> 
>> if (err != EAGAIN && err != EWOULDBLOCK)
> 
> I assume that's for the benefit of non-Linux hosts?  On Linux
> EAGAIN == EWOULDBLOCK.

Yes, that's just the standard idiom in QEMU.  This is generic code, so
assumptions based on the host platform are not wise. :)
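
For anyone following along, the shape being asked for is roughly (a
sketch with made-up variable names, not the patch itself):

    len = writev(fd, iov, iovcnt);
    if (len < 0) {
        int err = errno;

        if (err != EAGAIN && err != EWOULDBLOCK) {
            return -err;     /* a real error, give up */
        }
        /* Not ready: wait for the fd to become writable and retry,
         * even though the fd is non-blocking. */
    }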

Paolo


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-27 10:48         ` Dr. David Alan Gilbert
  2015-03-28 16:00           ` Paolo Bonzini
@ 2015-03-30  4:03           ` David Gibson
  1 sibling, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-30  4:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Fri, Mar 27, 2015 at 10:48:52AM +0000, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Thu, Mar 26, 2015 at 04:33:28PM +0000, Dr. David Alan Gilbert wrote:
> > > (Only replying to some of the items in this mail - the others I'll get
> > > to another time).
[snip]
> > > > DISCARD will typically immediately precede LISTEN, won't it?  Is there
> > > > a reason not to put the discard data into the LISTEN command?
> > > 
> > > Discard data can be quite large, so I potentially send multiple discard
> > > commands.
> > > (Also as you can tell generally I've got a preference for one message does one
> > > thing, and thus I have tried to keep them separate).
> > 
> > So, I like the idea of one message per action in principle - but only
> > if that action really is well-defined without reference to what
> > operations come before and after it.  If there are hidden dependencies
> > about what actions have to come in what order, I'd rather bake that
> > into the command structure.
> 
> In no way is it hidden; the commands match the state transitions.

Not all of them.  In particular DISCARD's relation to state
transitions is not at all obvious.  Likewise I don't think the
connection of LISTEN to the internal state machines at each end is
terribly clear.

> > > > > +static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
> > > > > +                                         uint64_t remote_hps,
> > > > > +                                         uint64_t remote_tps)
> > > > > +{
> > > > > +    PostcopyState ps = postcopy_state_get(mis);
> > > > > +    trace_loadvm_postcopy_handle_advise();
> > > > > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > > > > +        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    if (remote_hps != getpagesize())  {
> > > > > +        /*
> > > > > +         * Some combinations of mismatch are probably possible but it gets
> > > > > +         * a bit more complicated.  In particular we need to place whole
> > > > > +         * host pages on the dest at once, and we need to ensure that we
> > > > > +         * handle dirtying to make sure we never end up sending part of
> > > > > +         * a hostpage on it's own.
> > > > > +         */
> > > > > +        error_report("Postcopy needs matching host page sizes (s=%d d=%d)",
> > > > > +                     (int)remote_hps, getpagesize());
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    if (remote_tps != (1ul << qemu_target_page_bits())) {
> > > > > +        /*
> > > > > +         * Again, some differences could be dealt with, but for now keep it
> > > > > +         * simple.
> > > > > +         */
> > > > > +        error_report("Postcopy needs matching target page sizes (s=%d d=%d)",
> > > > > +                     (int)remote_tps, 1 << qemu_target_page_bits());
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);
> > > > 
> > > > Should you be checking the return value here to make sure it's still
> > > > POSTCOPY_INCOMING_NONE?  Atomic xchgs seem overkill if you still have
> > > > a race between the fetch at the top and the set here.
> > > > 
> > > > Or, in fact, should you be just doing an atomic exchange-then-check at
> > > > the top, rather than checking at the top, then changing at the bottom.
> > > 
> > > There's no race at this point yet; going from None->advise we still only
> > > have one thread.  The check at the top is a check against protocol
> > > violations (e.g. getting two advise or something like that).
> > 
> > Yeah.. and having to have intimate knowledge of the thread structure
> > at each point in order to analyze the atomic operation correctness is
> > exactly what bothers me about it.
> 
> No, you don't; By always using the postcopy_state_set you don't need
> to worry about the thread structure or protocol to understand the atomic operation
> correctness.  The only reason you got into that comparison is because
> you worried that it might be overkill in this case; by keeping it consistent
> like it already is then you don't need to understand the thread structure.

You really do, though.  You may be using the same function in the
original, but not the same idiom: in this case you don't check the
return value and deal with it accordingly.

Because of that, this looks pretty much exactly like the classic
"didn't understand what was atomic in the atomic" race condition, and
you need to know that there's only one thread at this point to realize
it's correct after all.
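
The exchange-then-check shape being described would look roughly like
this in the advise handler (a sketch built from the quoted code, not the
actual patch):

    /* Do the transition first, then validate the state we displaced;
     * the atomic op itself then enforces the protocol, regardless of
     * how many threads exist at this point. */
    PostcopyState ps = postcopy_state_set(mis, POSTCOPY_INCOMING_ADVISE);

    if (ps != POSTCOPY_INCOMING_NONE) {
        error_report("CMD_POSTCOPY_ADVISE in wrong postcopy state (%d)", ps);
        return -1;
    }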

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-24 22:32               ` David Gibson
  2015-03-25 15:00                 ` Dr. David Alan Gilbert
@ 2015-03-30  8:10                 ` Paolo Bonzini
  2015-03-31  0:10                   ` David Gibson
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30  8:10 UTC (permalink / raw)
  To: David Gibson, Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 24/03/2015 23:32, David Gibson wrote:
>>> OK, that one I prefer.  Are you OK with:
>>>     qemu_savevm_state_complete_precopy
>>>        calls -> save_live_complete_precopy
>>>
>>>     qemu_savevm_state_complete_postcopy
>>>        calls -> save_live_complete_postcopy
>>>
>>> ?
> Sounds ok to me.  Fwiw, I was thinking that both the
> complete_precopy and complete_postcopy hooks should always be
> called.  For a non-postcopy migration, the postcopy hooks would
> just be called immediately after the precopy hooks.

What about then calling them save_live_after_precopy and
save_live_complete, or having save_live_before_postcopy and
save_live_complete?

Paolo


* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-13 10:41     ` Dr. David Alan Gilbert
  2015-03-16  6:22       ` David Gibson
@ 2015-03-30  8:14       ` Paolo Bonzini
  2015-03-30 14:07         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30  8:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 13/03/2015 11:41, Dr. David Alan Gilbert wrote:
>>> > > +#ifdef HOST_X86_64
>>> > > +#ifndef __NR_userfaultfd
>>> > > +#define __NR_userfaultfd 323
>> > 
>> > Sholdn't this come from the kernel headers imported in the previous
>> > patch?  Rather than having an arch-specific hack.
> The header, like the rest of the kernel headers, just provides
> the constant and structure definitions for the call; the syscall numbers
> come from arch specific headers.  I guess in the final world I wouldn't
> need this at all since it'll come from the system headers; but what's
> the right way to put this in for new syscalls?
> 

You would just require new _installed_ kernel headers.  Then you can use
linux/userfaultfd.h and syscall.h (the latter from glibc, includes
asm/unistd.h to get syscall numbers).

linux-headers/ is useful for APIs that do not require system calls, or
for APIs that are extensible.  However, if a system call is required
(and mandatory) it's simpler to just use installed headers.
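
A minimal sketch of that (assuming installed kernel headers new enough to
define __NR_userfaultfd):

    #include <sys/syscall.h>   /* glibc; pulls in asm/unistd.h for __NR_* */
    #include <unistd.h>
    #ifdef __NR_userfaultfd
    #include <linux/userfaultfd.h>

    static int userfaultfd(int flags)
    {
        return syscall(__NR_userfaultfd, flags);
    }
    #endif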

Paolo


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-19  9:33             ` Dr. David Alan Gilbert
  2015-03-23  2:20               ` David Gibson
@ 2015-03-30  8:17               ` Paolo Bonzini
  2015-03-31  2:23                 ` David Gibson
  1 sibling, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30  8:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 19/03/2015 10:33, Dr. David Alan Gilbert wrote:
>> > But still pointless.  Atomicity isn't magic pixie dust; it only makes
>> > sense if you're making atomic specific operations that need to be.
>> > Simple integer loads and stores are already atomic.  Unless at least
>> > some of the atomic operations are something more complex, there's
>> > really no point to atomically marked operations.
> OK, I'll kill it off.

No, don't.

And both of you, read docs/atomics.txt.

"atomic_read() and atomic_set() prevents the compiler from using
optimizations that might otherwise optimize accesses out of existence
on the one hand, or that might create unsolicited accesses on the other.
[...] it tells readers which variables are shared with
other threads, and which are local to the current thread or protected
by other, more mundane means."

atomic_read() and atomic_set() provide the same guarantees as
ACCESS_ONCE in the Linux kernel.
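
Applied to the case at hand, the guarantee matters for something like
the polling loop on the migration thread (a sketch; the flag name is
illustrative):

    /* Monitor/QMP thread: */
    atomic_set(&ms->start_postcopy, true);

    /* Migration thread: atomic_read() stops the compiler hoisting the
     * load out of the loop, so the store above is eventually seen. */
    while (!atomic_read(&ms->start_postcopy)) {
        /* ... keep doing precopy iterations ... */
    }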

Paolo


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-23  2:20               ` David Gibson
@ 2015-03-30  8:19                 ` Paolo Bonzini
  2015-03-30 17:04                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30  8:19 UTC (permalink / raw)
  To: David Gibson, Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 23/03/2015 03:20, David Gibson wrote:
>>> 1) There's no barrier after the write, so there's no guarantee
>>> the other thread will eventually see it (in practice we've got
>>> other pthread ops we take so we will get a barrier somewhere,
>>> and most CPUs eventually do propagate the store).
> Sorry, I should have been clearer.  If you need a memory barrier,
> by all means include a memory barrier.  But that should be
> explicit: atomic set/read operations often include barriers, but
> it's not obvious which side will include what barrier.

Memory barriers are not needed here.  The variable is set
independently from every other set.  There's no ordering.

atomic_read/atomic_set do not provide sequential consistency.  That's
ensured instead by atomic_mb_read/atomic_mb_set (and you're right that
it's not obvious which side will include barriers, so you have to use
the two together).

>>> 2) The read side could legally be optimised out of the loop by
>>> the compiler. (but in practice wont be because compilers won't
>>> optimise that far).
> That one's a trickier question.  Compilers are absolutely capable
> of optimizing that far, *but* the C rules about when it's allowed
> to assume in-memory values remain unchanged are pretty
> conservative.  I think any function call in the loop will require
> it to reload the value, for example.  That said, a (compiler only)
> memory barrier might be appropriate to ensure that reload.

That's exactly what atomic_read provides.

Paolo


* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-03-26 11:05     ` Dr. David Alan Gilbert
@ 2015-03-30  8:31       ` Paolo Bonzini
  2015-04-13 11:35         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30  8:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy



On 26/03/2015 12:05, Dr. David Alan Gilbert wrote:
>>> > > +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
>>> > > +    *old_vm_running = runstate_is_running();
>> > 
>> > I think that needs some explanation.  Why are you doing a wakeup on
>> > the source host?
> This matches the existing code in migration_thread for the end of precopy;
> Paolo's explanation of what it does is here:
> https://lists.gnu.org/archive/html/qemu-devel/2014-08/msg04880.html

The more I look at it, the more I'm convinced it's working by chance or
not working at all.

Here we probably need to do only the notifier_list_notify +
qapi_event_send_wakeup.

Paolo


* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-30  8:14       ` Paolo Bonzini
@ 2015-03-30 14:07         ` Dr. David Alan Gilbert
  2015-03-30 14:09           ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-30 14:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 13/03/2015 11:41, Dr. David Alan Gilbert wrote:
> >>> > > +#ifdef HOST_X86_64
> >>> > > +#ifndef __NR_userfaultfd
> >>> > > +#define __NR_userfaultfd 323
> >> > 
> >> > Sholdn't this come from the kernel headers imported in the previous
> >> > patch?  Rather than having an arch-specific hack.
> > The header, like the rest of the kernel headers, just provides
> > the constant and structure definitions for the call; the syscall numbers
> > come from arch specific headers.  I guess in the final world I wouldn't
> > need this at all since it'll come from the system headers; but what's
> > the right way to put this in for new syscalls?
> > 
> 
> You would just require new _installed_ kernel headers.  Then you can use
> linux/userfaultfd.h and syscall.h (the latter from glibc, includes
> asm/unistd.h to get syscall numbers).
> 
> linux-headers/ is useful for APIs that do not require system calls, or
> for APIs that are extensible.  However, if a system call is required
> (and mandatory) it's simpler to just use installed headers.

OK, so then I could check for #ifdef __NR_userfaultfd and then
do the include and I think that would be safe.
Although then what's the best way to tell people to try it out
without an updated libc?

Or is it best to modify ./configure to detect it?

Dave


> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-30 14:07         ` Dr. David Alan Gilbert
@ 2015-03-30 14:09           ` Paolo Bonzini
  2015-06-16 10:49             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30 14:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson



On 30/03/2015 16:07, Dr. David Alan Gilbert wrote:
>>> > > 
>> > 
>> > You would just require new _installed_ kernel headers.  Then you can use
>> > linux/userfaultfd.h and syscall.h (the latter from glibc, includes
>> > asm/unistd.h to get syscall numbers).
>> > 
>> > linux-headers/ is useful for APIs that do not require system calls, or
>> > for APIs that are extensible.  However, if a system call is required
>> > (and mandatory) it's simpler to just use installed headers.
> OK, so then I could check for #ifdef __NR_userfaultfd and then
> do the include and I think that would be safe.

I think it's okay.  First include syscall.h, then include
linux/userfaultfd.h under #ifdef.
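
I.e. something like this (a sketch only; the #ifdef also answers the
build-detection question, since the userfault-specific code can simply
compile out when the installed headers are too old to define
__NR_userfaultfd):

    #include <sys/syscall.h>        /* may define __NR_userfaultfd */
    #ifdef __NR_userfaultfd
    #include <linux/userfaultfd.h>
    /* ... real userfault-based postcopy support ... */
    #else
    /* ... stubs that report "postcopy not supported on this host" ... */
    #endif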

> Although then what's the best way to tell people to try it out
> without an updated libc?

They don't need an updated libc, just an updated kernel.  syscall.h is
just a wrapper around Linux headers.

Paolo

> Or is it best to modify ./configure to detect it?


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-29  9:03       ` Paolo Bonzini
@ 2015-03-30 16:50         ` Dr. David Alan Gilbert
  2015-03-30 18:22           ` Markus Armbruster
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-30 16:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 29/03/2015 06:07, David Gibson wrote:
> > On Sat, Mar 28, 2015 at 04:30:06PM +0100, Paolo Bonzini wrote:
> >> 
> >> 
> >> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> >>> +            if (err != EAGAIN) {
> >> 
> >> if (err != EAGAIN && err != EWOULDBLOCK)
> > 
> > I assume that's for the benefit of non-Linux hosts?  On Linux
> > EAGAIN == EWOULDBLOCK.
> 
> Yes, that's just the standard idiom in QEMU.  This is generic code, so
> assumptions based on the host platform are not wise. :)

Done; I didn't know of EWOULDBLOCK - and indeed as far as I can tell
most places we only test for EAGAIN.

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-30  8:19                 ` Paolo Bonzini
@ 2015-03-30 17:04                   ` Dr. David Alan Gilbert
  2015-03-30 19:22                     ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-30 17:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 23/03/2015 03:20, David Gibson wrote:
> >>> 1) There's no barrier after the write, so there's no guarantee
> >>> the other thread will eventually see it (in practice we've got
> >>> other pthread ops we take so we will get a barrier somewhere,
> >>> and most CPUs eventually do propagate the store).
> > Sorry, I should have been clearer.  If you need a memory barrier,
> > by all means include a memory barrier.  But that should be
> > explicit: atomic set/read operations often include barriers, but
> > it's not obvious which side will include what barrier.
> 
> Memory barriers are not needed here.  The variable is set
> independently from every other set.  There's no ordering.
> 
> atomic_read/atomic_set do not provide sequential consistency.  That's
> ensured instead by atomic_mb_read/atomic_mb_set (and you're right that
> it's not obvious which side will include barriers, so you have to use
> the two together).
> 
> >>> 2) The read side could legally be optimised out of the loop by
> >>> the compiler. (but in practice won't be because compilers won't
> >>> optimise that far).
> > That one's a trickier question.  Compilers are absolutely capable
> > of optimizing that far, *but* the C rules about when it's allowed
> > to assume in-memory values remain unchanged are pretty
> > conservative.  I think any function call in the loop will require
> > it to reload the value, for example.  That said, a (compiler only)
> > memory barrier might be appropriate to ensure that reload.
> 
> That's exactly what atomic_read provides.

So does that say I need the atomic_read but not the atomic_write -
which seems a bit weird, but I think only due to the naming.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 10/45] Return path: Control commands
  2015-03-28 15:32   ` Paolo Bonzini
@ 2015-03-30 17:34     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-30 17:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy, david

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> > 
> > Add two src->dest commands:
> >    * OPEN_RETURN_PATH - To request that the destination open the return path
> >    * SEND_PING - Request an acknowledge from the destination
> 
> It's just PING, not SEND_PING.

Fixed, thanks.

Dave
> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-28 15:43   ` Paolo Bonzini
@ 2015-03-30 17:46     ` Dr. David Alan Gilbert
  2015-03-30 19:23       ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-30 17:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy, david

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
> > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> > +{
> > +    return atomic_fetch_add(&mis->postcopy_state, 0);
> > +}
> > +
> > +/* Set the state and return the old state */
> > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> > +                                 PostcopyState new_state)
> > +{
> > +    return atomic_xchg(&mis->postcopy_state, new_state);
> > +}
> 
> Which are the (multiple) threads calling these functions?

The main thread receiving the migration and the postcopy ram_listen_thread
receiving the RAM pages.
It's not actually racy between multiple threads updating it,
it's sequenced so that the main thread initialises it and then
hands over to the listen thread that takes it over from that point.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's
  2015-03-30 16:50         ` Dr. David Alan Gilbert
@ 2015-03-30 18:22           ` Markus Armbruster
  0 siblings, 0 replies; 181+ messages in thread
From: Markus Armbruster @ 2015-03-30 18:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah,
	Paolo Bonzini, yanghy, David Gibson

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>> 
>> 
>> On 29/03/2015 06:07, David Gibson wrote:
>> > On Sat, Mar 28, 2015 at 04:30:06PM +0100, Paolo Bonzini wrote:
>> >> 
>> >> 
>> >> On 25/02/2015 17:51, Dr. David Alan Gilbert (git) wrote:
>> >>> +            if (err != EAGAIN) {
>> >> 
>> >> if (err != EAGAIN && err != EWOULDBLOCK)
>> > 
>> > I assume that's for the benefit of non-Linux hosts?  On Linux
>> > EAGAIN == EWOULDBLOCK.
>> 
>> Yes, that's just the standard idiom in QEMU.  This is generic code, so
>> assumptions based on the host platform are not wise. :)
>
> Done; I didn't know of EWOULDBLOCK - and indeed as far as I can tell
> most places we only test for EAGAIN.

That's a bug, unless the place in question is effectively #ifdef __GNU_LIBRARY__
or similar.

https://www.gnu.org/software/libc/manual/html_node/Error-Codes.html#Error-Codes

 -- Macro: int EAGAIN
     Resource temporarily unavailable; the call might work if you try
     again later.  The macro 'EWOULDBLOCK' is another name for 'EAGAIN';
     they are always the same in the GNU C Library.

     This error can happen in a few different situations:

        * An operation that would block was attempted on an object that
          has non-blocking mode selected.  Trying the same operation
          again will block until some external condition makes it
          possible to read, write, or connect (whatever the operation).
          You can use 'select' to find out when the operation will be
          possible; *note Waiting for I/O::.

          *Portability Note:* In many older Unix systems, this condition
          was indicated by 'EWOULDBLOCK', which was a distinct error
          code different from 'EAGAIN'.  To make your program portable,
          you should check for both codes and treat them the same.

        * A temporary resource shortage made an operation impossible.
          'fork' can return this error.  It indicates that the shortage
          is expected to pass, so your program can try the call again
          later and it may succeed.  It is probably a good idea to delay
          for a few seconds before trying it again, to allow time for
          other processes to release scarce resources.  Such shortages
          are usually fairly serious and affect the whole system, so
          usually an interactive program should report the error to the
          user and return to its command loop.

 -- Macro: int EWOULDBLOCK
     In the GNU C Library, this is another name for 'EAGAIN' (above).
     The values are always the same, on every operating system.

     C libraries in many older Unix systems have 'EWOULDBLOCK' as a
     separate error code.
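
A minimal sketch of the portable check this boils down to (illustrative
only, not the patch's actual code):

    #include <errno.h>
    #include <stdbool.h>

    /* EWOULDBLOCK == EAGAIN on glibc, but some older Unices keep them
     * distinct, so generic code tests both. */
    static bool err_is_would_block(int err)
    {
        return err == EAGAIN || err == EWOULDBLOCK;
    }

    /* e.g.   if (len < 0 && !err_is_would_block(errno)) { report and bail } */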


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-30 17:04                   ` Dr. David Alan Gilbert
@ 2015-03-30 19:22                     ` Paolo Bonzini
  2015-03-31 11:21                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30 19:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson



On 30/03/2015 19:04, Dr. David Alan Gilbert wrote:
>>> > > That one's a trickier question.  Compilers are absolutely capable
>>> > > of optimizing that far, *but* the C rules about when it's allowed
>>> > > to assume in-memory values remain unchanged are pretty
>>> > > conservative.  I think any function call in the loop will require
>>> > > it to reload the value, for example.  That said, a (compiler only)
>>> > > memory barrier might be appropriate to ensure that reload.
>> > 
>> > That's exactly what atomic_read provides.
> So does that say I need the atomic_read but not the atomic_write -
> which seems a bit weird, but I think only due to the naming.

No, you need both even though it's even more far-fetched that the
compiler will do something bad with the set.

Paolo


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-30 17:46     ` Dr. David Alan Gilbert
@ 2015-03-30 19:23       ` Paolo Bonzini
  2015-03-31 11:05         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-30 19:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy, david



On 30/03/2015 19:46, Dr. David Alan Gilbert wrote:
>>> > > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
>>> > > +{
>>> > > +    return atomic_fetch_add(&mis->postcopy_state, 0);
>>> > > +}
>>> > > +
>>> > > +/* Set the state and return the old state */
>>> > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
>>> > > +                                 PostcopyState new_state)
>>> > > +{
>>> > > +    return atomic_xchg(&mis->postcopy_state, new_state);
>>> > > +}
>> > 
>> > Which are the (multiple) threads calling these functions?
> The main thread receiving the migration and the postcopy ram_listen_thread
> receiving the RAM pages.
> It's not actually racy between multiple threads updating it,
> it's sequenced so that the main thread initialises it and then
> hands over to the listen thread that takes it over from that point.

I would use atomic_mb_read/atomic_mb_set here.

Paolo


* Re: [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy
  2015-03-30  8:10                 ` Paolo Bonzini
@ 2015-03-31  0:10                   ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-31  0:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, Dr. David Alan Gilbert,
	amit.shah, yanghy


On Mon, Mar 30, 2015 at 10:10:05AM +0200, Paolo Bonzini wrote:
> 
> 
> On 24/03/2015 23:32, David Gibson wrote:
> >>> OK, that one I prefer.  Are you OK with: 
> >>> qemu_savevm_state_complete_precopy calls ->
> >>> save_live_complete_precopy
> >>> 
> >>> qemu_savevm_state_complete_postcopy calls ->
> >>> save_live_complete_postcopy
> >>> 
> >>> ?
> > Sounds ok to me.  Fwiw, I was thinking that both the
> > complete_precopy and complete_postcopy hooks should always be
> > called.  For a non-postcopy migration, the postcopy hooks would
> > just be called immediately after the precopy hooks.
> 
> What about then calling them save_live_after_precopy and
> save_live_complete, or having save_live_before_postcopy and
> save_live_complete?

Good idea.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-30  8:17               ` Paolo Bonzini
@ 2015-03-31  2:23                 ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-03-31  2:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, Dr. David Alan Gilbert,
	amit.shah, yanghy


On Mon, Mar 30, 2015 at 10:17:06AM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/03/2015 10:33, Dr. David Alan Gilbert wrote:
> >> > But still pointless.  Atomicity isn't magic pixie dust; it only makes
> >> > sense if you're making atomic specific operations that need to be.
> >> > Simple integer loads and stores are already atomic.  Unless at least
> >> > some of the atomic operations are something more complex, there's
> >> > really no point to atomically marked operations.
> > OK, I'll kill it off.
> 
> No, don't.
> 
> And both of you, read docs/atomics.txt.

*reads*

> "atomic_read() and atomic_set() prevents the compiler from using
> optimizations that might otherwise optimize accesses out of existence
> on the one hand, or that might create unsolicited accesses on the other.
> [...] it tells readers which variables are shared with
> other threads, and which are local to the current thread or protected
> by other, more mundane means."
> 
> atomic_read() and atomic_set() provide the same guarantees as
> ACCESS_ONCE in the Linux kernel.

Ok, having better understood the semantics and intentions of the
atomic_* functions in qemu, I withdraw my objection.  They do seem
like the right primitives for this situation.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-30 19:23       ` Paolo Bonzini
@ 2015-03-31 11:05         ` Dr. David Alan Gilbert
  2015-03-31 11:10           ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-31 11:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy, david

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 30/03/2015 19:46, Dr. David Alan Gilbert wrote:
> >>> > > +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
> >>> > > +{
> >>> > > +    return atomic_fetch_add(&mis->postcopy_state, 0);
> >>> > > +}
> >>> > > +
> >>> > > +/* Set the state and return the old state */
> >>> > > +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
> >>> > > +                                 PostcopyState new_state)
> >>> > > +{
> >>> > > +    return atomic_xchg(&mis->postcopy_state, new_state);
> >>> > > +}
> >> > 
> >> > Which are the (multiple) threads calling these functions?
> > The main thread receiving the migration and the postcopy ram_listen_thread
> > receiving the RAM pages.
> > It's not actually racy between multiple threads updating it,
> > it's sequenced so that the main thread initialises it and then
> > hands over to the listen thread that takes it over from that point.
> 
> I would use atomic_mb_read/atomic_mb_set here.

The benefit of the atomic_xchg is that I get the old value back so
that I can sanity check I was in the state I thought I should have
been.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages.
  2015-03-31 11:05         ` Dr. David Alan Gilbert
@ 2015-03-31 11:10           ` Paolo Bonzini
  0 siblings, 0 replies; 181+ messages in thread
From: Paolo Bonzini @ 2015-03-31 11:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy, david



On 31/03/2015 13:05, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>
>>
>> On 30/03/2015 19:46, Dr. David Alan Gilbert wrote:
>>>>>>> +PostcopyState  postcopy_state_get(MigrationIncomingState *mis)
>>>>>>> +{
>>>>>>> +    return atomic_fetch_add(&mis->postcopy_state, 0);
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* Set the state and return the old state */
>>>>>>> +PostcopyState postcopy_state_set(MigrationIncomingState *mis,
>>>>>>> +                                 PostcopyState new_state)
>>>>>>> +{
>>>>>>> +    return atomic_xchg(&mis->postcopy_state, new_state);
>>>>>>> +}
>>>>>
>>>>> Which are the (multiple) threads calling these functions?
>>> The main thread receiving the migration and the postcopy ram_listen_thread
>>> receiving the RAM pages.
>>> It's not actually racy between multiple threads updating it,
>>> it's sequenced so that the main thread initialises it and then
>>> hands over to the listen thread that takes it over from that point.
>>
>> I would use atomic_mb_read/atomic_mb_set here.
> 
> The benefit of the atomic_xchg is that I get the old value back so
> that I can sanity check I was in the state I thought I should have
> been.

Ok, you can still pair atomic_xchg with atomic_mb_read, it's a bit cheaper.
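
For illustration, the pairing would look roughly like this (a sketch
based on the functions quoted above; whether this is what the series
finally uses isn't shown here):

    /* getter: barrier'd read of the last published state */
    PostcopyState postcopy_state_get(MigrationIncomingState *mis)
    {
        return atomic_mb_read(&mis->postcopy_state);
    }

    /* setter: atomic exchange, so the old state comes back for sanity checks */
    PostcopyState postcopy_state_set(MigrationIncomingState *mis,
                                     PostcopyState new_state)
    {
        return atomic_xchg(&mis->postcopy_state, new_state);
    }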

Paolo


* Re: [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy
  2015-03-30 19:22                     ` Paolo Bonzini
@ 2015-03-31 11:21                       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-31 11:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, Dr. David Alan Gilbert,
	amit.shah, yanghy, David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 30/03/2015 19:04, Dr. David Alan Gilbert wrote:
> >>> > > That one's a trickier question.  Compilers are absolutely capable
> >>> > > of optimizing that far, *but* the C rules about when it's allowed
> >>> > > to assume in-memory values remain unchanged are pretty
> >>> > > conservative.  I think any function call in the loop will require
> >>> > > it to reload the value, for example.  That said, a (compiler only)
> >>> > > memory barrier might be appropriate to ensure that reload.
> >> > 
> >> > That's exactly what atomic_read provides.
> > So does that say I need the atomic_read but not the atomic_write -
> > which seems a bit weird, but I think only due to the naming.
> 
> No, you need both even though it's even more far-fetched that the
> compiler will do something bad with the set.

OK, done - it's back to where it was with atomic_set/atomic_read.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-03-23  2:37       ` David Gibson
@ 2015-04-01 15:14         ` Dr. David Alan Gilbert
  2015-04-07  3:07           ` David Gibson
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-01 15:14 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Fri, Mar 20, 2015 at 06:17:31PM +0000, Dr. David Alan Gilbert wrote:
> > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > On Wed, Feb 25, 2015 at 04:51:35PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Open a return path, and handle messages that are received upon it.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >  include/migration/migration.h |   8 ++
> > > >  migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
> > > >  trace-events                  |  13 +++
> > > >  3 files changed, 198 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > index 6775747..5242ead 100644
> > > > --- a/include/migration/migration.h
> > > > +++ b/include/migration/migration.h
> > > > @@ -73,6 +73,14 @@ struct MigrationState
> > > >  
> > > >      int state;
> > > >      MigrationParams params;
> > > > +
> > > > +    /* State related to return path */
> > > > +    struct {
> > > > +        QEMUFile     *file;
> > > > +        QemuThread    rp_thread;
> > > > +        bool          error;
> > > > +    } rp_state;
> > > > +
> > > >      double mbps;
> > > >      int64_t total_time;
> > > >      int64_t downtime;
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 80d234c..34cd4fe 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
> > > >      return head;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Return true if we're already in the middle of a migration
> > > > + * (i.e. any of the active or setup states)
> > > > + */
> > > > +static bool migration_already_active(MigrationState *ms)
> > > > +{
> > > > +    switch (ms->state) {
> > > > +    case MIG_STATE_ACTIVE:
> > > > +    case MIG_STATE_SETUP:
> > > > +        return true;
> > > > +
> > > > +    default:
> > > > +        return false;
> > > > +
> > > > +    }
> > > > +}
> > > > +
> > > >  static void get_xbzrle_cache_stats(MigrationInfo *info)
> > > >  {
> > > >      if (migrate_use_xbzrle()) {
> > > > @@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> > > >      }
> > > >  }
> > > >  
> > > > +static void migrate_fd_cleanup_src_rp(MigrationState *ms)
> > > > +{
> > > > +    QEMUFile *rp = ms->rp_state.file;
> > > > +
> > > > +    /*
> > > > +     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
> > > > +     * cleaned up from a few threads; make sure not to do it twice in parallel
> > > > +     */
> > > > +    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);
> > > 
> > > A cmpxchg seems dangerously subtle for such a basic and infrequent
> > > operation, but ok.
> > 
> > I'll take other suggestions; but I'm trying to just do
> > 'if the qemu_file still exists close it', and it didn't seem
> > worth introducing another state variable to atomically update
> > when we've already got the file pointer itself.
> 
> Yes, I see the rationale.  My concern is just that the more atomicity
> mechanisms are scattered through the code, the harder it is to analyze
> > and be sure you haven't missed race cases (or introduced them with a
> future change).
> 
> In short, I prefer to see a simple-as-possible, and preferably
> documented, consistent overall concurrency scheme for a data
> structure, rather than scattered atomic ops for various variable where
> it's difficult to see how all the pieces might relate together.
> 
> > > > +    if (rp) {
> > > > +        trace_migrate_fd_cleanup_src_rp();
> > > > +        qemu_fclose(rp);
> > > > +    }
> > > > +}
> > > > +
> > > >  static void migrate_fd_cleanup(void *opaque)
> > > >  {
> > > >      MigrationState *s = opaque;
> > > > @@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
> > > >      qemu_bh_delete(s->cleanup_bh);
> > > >      s->cleanup_bh = NULL;
> > > >  
> > > > +    migrate_fd_cleanup_src_rp(s);
> > > > +
> > > >      if (s->file) {
> > > >          trace_migrate_fd_cleanup();
> > > >          qemu_mutex_unlock_iothread();
> > > > @@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
> > > >      QEMUFile *f = migrate_get_current()->file;
> > > >      trace_migrate_fd_cancel();
> > > >  
> > > > +    if (s->rp_state.file) {
> > > > +        /* shutdown the rp socket, so causing the rp thread to shutdown */
> > > > +        qemu_file_shutdown(s->rp_state.file);
> > > 
> > > I missed where qemu_file_shutdown() was implemented.  Does this
> > > introduce a leftover socket dependency?
> > 
> > No, it shouldn't.  The shutdown() causes a shutdown(2) syscall to
> > be issued on the socket stopping anything blocking on it; it then
> > gets closed at the end after the rp thread has exited.
> 
> 
> Sorry, that's not what I meant.  I mean is this a hole in the
> abstraction of the QemuFile, because it assumes that what you're
> dealing with here is indeed a socket, rather than something else?

It's just a dependency that we have a shutdown method on the qemu_file
we're using; if it's not a socket then whatever it is, if we're going
to use it for a rp then it needs to implement something equivalent.

> > > > +    }
> > > > +
> > > >      do {
> > > >          old_state = s->state;
> > > >          if (old_state != MIG_STATE_SETUP && old_state != MIG_STATE_ACTIVE) {
> > > > @@ -658,8 +697,145 @@ int64_t migrate_xbzrle_cache_size(void)
> > > >      return s->xbzrle_cache_size;
> > > >  }
> > > >  
> > > > -/* migration thread support */
> > > > +/*
> > > > + * Something bad happened to the RP stream, mark an error
> > > > + * The caller shall print something to indicate why
> > > > + */
> > > > +static void source_return_path_bad(MigrationState *s)
> > > > +{
> > > > +    s->rp_state.error = true;
> > > > +    migrate_fd_cleanup_src_rp(s);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Handles messages sent on the return path towards the source VM
> > > > + *
> > > > + */
> > > > +static void *source_return_path_thread(void *opaque)
> > > > +{
> > > > +    MigrationState *ms = opaque;
> > > > +    QEMUFile *rp = ms->rp_state.file;
> > > > +    uint16_t expected_len, header_len, header_com;
> > > > +    const int max_len = 512;
> > > > +    uint8_t buf[max_len];
> > > > +    uint32_t tmp32;
> > > > +    int res;
> > > > +
> > > > +    trace_source_return_path_thread_entry();
> > > > +    while (rp && !qemu_file_get_error(rp) &&
> > > > +        migration_already_active(ms)) {
> > > > +        trace_source_return_path_thread_loop_top();
> > > > +        header_com = qemu_get_be16(rp);
> > > > +        header_len = qemu_get_be16(rp);
> > > > +
> > > > +        switch (header_com) {
> > > > +        case MIG_RP_CMD_SHUT:
> > > > +        case MIG_RP_CMD_PONG:
> > > > +            expected_len = 4;
> > > 
> > > Could the knowledge of expected lengths be folded into the switch
> > > below?  Switching twice on the same thing is a bit icky.
> > 
> > No, because the length at this point is used to validate the
> > length field in the header prior to reading the body.
> > The other switch processes the contents of the body that
> > have been read.
> 
> Ok.
> 
> > > > +            break;
> > > > +
> > > > +        default:
> > > > +            error_report("RP: Received invalid cmd 0x%04x length 0x%04x",
> > > > +                    header_com, header_len);
> > > > +            source_return_path_bad(ms);
> > > > +            goto out;
> > > > +        }
> > > >  
> > > > +        if (header_len > expected_len) {
> > > > +            error_report("RP: Received command 0x%04x with"
> > > > +                    "incorrect length %d expecting %d",
> > > > +                    header_com, header_len,
> > > > +                    expected_len);
> > > > +            source_return_path_bad(ms);
> > > > +            goto out;
> > > > +        }
> > > > +
> > > > +        /* We know we've got a valid header by this point */
> > > > +        res = qemu_get_buffer(rp, buf, header_len);
> > > > +        if (res != header_len) {
> > > > +            trace_source_return_path_thread_failed_read_cmd_data();
> > > > +            source_return_path_bad(ms);
> > > > +            goto out;
> > > > +        }
> > > > +
> > > > +        /* OK, we have the command and the data */
> > > > +        switch (header_com) {
> > > > +        case MIG_RP_CMD_SHUT:
> > > > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > > > +            trace_source_return_path_thread_shut(tmp32);
> > > > +            if (tmp32) {
> > > > +                error_report("RP: Sibling indicated error %d", tmp32);
> > > > +                source_return_path_bad(ms);
> > > > +            }
> > > > +            /*
> > > > +             * We'll let the main thread deal with closing the RP
> > > > +             * we could do a shutdown(2) on it, but we're the only user
> > > > +             * anyway, so there's nothing gained.
> > > > +             */
> > > > +            goto out;
> > > > +
> > > > +        case MIG_RP_CMD_PONG:
> > > > +            tmp32 = be32_to_cpup((uint32_t *)buf);
> > > > +            trace_source_return_path_thread_pong(tmp32);
> > > > +            break;
> > > > +
> > > > +        default:
> > > > +            /* This shouldn't happen because we should catch this above */
> > > > +            trace_source_return_path_bad_header_com();
> > > > +        }
> > > > +        /* Latest command processed, now leave a gap for the next one */
> > > > +        header_com = MIG_RP_CMD_INVALID;
> > > 
> > > This assignment will always get overwritten.
> > 
> > Thanks; gone - it's a left over from an old version.
> > 
> > > > +    }
> > > > +    if (rp && qemu_file_get_error(rp)) {
> > > > +        trace_source_return_path_thread_bad_end();
> > > > +        source_return_path_bad(ms);
> > > > +    }
> > > > +
> > > > +    trace_source_return_path_thread_end();
> > > > +out:
> > > > +    return NULL;
> > > > +}
> > > > +
> > > > +__attribute__ (( unused )) /* Until later in patch series */
> > > > +static int open_outgoing_return_path(MigrationState *ms)
> > > 
> > > Uh.. surely this should be open_incoming_return_path(); it's designed
> > > to be used on the source side, AFAICT.
> > > 
> > > > +{
> > > > +
> > > > +    ms->rp_state.file = qemu_file_get_return_path(ms->file);
> > > > +    if (!ms->rp_state.file) {
> > > > +        return -1;
> > > > +    }
> > > > +
> > > > +    trace_open_outgoing_return_path();
> > > > +    qemu_thread_create(&ms->rp_state.rp_thread, "return path",
> > > > +                       source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
> > > > +
> > > > +    trace_open_outgoing_return_path_continue();
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +__attribute__ (( unused )) /* Until later in patch series */
> > > > +static void await_outgoing_return_path_close(MigrationState *ms)
> > > 
> > > Likewise "incoming" here, surely.
> > 
> > I've changed those two  to open_source_return_path()  which seems less ambiguous;
> > that OK?
> 
> Uh.. not really, it just moves the ambiguity to a different place (is
> "source return path" the return path *on* the source or *to* the
> source).
> 
> Perhaps "open_return_path_on_source" and
> "await_return_path_close_on_source"?  I'm not particularly fond of
> those, but they're the best I've come up with yet.

Done.

Dave

> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path
  2015-04-01 15:14         ` Dr. David Alan Gilbert
@ 2015-04-07  3:07           ` David Gibson
  0 siblings, 0 replies; 181+ messages in thread
From: David Gibson @ 2015-04-07  3:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy


On Wed, Apr 01, 2015 at 04:14:05PM +0100, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Fri, Mar 20, 2015 at 06:17:31PM +0000, Dr. David Alan Gilbert wrote:
> > > * David Gibson (david@gibson.dropbear.id.au) wrote:
> > > > On Wed, Feb 25, 2015 at 04:51:35PM +0000, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > Open a return path, and handle messages that are received upon it.
> > > > > 
> > > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > ---
> > > > >  include/migration/migration.h |   8 ++
> > > > >  migration/migration.c         | 178 +++++++++++++++++++++++++++++++++++++++++-
> > > > >  trace-events                  |  13 +++
> > > > >  3 files changed, 198 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/include/migration/migration.h b/include/migration/migration.h
> > > > > index 6775747..5242ead 100644
> > > > > --- a/include/migration/migration.h
> > > > > +++ b/include/migration/migration.h
> > > > > @@ -73,6 +73,14 @@ struct MigrationState
> > > > >  
> > > > >      int state;
> > > > >      MigrationParams params;
> > > > > +
> > > > > +    /* State related to return path */
> > > > > +    struct {
> > > > > +        QEMUFile     *file;
> > > > > +        QemuThread    rp_thread;
> > > > > +        bool          error;
> > > > > +    } rp_state;
> > > > > +
> > > > >      double mbps;
> > > > >      int64_t total_time;
> > > > >      int64_t downtime;
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index 80d234c..34cd4fe 100644
> > > > > --- a/migration/migration.c
> > > > > +++ b/migration/migration.c
> > > > > @@ -237,6 +237,23 @@ MigrationCapabilityStatusList *qmp_query_migrate_capabilities(Error **errp)
> > > > >      return head;
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * Return true if we're already in the middle of a migration
> > > > > + * (i.e. any of the active or setup states)
> > > > > + */
> > > > > +static bool migration_already_active(MigrationState *ms)
> > > > > +{
> > > > > +    switch (ms->state) {
> > > > > +    case MIG_STATE_ACTIVE:
> > > > > +    case MIG_STATE_SETUP:
> > > > > +        return true;
> > > > > +
> > > > > +    default:
> > > > > +        return false;
> > > > > +
> > > > > +    }
> > > > > +}
> > > > > +
> > > > >  static void get_xbzrle_cache_stats(MigrationInfo *info)
> > > > >  {
> > > > >      if (migrate_use_xbzrle()) {
> > > > > @@ -362,6 +379,21 @@ static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> > > > >      }
> > > > >  }
> > > > >  
> > > > > +static void migrate_fd_cleanup_src_rp(MigrationState *ms)
> > > > > +{
> > > > > +    QEMUFile *rp = ms->rp_state.file;
> > > > > +
> > > > > +    /*
> > > > > +     * When stuff goes wrong (e.g. failing destination) on the rp, it can get
> > > > > +     * cleaned up from a few threads; make sure not to do it twice in parallel
> > > > > +     */
> > > > > +    rp = atomic_cmpxchg(&ms->rp_state.file, rp, NULL);
> > > > 
> > > > A cmpxchg seems dangerously subtle for such a basic and infrequent
> > > > operation, but ok.
> > > 
> > > I'll take other suggestions; but I'm trying to just do
> > > 'if the qemu_file still exists close it', and it didn't seem
> > > worth introducing another state variable to atomically update
> > > when we've already got the file pointer itself.
> > 
> > Yes, I see the rationale.  My concern is just that the more atomicity
> > mechanisms are scattered through the code, the harder it is to analyze
> > and be sure you haven't missed race cases (or introduced then with a
> > future change).
> > 
> > In short, I prefer to see a simple-as-possible, and preferably
> > documented, consistent overall concurrency scheme for a data
> > structure, rather than scattered atomic ops for various variables where
> > it's difficult to see how all the pieces might relate together.
> > 
> > > > > +    if (rp) {
> > > > > +        trace_migrate_fd_cleanup_src_rp();
> > > > > +        qemu_fclose(rp);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > >  static void migrate_fd_cleanup(void *opaque)
> > > > >  {
> > > > >      MigrationState *s = opaque;
> > > > > @@ -369,6 +401,8 @@ static void migrate_fd_cleanup(void *opaque)
> > > > >      qemu_bh_delete(s->cleanup_bh);
> > > > >      s->cleanup_bh = NULL;
> > > > >  
> > > > > +    migrate_fd_cleanup_src_rp(s);
> > > > > +
> > > > >      if (s->file) {
> > > > >          trace_migrate_fd_cleanup();
> > > > >          qemu_mutex_unlock_iothread();
> > > > > @@ -406,6 +440,11 @@ static void migrate_fd_cancel(MigrationState *s)
> > > > >      QEMUFile *f = migrate_get_current()->file;
> > > > >      trace_migrate_fd_cancel();
> > > > >  
> > > > > +    if (s->rp_state.file) {
> > > > > +        /* shutdown the rp socket, so causing the rp thread to shutdown */
> > > > > +        qemu_file_shutdown(s->rp_state.file);
> > > > 
> > > > I missed where qemu_file_shutdown() was implemented.  Does this
> > > > introduce a leftover socket dependency?
> > > 
> > > No, it shouldn't.  The shutdown() causes a shutdown(2) syscall to
> > > be issued on the socket stopping anything blocking on it; it then
> > > gets closed at the end after the rp thread has exited.
> > 
> > 
> > Sorry, that's not what I meant.  I mean is this a hole in the
> > abstraction of the QemuFile, because it assumes that what you're
> > dealing with here is indeed a socket, rather than something else?
> 
> It's just a dependency that we have a shutdown method on the qemu_file
> we're using; if it's not a socket then whatever it is, if we're going
> to use it for a rp then it needs to implement something equivalent.

Um, yeah, except I don't think most file types really have an
operation that's semantically similar to socket shutdown.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-03-30  8:31       ` Paolo Bonzini
@ 2015-04-13 11:35         ` Dr. David Alan Gilbert
  2015-04-13 13:26           ` Paolo Bonzini
  0 siblings, 1 reply; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-13 11:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 26/03/2015 12:05, Dr. David Alan Gilbert wrote:
> >>> > > +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> >>> > > +    *old_vm_running = runstate_is_running();
> >> > 
> >> > I think that needs some explanation.  Why are you doing a wakeup on
> >> > the source host?
> > This matches the existing code in migration_thread for the end of precopy;
> > Paolo's explanation of what it does is here:
> > https://lists.gnu.org/archive/html/qemu-devel/2014-08/msg04880.html
> 
> The more I look at it, the more I'm convinced it's working by chance or
> not working at all.

Do you mean in general or in the postcopy case?

> Here we probably need to do only the notifier_list_notify +
> qapi_event_send_wakeup.

Do you mean a :
   wakeup_reason = QEMU_WAKEUP_REASON_OTHER;
   notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
   wakeup_reason = QEMU_WAKEUP_REASON_NONE;
   qapi_event_send_wakeup(&error);

which I guess would need wrapping up in vl.c

(It's not really clear to me what this stuff does even with
your previous explanation; if it's to do with migrating
something suspended-to-ram I guess a postcopy is possible
but it doesn't seem that sensible to use postcopy for a
machine with a CPU that isn't running).

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-04-13 11:35         ` Dr. David Alan Gilbert
@ 2015-04-13 13:26           ` Paolo Bonzini
  2015-04-13 14:58             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 181+ messages in thread
From: Paolo Bonzini @ 2015-04-13 13:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson



On 13/04/2015 13:35, Dr. David Alan Gilbert wrote:
>>>>>>> +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
>>>>>>> > >>> > > +    *old_vm_running = runstate_is_running();
>>>>> > >> > 
>>>>> > >> > I think that needs some explanation.  Why are you doing a wakeup on
>>>>> > >> > the source host?
>>> > > This matches the existing code in migration_thread for the end of precopy;
>>> > > Paolo's explanation of what it does is here:
>>> > > https://lists.gnu.org/archive/html/qemu-devel/2014-08/msg04880.html
>> > 
>> > The more I look at it, the more I'm convinced it's working by chance or
>> > not working at all.
> Do you mean in general or in the postcopy case?

In general.

> > Here we probably need to do only the notifier_list_notify +
> > qapi_event_send_wakeup.
>
> Do you mean a :
>    wakeup_reason = QEMU_WAKEUP_REASON_OTHER;
>    notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
>    wakeup_reason = QEMU_WAKEUP_REASON_NONE;
>    qapi_event_send_wakeup(&error);
> 
> which I guess would need wrapping up in vl.c

Yes.

> (It's not really clear to me what this stuff does even with
> your previous explanation; if it's to do with migrating
> something suspended-to-ram I guess a postcopy is possible
> but it doesn't seem that sensible to use postcopy for a
> machine with a CPU that isn't running).

Yes, but management does not know in advance that the machine will
suspend to RAM.  It's basically a race.

Paolo


* Re: [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread
  2015-04-13 13:26           ` Paolo Bonzini
@ 2015-04-13 14:58             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-13 14:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 13/04/2015 13:35, Dr. David Alan Gilbert wrote:
> >>>>>>> +    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> >>>>>>> > >>> > > +    *old_vm_running = runstate_is_running();
> >>>>> > >> > 
> >>>>> > >> > I think that needs some explanation.  Why are you doing a wakeup on
> >>>>> > >> > the source host?
> >>> > > This matches the existing code in migration_thread for the end of precopy;
> >>> > > Paolo's explanation of what it does is here:
> >>> > > https://lists.gnu.org/archive/html/qemu-devel/2014-08/msg04880.html
> >> > 
> >> > The more I look at it, the more I'm convinced it's working by chance or
> >> > not working at all.
> > Do you mean in general or in the postcopy case?
> 
> In general.

OK, then lets take this off this thread and fix it generally
and just make sure postcopy is consistent with whatever that
change is.

> > > Here we probably need to do only the notifier_list_notify +
> > > qapi_event_send_wakeup.
> >
> > Do you mean a :
> >    wakeup_reason = QEMU_WAKEUP_REASON_OTHER;
> >    notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
> >    wakeup_reason = QEMU_WAKEUP_REASON_NONE;
> >    qapi_event_send_wakeup(&error);
> > 
> > which I guess would need wrapping up in vl.c
> 
> Yes.
> 
> > (It's not really clear to me what this stuff does even with
> > your previous explanation; if it's to do with migrating
> > something suspended-to-ram I guess a postcopy is possible
> > but it doesn't seem that sensible to use postcopy for a
> > machine with a CPU that isn't running).
> 
> Yes, but management does not know in advance that the machine will
> suspend to RAM.  It's basically a race.

OK, reasonable.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops
  2015-03-12  6:11   ` David Gibson
@ 2015-04-14 12:04     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-04-14 12:04 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini, yanghy

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:38PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Postcopy needs to have two migration streams loading concurrently;
> > one from memory (with the device state) and the other from the fd
> > with the memory transactions.
> > 
> > Split the core of qemu_loadvm_state out so we can use it for both.
> > 
> > Allow the inner loadvm loop to quit and signal whether the parent
> > should.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  savevm.c     | 106 ++++++++++++++++++++++++++++++++++++-----------------------
> >  trace-events |   4 +++
> >  2 files changed, 69 insertions(+), 41 deletions(-)
> > 
> > diff --git a/savevm.c b/savevm.c
> > index f42713d..4b619da 100644
> > --- a/savevm.c
> > +++ b/savevm.c
> > @@ -951,6 +951,16 @@ static SaveStateEntry *find_se(const char *idstr, int instance_id)
> >      return NULL;
> >  }
> >  
> > +/* ORable flags that control the (potentially nested) loadvm_state loops */
> > +enum LoadVMExitCodes {
> > +    /* Quit the loop level that received this command */
> > +    LOADVM_QUIT_LOOP     =  1,
> > +    /* Quit this loop and our parent */
> > +    LOADVM_QUIT_PARENT   =  2,
> > +};
> 
> The semantics of all the exit code stuff is doing my head in; I'm not
> sure how to make it more comprehensible.

I've managed to kill one of the states off; it's now a single flag that quits
all levels; much easier to understand.
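
A sketch of what the simplified scheme looks like (illustrative; the
enum mirrors the one earlier in the patch, reduced to one value):

    /* Control of the (potentially nested) loadvm_state loops */
    enum LoadVMExitCodes {
        /* Quit the loop this command was received in, and all enclosing loops */
        LOADVM_QUIT = 1,
    };

    /* in the load loop, at any level: */
    ret = loadvm_process_command(f);
    if (ret < 0 || ret == LOADVM_QUIT) {
        return ret;    /* the single flag propagates straight out of every level */
    }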

> > +static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > +
> >  static int loadvm_process_command_simple_lencheck(const char *name,
> >                                                    unsigned int actual,
> >                                                    unsigned int expected)
> > @@ -967,6 +977,8 @@ static int loadvm_process_command_simple_lencheck(const char *name,
> >  /*
> >   * Process an incoming 'QEMU_VM_COMMAND'
> >   * negative return on error (will issue error message)
> > + * 0   just a normal return
> > + * 1   All good, but exit the loop
> 
> This should probably also mention the possibility of negative returns
> for errors.

'negative return on error' it says there; I've made that clearer now
so it now has a '<0'  line.

> Am I correct in thinking that at this point the function never returns
> 1?  I'm assuming later patches in the series change that.

Yes, it's one of the commands that now returns that flag.
(Actually it was the LOADVM_QUIT_* enum, not '1')

> Maybe I'm missing something in my mental model here, but tying the
> duration of the containing loop to execution of specific commands
> seems problematic.  What's the circumstance in which it makes sense
> for a command to indicate that the rest of the packaged data should be
> essentially ignored

Yes, the only case we actually care about here is the postcopy one:
when we are reading from the package we stop the main loop from ever
reading from the main fd again, since the listen thread takes that over.

> >   */
> >  static int loadvm_process_command(QEMUFile *f)
> >  {
> > @@ -1036,36 +1048,13 @@ void loadvm_free_handlers(MigrationIncomingState *mis)
> >      }
> >  }
> >  
> > -int qemu_loadvm_state(QEMUFile *f)
> > +static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> >  {
> > -    MigrationIncomingState *mis = migration_incoming_get_current();
> > -    Error *local_err = NULL;
> >      uint8_t section_type;
> > -    unsigned int v;
> >      int ret;
> > +    int exitcode = 0;
> >  
> > -    if (qemu_savevm_state_blocked(&local_err)) {
> > -        error_report("%s", error_get_pretty(local_err));
> > -        error_free(local_err);
> > -        return -EINVAL;
> > -    }
> > -
> > -    v = qemu_get_be32(f);
> > -    if (v != QEMU_VM_FILE_MAGIC) {
> > -        error_report("Not a migration stream");
> > -        return -EINVAL;
> > -    }
> > -
> > -    v = qemu_get_be32(f);
> > -    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
> > -        error_report("SaveVM v2 format is obsolete and don't work anymore");
> > -        return -ENOTSUP;
> > -    }
> > -    if (v != QEMU_VM_FILE_VERSION) {
> > -        error_report("Unsupported migration stream version");
> > -        return -ENOTSUP;
> > -    }
> > -
> > +    trace_qemu_loadvm_state_main();
> >      while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
> >          uint32_t instance_id, version_id, section_id;
> >          SaveStateEntry *se;
> > @@ -1093,16 +1082,14 @@ int qemu_loadvm_state(QEMUFile *f)
> >              if (se == NULL) {
> >                  error_report("Unknown savevm section or instance '%s' %d",
> >                               idstr, instance_id);
> > -                ret = -EINVAL;
> > -                goto out;
> > +                return -EINVAL;
> >              }
> >  
> >              /* Validate version */
> >              if (version_id > se->version_id) {
> >                  error_report("savevm: unsupported version %d for '%s' v%d",
> >                               version_id, idstr, se->version_id);
> > -                ret = -EINVAL;
> > -                goto out;
> > +                return -EINVAL;
> >              }
> >  
> >              /* Add entry */
> > @@ -1117,7 +1104,7 @@ int qemu_loadvm_state(QEMUFile *f)
> >              if (ret < 0) {
> >                  error_report("error while loading state for instance 0x%x of"
> >                               " device '%s'", instance_id, idstr);
> > -                goto out;
> > +                return ret;
> >              }
> >              break;
> >          case QEMU_VM_SECTION_PART:
> > @@ -1132,36 +1119,73 @@ int qemu_loadvm_state(QEMUFile *f)
> >              }
> >              if (le == NULL) {
> >                  error_report("Unknown savevm section %d", section_id);
> > -                ret = -EINVAL;
> > -                goto out;
> > +                return -EINVAL;
> >              }
> >  
> >              ret = vmstate_load(f, le->se, le->version_id);
> >              if (ret < 0) {
> >                  error_report("error while loading state section id %d(%s)",
> >                               section_id, le->se->idstr);
> > -                goto out;
> > +                return ret;
> >              }
> >              break;
> >          case QEMU_VM_COMMAND:
> >              ret = loadvm_process_command(f);
> > -            if (ret < 0) {
> > -                goto out;
> > +            trace_qemu_loadvm_state_section_command(ret);
> > +            if ((ret < 0) || (ret & LOADVM_QUIT_LOOP)) {
> > +                return ret;
> >              }
> > +            exitcode |= ret; /* Lets us pass flags up to the parent */
> >              break;
> >          default:
> >              error_report("Unknown savevm section type %d", section_type);
> > -            ret = -EINVAL;
> > -            goto out;
> > +            return -EINVAL;
> >          }
> >      }
> >  
> > -    cpu_synchronize_all_post_init();
> > +    if (exitcode & LOADVM_QUIT_PARENT) {
> > +        trace_qemu_loadvm_state_main_quit_parent();
> > +        exitcode &= ~LOADVM_QUIT_PARENT;
> > +        exitcode |= LOADVM_QUIT_LOOP;
> > +    }
> 
> So, if I'm following properly, putting a QUIT_PARENT will cause this
> loop to exit, also returning QUIT_LOOP, so the next loop out also
> quits.  If there were a third loop beyond that, it wouldn't quit.
> 
> But are those really the semantics you want, or do you want the
> options to be "quit one level" and "quit all levels", which seems a
> little bit simpler?  In the current plans you only have the two levels
> so they're equivalent.

Right, that's all gone - there's now just a 'quit all levels' flag.

Dave

> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue
  2015-03-24  2:15   ` David Gibson
@ 2015-06-16 10:48     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-06-16 10:48 UTC (permalink / raw)
  To: David Gibson
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, pbonzini

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:57PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > When transmitting RAM pages, consume pages that have been queued by
> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
> > 
> > Note:
> >   a) After a queued page the linear walk carries on from after the
> > unqueued page; there is a reasonable chance that the destination
> > was about to ask for other closeby pages anyway.
> > 
> >   b) We have to be careful of any assumptions that the page walking
> > code makes, in particular it does some short cuts on its first linear
> > walk that break as soon as we do a queued page.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  arch_init.c  | 154 +++++++++++++++++++++++++++++++++++++++++++++++++----------
> >  trace-events |   2 +
> >  2 files changed, 131 insertions(+), 25 deletions(-)
> > 
> > diff --git a/arch_init.c b/arch_init.c
> > index 9d8fc6b..acf65e1 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -328,6 +328,7 @@ static RAMBlock *last_seen_block;
> >  /* This is the last block from where we have sent data */
> >  static RAMBlock *last_sent_block;
> >  static ram_addr_t last_offset;
> > +static bool last_was_from_queue;
> >  static unsigned long *migration_bitmap;
> >  static uint64_t migration_dirty_pages;
> >  static uint32_t last_version;
> > @@ -461,6 +462,19 @@ static inline bool migration_bitmap_set_dirty(ram_addr_t addr)
> >      return ret;
> >  }
> >  
> > +static inline bool migration_bitmap_clear_dirty(ram_addr_t addr)
> > +{
> > +    bool ret;
> > +    int nr = addr >> TARGET_PAGE_BITS;
> > +
> > +    ret = test_and_clear_bit(nr, migration_bitmap);
> > +
> > +    if (ret) {
> > +        migration_dirty_pages--;
> > +    }
> > +    return ret;
> > +}
> > +
> >  static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
> >  {
> >      ram_addr_t addr;
> > @@ -669,6 +683,39 @@ static int ram_save_page(QEMUFile *f, RAMBlock* block, ram_addr_t offset,
> >  }
> >  
> >  /*
> > + * Unqueue a page from the queue fed by postcopy page requests
> > + *
> > + * Returns:      The RAMBlock* to transmit from (or NULL if the queue is empty)
> > + *      ms:      MigrationState in
> > + *  offset:      the byte offset within the RAMBlock for the start of the page
> > + * ram_addr_abs: global offset in the dirty/sent bitmaps
> > + */
> > +static RAMBlock *ram_save_unqueue_page(MigrationState *ms, ram_addr_t *offset,
> > +                                       ram_addr_t *ram_addr_abs)
> > +{
> > +    RAMBlock *result = NULL;
> > +    qemu_mutex_lock(&ms->src_page_req_mutex);
> > +    if (!QSIMPLEQ_EMPTY(&ms->src_page_requests)) {
> > +        struct MigrationSrcPageRequest *entry =
> > +                                    QSIMPLEQ_FIRST(&ms->src_page_requests);
> > +        result = entry->rb;
> > +        *offset = entry->offset;
> > +        *ram_addr_abs = (entry->offset + entry->rb->offset) & TARGET_PAGE_MASK;
> > +
> > +        if (entry->len > TARGET_PAGE_SIZE) {
> > +            entry->len -= TARGET_PAGE_SIZE;
> > +            entry->offset += TARGET_PAGE_SIZE;
> > +        } else {
> > +            QSIMPLEQ_REMOVE_HEAD(&ms->src_page_requests, next_req);
> > +            g_free(entry);
> > +        }
> > +    }
> > +    qemu_mutex_unlock(&ms->src_page_req_mutex);
> > +
> > +    return result;
> > +}
> > +
> > +/*
> >   * Queue the pages for transmission, e.g. a request from postcopy destination
> >   *   ms: MigrationStatus in which the queue is held
> >   *   rbname: The RAMBlock the request is for - may be NULL (to mean reuse last)
> > @@ -732,46 +779,102 @@ int ram_save_queue_pages(MigrationState *ms, const char *rbname,
> >  
> >  static int ram_find_and_save_block(QEMUFile *f, bool last_stage)
> >  {
> > +    MigrationState *ms = migrate_get_current();
> >      RAMBlock *block = last_seen_block;
> > +    RAMBlock *tmpblock;
> >      ram_addr_t offset = last_offset;
> > +    ram_addr_t tmpoffset;
> >      bool complete_round = false;
> >      int bytes_sent = 0;
> > -    MemoryRegion *mr;
> >      ram_addr_t dirty_ram_abs; /* Address of the start of the dirty page in
> >                                   ram_addr_t space */
> > +    unsigned long hps = sysconf(_SC_PAGESIZE);
> >  
> > -    if (!block)
> > +    if (!block) {
> >          block = QTAILQ_FIRST(&ram_list.blocks);
> > +        last_was_from_queue = false;
> > +    }
> >  
> > -    while (true) {
> > -        mr = block->mr;
> > -        offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> > -                                                       &dirty_ram_abs);
> > -        if (complete_round && block == last_seen_block &&
> > -            offset >= last_offset) {
> > -            break;
> > +    while (true) { /* Until we send a block or run out of stuff to send */
> > +        tmpblock = NULL;
> > +
> > +        /*
> > +         * Don't break host-page chunks up with queue items
> > +         * so only unqueue if,
> > +         *   a) The last item came from the queue anyway
> > +         *   b) The last sent item was the last target-page in a host page
> > +         */
> > +        if (last_was_from_queue || !last_sent_block ||
> > +            ((last_offset & (hps - 1)) == (hps - TARGET_PAGE_SIZE))) {
> > +            tmpblock = ram_save_unqueue_page(ms, &tmpoffset, &dirty_ram_abs);
> 
> This test for whether we've completed a host page is pretty awkward.
> I think it would be cleaner to have an inner loop / helper function
> that completes sending an entire host page (whether requested or
> scanned), before allowing the outer loop to even come back to here to
> reconsider the queue.

I've reworked that in the v7 series I've just posted; please see if it's
more to your taste (I've not tested it on a machine with bigger pages
yet though).  (This patch is 32/42 in v7).
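
Something along the lines David suggested would look like this (an
illustrative sketch only - the helper name and the details here are
made up for the example, not the v7 code):

    /* Send the rest of the host page containing *offset, one target
     * page at a time, before the caller looks at the request queue
     * again. */
    static int ram_save_host_page(QEMUFile *f, RAMBlock *block,
                                  ram_addr_t *offset, bool last_stage)
    {
        unsigned long hps = sysconf(_SC_PAGESIZE);
        int total_sent = 0;

        do {
            int sent = ram_save_page(f, block, *offset, last_stage);

            if (sent < 0) {
                return sent;
            }
            total_sent += sent;
            *offset += TARGET_PAGE_SIZE;
        } while (*offset & (hps - 1)); /* stop at the host-page boundary */

        return total_sent;
    }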

Dave

> 
> >          }
> > -        if (offset >= block->used_length) {
> > -            offset = 0;
> > -            block = QTAILQ_NEXT(block, next);
> > -            if (!block) {
> > -                block = QTAILQ_FIRST(&ram_list.blocks);
> > -                complete_round = true;
> > -                ram_bulk_stage = false;
> > +
> > +        if (tmpblock) {
> > +            /* We've got a block from the postcopy queue */
> > +            trace_ram_find_and_save_block_postcopy(tmpblock->idstr,
> > +                                                   (uint64_t)tmpoffset,
> > +                                                   (uint64_t)dirty_ram_abs);
> > +            /*
> > +             * We're sending this page, and since it's postcopy nothing else
> > +             * will dirty it, and we must make sure it doesn't get sent again.
> > +             */
> > +            if (!migration_bitmap_clear_dirty(dirty_ram_abs)) {
> > +                trace_ram_find_and_save_block_postcopy_not_dirty(
> > +                    tmpblock->idstr, (uint64_t)tmpoffset,
> > +                    (uint64_t)dirty_ram_abs,
> > +                    test_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap));
> > +
> > +                continue;
> >              }
> > +            /*
> > +             * As soon as we start servicing pages out of order, then we have
> > +             * to kill the bulk stage, since the bulk stage assumes
> > +             * in (migration_bitmap_find_and_reset_dirty) that every page is
> > +             * dirty, that's no longer true.
> > +             */
> > +            ram_bulk_stage = false;
> > +            /*
> > +             * We mustn't change block/offset unless it's to a valid one
> > +             * otherwise we can go down some of the exit cases in the normal
> > +             * path.
> > +             */
> > +            block = tmpblock;
> > +            offset = tmpoffset;
> > +            last_was_from_queue = true;
> 
> Hrm.  So now block can change during the execution of
> ram_save_block(), which really suggests that splitting by block is no
> longer a sensible subdivision of the loop surrounding ram_save_block.
> I think it would make more sense to replace that entire surrounding
> loop, so that the logic is essentially
> 
>     while (not finished) {
>         if (something is queued) {
> 	    send that
>         } else {
> 	    scan for an unsent block and offset
> 	    send that instead
> 	}
> 
> 
> >          } else {
> > -            bytes_sent = ram_save_page(f, block, offset, last_stage);
> > -
> > -            /* if page is unmodified, continue to the next */
> > -            if (bytes_sent > 0) {
> > -                MigrationState *ms = migrate_get_current();
> > -                if (ms->sentmap) {
> > -                    set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> > +            MemoryRegion *mr;
> > +            /* priority queue empty, so just search for something dirty */
> > +            mr = block->mr;
> > +            offset = migration_bitmap_find_and_reset_dirty(mr, offset,
> > +                                                           &dirty_ram_abs);
> > +            if (complete_round && block == last_seen_block &&
> > +                offset >= last_offset) {
> > +                break;
> > +            }
> > +            if (offset >= block->used_length) {
> > +                offset = 0;
> > +                block = QTAILQ_NEXT(block, next);
> > +                if (!block) {
> > +                    block = QTAILQ_FIRST(&ram_list.blocks);
> > +                    complete_round = true;
> > +                    ram_bulk_stage = false;
> >                  }
> > +                continue; /* pick an offset in the new block */
> > +            }
> > +            last_was_from_queue = false;
> > +        }
> >  
> > -                last_sent_block = block;
> > -                break;
> > +        /* We have a page to send, so send it */
> > +        bytes_sent = ram_save_page(f, block, offset, last_stage);
> > +
> > +        /* if page is unmodified, continue to the next */
> > +        if (bytes_sent > 0) {
> > +            if (ms->sentmap) {
> > +                set_bit(dirty_ram_abs >> TARGET_PAGE_BITS, ms->sentmap);
> >              }
> > +
> > +            last_sent_block = block;
> > +            break;
> >          }
> >      }
> >      last_seen_block = block;
> > @@ -865,6 +968,7 @@ static void reset_ram_globals(void)
> >      last_offset = 0;
> >      last_version = ram_list.version;
> >      ram_bulk_stage = true;
> > +    last_was_from_queue = false;
> >  }
> >  
> >  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> > diff --git a/trace-events b/trace-events
> > index 8a0d70d..781cf5c 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -1217,6 +1217,8 @@ qemu_file_fclose(void) ""
> >  migration_bitmap_sync_start(void) ""
> >  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64""
> >  migration_throttle(void) ""
> > +ram_find_and_save_block_postcopy(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr) "%s/%" PRIx64 " ram_addr=%" PRIx64
> > +ram_find_and_save_block_postcopy_not_dirty(const char *block_name, uint64_t tmp_offset, uint64_t ram_addr, int sent) "%s/%" PRIx64 " ram_addr=%" PRIx64 " (sent=%d)"
> >  ram_postcopy_send_discard_bitmap(void) ""
> >  ram_save_queue_pages(const char *rbname, size_t start, size_t len) "%s: start: %zx len: %zx"
> >  
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

* Re: [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test
  2015-03-30 14:09           ` Paolo Bonzini
@ 2015-06-16 10:49             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 181+ messages in thread
From: Dr. David Alan Gilbert @ 2015-06-16 10:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, yamahata, quintela, qemu-devel, amit.shah, yanghy,
	David Gibson

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 30/03/2015 16:07, Dr. David Alan Gilbert wrote:
> >>> > > 
> >> > 
> >> > You would just require new _installed_ kernel headers.  Then you can use
> >> > linux/userfaultfd.h and syscall.h (the latter from glibc, includes
> >> > asm/unistd.h to get syscall numbers).
> >> > 
> >> > linux-headers/ is useful for APIs that do not require system calls, or
> >> > for APIs that are extensible.  However, if a system call is required
> >> > (and mandatory) it's simpler to just use installed headers.
> > OK, so then I could check for ifdef __NR_userfault and then
> > do the include and I think that would be safe.
> 
> I think it's okay.  First include syscall.h, then include
> linux/userfaultfd.h under #ifdef.
> 
> > Although then what's the best way to tell people to try it out
> > without an updated libc?
> 
> They don't need an updated libc, just an updated kernel.  syscall.h is
> just a wrapper around Linux headers.

That's what I've implemented in the v6 and the v7 series I've just posted.
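
In outline it's something like this (just a sketch of the idea - the
probe and the exact macro name, __NR_userfaultfd here, are illustrative
rather than the code in the series):

    #include <stdbool.h>
    #include <unistd.h>
    #include <sys/syscall.h>        /* glibc; pulls in asm/unistd.h */

    #ifdef __NR_userfaultfd
    #include <linux/userfaultfd.h>  /* installed kernel header */
    #endif

    static bool host_supports_userfault(void)
    {
    #ifdef __NR_userfaultfd
        int ufd = syscall(__NR_userfaultfd, 0);

        if (ufd != -1) {
            close(ufd);
            return true;            /* kernel provides the syscall */
        }
    #endif
        return false;               /* headers or kernel too old */
    }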

Dave

> 
> Paolo
> 
> > Or is it best to modify ./configure to detect it?
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 181+ messages in thread

end of thread, other threads:[~2015-06-16 10:50 UTC | newest]

Thread overview: 181+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-25 16:51 [Qemu-devel] [PATCH v5 00/45] Postcopy implementation Dr. David Alan Gilbert (git)
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works Dr. David Alan Gilbert (git)
2015-03-05  3:21   ` David Gibson
2015-03-05  9:21     ` Dr. David Alan Gilbert
2015-03-10  1:04       ` David Gibson
2015-03-13 13:07         ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 02/45] Split header writing out of qemu_save_state_begin Dr. David Alan Gilbert (git)
2015-03-10  1:05   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 03/45] qemu_ram_foreach_block: pass up error value, and down the ramblock name Dr. David Alan Gilbert (git)
2015-03-10 15:30   ` Eric Blake
2015-03-10 16:21     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 04/45] Add qemu_get_counted_string to read a string prefixed by a count byte Dr. David Alan Gilbert (git)
2015-03-10  1:12   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 05/45] Create MigrationIncomingState Dr. David Alan Gilbert (git)
2015-03-10  2:37   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 06/45] Provide runtime Target page information Dr. David Alan Gilbert (git)
2015-03-10  2:38   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 07/45] Return path: Open a return path on QEMUFile for sockets Dr. David Alan Gilbert (git)
2015-03-10  2:49   ` David Gibson
2015-03-13 13:14     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 08/45] Return path: socket_writev_buffer: Block even on non-blocking fd's Dr. David Alan Gilbert (git)
2015-03-10  2:56   ` David Gibson
2015-03-10 13:35     ` Dr. David Alan Gilbert
2015-03-11  1:51       ` David Gibson
2015-03-28 15:30   ` Paolo Bonzini
2015-03-29  4:07     ` David Gibson
2015-03-29  9:03       ` Paolo Bonzini
2015-03-30 16:50         ` Dr. David Alan Gilbert
2015-03-30 18:22           ` Markus Armbruster
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 09/45] Migration commands Dr. David Alan Gilbert (git)
2015-03-10  4:58   ` David Gibson
2015-03-10 11:04     ` Dr. David Alan Gilbert
2015-03-10 11:06       ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 10/45] Return path: Control commands Dr. David Alan Gilbert (git)
2015-03-10  5:40   ` David Gibson
2015-03-28 15:32   ` Paolo Bonzini
2015-03-30 17:34     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 11/45] Return path: Send responses from destination to source Dr. David Alan Gilbert (git)
2015-03-10  5:47   ` David Gibson
2015-03-10 14:34     ` Dr. David Alan Gilbert
2015-03-11  1:54       ` David Gibson
2015-03-25 18:47         ` Dr. David Alan Gilbert
2015-03-28 15:34     ` Paolo Bonzini
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 12/45] Return path: Source handling of return path Dr. David Alan Gilbert (git)
2015-03-10  6:08   ` David Gibson
2015-03-20 18:17     ` Dr. David Alan Gilbert
2015-03-23  2:37       ` David Gibson
2015-04-01 15:14         ` Dr. David Alan Gilbert
2015-04-07  3:07           ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 13/45] ram_debug_dump_bitmap: Dump a migration bitmap as text Dr. David Alan Gilbert (git)
2015-03-10  6:11   ` David Gibson
2015-03-20 18:48     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 14/45] Move loadvm_handlers into MigrationIncomingState Dr. David Alan Gilbert (git)
2015-03-10  6:19   ` David Gibson
2015-03-10 10:12     ` Dr. David Alan Gilbert
2015-03-10 11:03       ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 15/45] Rework loadvm path for subloops Dr. David Alan Gilbert (git)
2015-03-12  6:11   ` David Gibson
2015-04-14 12:04     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 16/45] Add migration-capability boolean for postcopy-ram Dr. David Alan Gilbert (git)
2015-03-12  6:14   ` David Gibson
2015-03-13 12:58     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 17/45] Add wrappers and handlers for sending/receiving the postcopy-ram migration messages Dr. David Alan Gilbert (git)
2015-03-12  9:30   ` David Gibson
2015-03-26 16:33     ` Dr. David Alan Gilbert
2015-03-27  4:13       ` David Gibson
2015-03-27 10:48         ` Dr. David Alan Gilbert
2015-03-28 16:00           ` Paolo Bonzini
2015-03-30  4:03           ` David Gibson
2015-03-28 15:58         ` Paolo Bonzini
2015-03-28 15:43   ` Paolo Bonzini
2015-03-30 17:46     ` Dr. David Alan Gilbert
2015-03-30 19:23       ` Paolo Bonzini
2015-03-31 11:05         ` Dr. David Alan Gilbert
2015-03-31 11:10           ` Paolo Bonzini
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 18/45] MIG_CMD_PACKAGED: Send a packaged chunk of migration stream Dr. David Alan Gilbert (git)
2015-03-13  0:55   ` David Gibson
2015-03-13 11:51     ` Dr. David Alan Gilbert
2015-03-16  6:16       ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 19/45] migrate_init: Call from savevm Dr. David Alan Gilbert (git)
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 20/45] Modify savevm handlers for postcopy Dr. David Alan Gilbert (git)
2015-03-13  1:00   ` David Gibson
2015-03-13 10:19     ` Dr. David Alan Gilbert
2015-03-16  6:18       ` David Gibson
2015-03-20 12:37         ` Dr. David Alan Gilbert
2015-03-23  2:25           ` David Gibson
2015-03-24 20:04             ` Dr. David Alan Gilbert
2015-03-24 22:32               ` David Gibson
2015-03-25 15:00                 ` Dr. David Alan Gilbert
2015-03-25 16:40                   ` Dr. David Alan Gilbert
2015-03-26  1:35                     ` David Gibson
2015-03-26 11:44                       ` Dr. David Alan Gilbert
2015-03-27  3:56                         ` David Gibson
2015-03-26  1:35                   ` David Gibson
2015-03-30  8:10                 ` Paolo Bonzini
2015-03-31  0:10                   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 21/45] Add Linux userfaultfd header Dr. David Alan Gilbert (git)
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 22/45] postcopy: OS support test Dr. David Alan Gilbert (git)
2015-03-13  1:23   ` David Gibson
2015-03-13 10:41     ` Dr. David Alan Gilbert
2015-03-16  6:22       ` David Gibson
2015-03-30  8:14       ` Paolo Bonzini
2015-03-30 14:07         ` Dr. David Alan Gilbert
2015-03-30 14:09           ` Paolo Bonzini
2015-06-16 10:49             ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 23/45] migrate_start_postcopy: Command to trigger transition to postcopy Dr. David Alan Gilbert (git)
2015-03-13  1:26   ` David Gibson
2015-03-13 11:19     ` Dr. David Alan Gilbert
2015-03-16  6:23       ` David Gibson
2015-03-18 17:59         ` Dr. David Alan Gilbert
2015-03-19  4:18           ` David Gibson
2015-03-19  9:33             ` Dr. David Alan Gilbert
2015-03-23  2:20               ` David Gibson
2015-03-30  8:19                 ` Paolo Bonzini
2015-03-30 17:04                   ` Dr. David Alan Gilbert
2015-03-30 19:22                     ` Paolo Bonzini
2015-03-31 11:21                       ` Dr. David Alan Gilbert
2015-03-30  8:17               ` Paolo Bonzini
2015-03-31  2:23                 ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 24/45] MIG_STATE_POSTCOPY_ACTIVE: Add new migration state Dr. David Alan Gilbert (git)
2015-03-13  4:45   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 25/45] qemu_savevm_state_complete: Postcopy changes Dr. David Alan Gilbert (git)
2015-03-13  4:58   ` David Gibson
2015-03-13 12:25     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 26/45] Postcopy page-map-incoming (PMI) structure Dr. David Alan Gilbert (git)
2015-03-13  5:19   ` David Gibson
2015-03-13 13:47     ` Dr. David Alan Gilbert
2015-03-16  6:30       ` David Gibson
2015-03-18 17:58         ` Dr. David Alan Gilbert
2015-03-23  2:48           ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 27/45] Postcopy: Maintain sentmap and calculate discard Dr. David Alan Gilbert (git)
2015-03-23  3:30   ` David Gibson
2015-03-23 14:36     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 28/45] postcopy: Incoming initialisation Dr. David Alan Gilbert (git)
2015-03-23  3:41   ` David Gibson
2015-03-23 13:46     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 29/45] postcopy: ram_enable_notify to switch on userfault Dr. David Alan Gilbert (git)
2015-03-23  3:45   ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 30/45] Postcopy: Postcopy startup in migration thread Dr. David Alan Gilbert (git)
2015-03-23  4:20   ` David Gibson
2015-03-26 11:05     ` Dr. David Alan Gilbert
2015-03-30  8:31       ` Paolo Bonzini
2015-04-13 11:35         ` Dr. David Alan Gilbert
2015-04-13 13:26           ` Paolo Bonzini
2015-04-13 14:58             ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 31/45] Postcopy end in migration_thread Dr. David Alan Gilbert (git)
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 32/45] Page request: Add MIG_RP_CMD_REQ_PAGES reverse command Dr. David Alan Gilbert (git)
2015-03-23  5:00   ` David Gibson
2015-03-25 18:16     ` Dr. David Alan Gilbert
2015-03-26  1:28       ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 33/45] Page request: Process incoming page request Dr. David Alan Gilbert (git)
2015-03-24  1:53   ` David Gibson
2015-03-25 17:37     ` Dr. David Alan Gilbert
2015-03-26  1:31       ` David Gibson
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 34/45] Page request: Consume pages off the post-copy queue Dr. David Alan Gilbert (git)
2015-03-24  2:15   ` David Gibson
2015-06-16 10:48     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 35/45] postcopy_ram.c: place_page and helpers Dr. David Alan Gilbert (git)
2015-03-24  2:33   ` David Gibson
2015-03-25 17:46     ` Dr. David Alan Gilbert
2015-02-25 16:51 ` [Qemu-devel] [PATCH v5 36/45] Postcopy: Use helpers to map pages during migration Dr. David Alan Gilbert (git)
2015-03-24  4:51   ` David Gibson
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 37/45] qemu_ram_block_from_host Dr. David Alan Gilbert (git)
2015-03-24  4:55   ` David Gibson
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 38/45] Don't sync dirty bitmaps in postcopy Dr. David Alan Gilbert (git)
2015-03-24  4:58   ` David Gibson
2015-03-24  9:05     ` Dr. David Alan Gilbert
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 39/45] Host page!=target page: Cleanup bitmaps Dr. David Alan Gilbert (git)
2015-03-24  5:23   ` David Gibson
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 40/45] Postcopy; Handle userfault requests Dr. David Alan Gilbert (git)
2015-03-24  5:38   ` David Gibson
2015-03-26 11:59     ` Dr. David Alan Gilbert
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 41/45] Start up a postcopy/listener thread ready for incoming page data Dr. David Alan Gilbert (git)
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 42/45] postcopy: Wire up loadvm_postcopy_handle_{run, end} commands Dr. David Alan Gilbert (git)
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 43/45] End of migration for postcopy Dr. David Alan Gilbert (git)
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 44/45] Disable mlock around incoming postcopy Dr. David Alan Gilbert (git)
2015-03-23  4:33   ` David Gibson
2015-02-25 16:52 ` [Qemu-devel] [PATCH v5 45/45] Inhibit ballooning during postcopy Dr. David Alan Gilbert (git)
2015-03-23  4:32   ` David Gibson
2015-03-23 12:21     ` Dr. David Alan Gilbert
2015-03-24  1:25       ` David Gibson
