* [PATCH 00/21][RFC] postcopy live migration
@ 2011-12-29  1:25 ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Intro
=====
This patch series implements postcopy live migration.[1]
As discussed at KVM Forum 2011, a dedicated character device is used for
distributed shared memory between the migration source and destination.
Now we can discuss/benchmark/compare it with precopy. I believe there is
much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration


Usage
=====
You need to load the umem character device on the host before starting
migration. Postcopy can be used with both the tcg and kvm accelerators. The
implementation depends only on the Linux umem character device, but the
driver-dependent code is split into a separate file.
I tested only the host page size == guest page size case, but the
implementation allows the host page size != guest page size case.

The following options are added by this patch series.
- incoming part
  command line options
  -postcopy [-postcopy-flags <flags>]
  where <flags> changes the behavior for benchmarking/debugging.
  Currently the following flags are available:
  0: default
  1: enable touching page request

  example:
  qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm

- outgoing part
  options for the migrate command
  migrate [-p [-n]] URI
  -p: indicates postcopy migration
  -n: disables background page transfer; this is for benchmarking/debugging

  example:
  migrate -p -n tcp:<dest ip address>:4444


TODO
====
- benchmark/evaluation, especially how async page faults affect the results
- improvements/optimizations
  At the moment, at least the following are known:
  - touching pages in the incoming qemu process from the fd handler seems
    suboptimal; create a dedicated thread?
  - make the incoming socket non-blocking
  - the outgoing handler seems suboptimal and causes latency
- catch up with memory API changes
- consider the FUSE/CUSE possibility
- and more...

basic postcopy work flow
========================
        qemu on the destination
              |
              V
        open(/dev/umem)
              |
              V
        UMEM_DEV_CREATE_UMEM
              |
              V
        Here we have two file descriptors:
        the umem device and the shmem file
              |
              |                                  umemd
              |                                  daemon on the destination
              |
              V    create pipe to communicate
        fork()---------------------------------------,
              |                                      |
              V                                      |
        close(socket)                                V
        close(shmem)                              mmap(shmem file)
              |                                      |
              V                                      V
        mmap(umem device) for guest RAM           close(shmem file)
              |                                      |
        close(umem device)                           |
              |                                      |
              V                                      |
        wait for ready from daemon <----pipe-----send ready message
              |                                      |
              |                                 Here the daemon takes over
        send ok------------pipe---------------> ownership of the socket
              |                                 to the source
              V                                      |
        entering post copy stage                     |
        start guest execution                        |
              |                                      |
              V                                      V
        access guest RAM                          UMEM_GET_PAGE_REQUEST
              |                                      |
              V                                      V
        page fault ------------------------------>page offset is returned
        block                                        |
                                                     V
                                                  pull page from the source
                                                  write the page contents
                                                  to the shmem.
                                                     |
                                                     V
        unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
        the fault handler returns the page
        page fault is resolved
              |
              |                                   pages can be sent
              |                                   in the background
              |                                      |
              |                                      V
              |                                   UMEM_MARK_PAGE_CACHED
              |                                      |
              V                                      V
        The specified pages<-----pipe------------request to touch pages
        are made present by                          |
        touching guest RAM.                          |
              |                                      |
              V                                      V
             reply-------------pipe-------------> release the cached page
              |                                   madvise(MADV_REMOVE)
              |                                      |
              V                                      V

                 all the pages are pulled from the source

              |                                      |
              V                                      V
        the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
       (note: I'm not sure if this can be implemented or not)
              |                                      |
              V                                      V
        migration completes                        exit()



Isaku Yamahata (21):
  arch_init: export sort_ram_list() and ram_save_block()
  arch_init: export RAM_SAVE_xxx flags for postcopy
  arch_init/ram_save: introduce constant for ram save version = 4
  arch_init: refactor host_from_stream_offset()
  arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
  arch_init: refactor ram_save_block()
  arch_init/ram_save_live: factor out ram_save_limit
  arch_init/ram_load: refactor ram_load
  exec.c: factor out qemu_get_ram_ptr()
  exec.c: export last_ram_offset()
  savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
  savevm: qemu_pending_size() to return pending buffered size
  savevm, buffered_file: introduce method to drain buffer of buffered
    file
  migration: export migrate_fd_completed() and migrate_fd_cleanup()
  migration: factor out parameters into MigrationParams
  umem.h: import Linux umem.h
  update-linux-headers.sh: teach umem.h to update-linux-headers.sh
  configure: add CONFIG_POSTCOPY option
  postcopy: introduce -postcopy and -postcopy-flags option
  postcopy outgoing: add -p and -n option to migrate command
  postcopy: implement postcopy livemigration

 Makefile.target                 |    4 +
 arch_init.c                     |  260 ++++---
 arch_init.h                     |   20 +
 block-migration.c               |    8 +-
 buffered_file.c                 |   20 +-
 buffered_file.h                 |    1 +
 configure                       |   12 +
 cpu-all.h                       |    9 +
 exec-obsolete.h                 |    1 +
 exec.c                          |   75 +-
 hmp-commands.hx                 |   12 +-
 hw/hw.h                         |    7 +-
 linux-headers/linux/umem.h      |   83 ++
 migration-exec.c                |    8 +
 migration-fd.c                  |   30 +
 migration-postcopy-stub.c       |   77 ++
 migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
 migration-tcp.c                 |   37 +-
 migration-unix.c                |   32 +-
 migration.c                     |   53 +-
 migration.h                     |   49 +-
 qemu-common.h                   |    2 +
 qemu-options.hx                 |   25 +
 qmp-commands.hx                 |   10 +-
 savevm.c                        |   31 +-
 scripts/update-linux-headers.sh |    2 +-
 sysemu.h                        |    4 +-
 umem.c                          |  379 ++++++++
 umem.h                          |  105 +++
 vl.c                            |   20 +-
 30 files changed, 3086 insertions(+), 181 deletions(-)
 create mode 100644 linux-headers/linux/umem.h
 create mode 100644 migration-postcopy-stub.c
 create mode 100644 migration-postcopy.c
 create mode 100644 umem.c
 create mode 100644 umem.h


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH 01/21] arch_init: export sort_ram_list() and ram_save_block()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This will be used by postcopy.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |    4 ++--
 migration.h |    2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index d4c92b0..1947396 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -112,7 +112,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
 static RAMBlock *last_block;
 static ram_addr_t last_offset;
 
-static int ram_save_block(QEMUFile *f)
+int ram_save_block(QEMUFile *f)
 {
     RAMBlock *block = last_block;
     ram_addr_t offset = last_offset;
@@ -229,7 +229,7 @@ static int block_compar(const void *a, const void *b)
     return 0;
 }
 
-static void sort_ram_list(void)
+void sort_ram_list(void)
 {
     RAMBlock *block, *nblock, **blocks;
     int n;
diff --git a/migration.h b/migration.h
index 372b066..e79a69b 100644
--- a/migration.h
+++ b/migration.h
@@ -78,6 +78,8 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
 uint64_t ram_bytes_total(void);
 
+void sort_ram_list(void);
+int ram_save_block(QEMUFile *f);
 int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque);
 int ram_load(QEMUFile *f, void *opaque, int version_id);
 
-- 
1.7.1.1



* [PATCH 02/21] arch_init: export RAM_SAVE_xxx flags for postcopy
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

These constants will also be used by postcopy.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |    7 -------
 arch_init.h |    7 +++++++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 1947396..4ede5ad 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -87,13 +87,6 @@ const uint32_t arch_type = QEMU_ARCH;
 /***********************************************************/
 /* ram save/restore */
 
-#define RAM_SAVE_FLAG_FULL     0x01 /* Obsolete, not used anymore */
-#define RAM_SAVE_FLAG_COMPRESS 0x02
-#define RAM_SAVE_FLAG_MEM_SIZE 0x04
-#define RAM_SAVE_FLAG_PAGE     0x08
-#define RAM_SAVE_FLAG_EOS      0x10
-#define RAM_SAVE_FLAG_CONTINUE 0x20
-
 static int is_dup_page(uint8_t *page, uint8_t ch)
 {
     uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
diff --git a/arch_init.h b/arch_init.h
index 828256c..cf27625 100644
--- a/arch_init.h
+++ b/arch_init.h
@@ -32,4 +32,11 @@ int tcg_available(void);
 int kvm_available(void);
 int xen_available(void);
 
+#define RAM_SAVE_FLAG_FULL     0x01 /* Obsolete, not used anymore */
+#define RAM_SAVE_FLAG_COMPRESS 0x02
+#define RAM_SAVE_FLAG_MEM_SIZE 0x04
+#define RAM_SAVE_FLAG_PAGE     0x08
+#define RAM_SAVE_FLAG_EOS      0x10
+#define RAM_SAVE_FLAG_CONTINUE 0x20
+
 #endif
-- 
1.7.1.1


* [PATCH 03/21] arch_init/ram_save: introduce constant for ram save version = 4
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Introduce RAM_SAVE_VERSION_ID to represent the version_id of the ram save format.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |    2 +-
 arch_init.h |    2 ++
 vl.c        |    4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 4ede5ad..5ad6956 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -371,7 +371,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
     int flags;
     int error;
 
-    if (version_id < 3 || version_id > 4) {
+    if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
         return -EINVAL;
     }
 
diff --git a/arch_init.h b/arch_init.h
index cf27625..a3aa059 100644
--- a/arch_init.h
+++ b/arch_init.h
@@ -39,4 +39,6 @@ int xen_available(void);
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 
+#define RAM_SAVE_VERSION_ID     4 /* currently version 4 */
+
 #endif
diff --git a/vl.c b/vl.c
index c03abb6..a4c9489 100644
--- a/vl.c
+++ b/vl.c
@@ -3266,8 +3266,8 @@ int main(int argc, char **argv, char **envp)
     default_drive(default_sdcard, snapshot, machine->use_scsi,
                   IF_SD, 0, SD_OPTS);
 
-    register_savevm_live(NULL, "ram", 0, 4, NULL, ram_save_live, NULL,
-                         ram_load, NULL);
+    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
+                         ram_save_live, NULL, ram_load, NULL);
 
     if (nb_numa_nodes > 0) {
         int i;
-- 
1.7.1.1



* [PATCH 04/21] arch_init: refactor host_from_stream_offset()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |   25 ++++++++++++++++++-------
 arch_init.h |    9 +++++++++
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5ad6956..d55e39c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -335,21 +335,22 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     return (stage == 2) && (expected_time <= migrate_max_downtime());
 }
 
-static inline void *host_from_stream_offset(QEMUFile *f,
-                                            ram_addr_t offset,
-                                            int flags)
+void *ram_load_host_from_stream_offset(QEMUFile *f,
+                                       ram_addr_t offset,
+                                       int flags,
+                                       RAMBlock **last_blockp)
 {
-    static RAMBlock *block = NULL;
+    RAMBlock *block;
     char id[256];
     uint8_t len;
 
     if (flags & RAM_SAVE_FLAG_CONTINUE) {
-        if (!block) {
+        if (!(*last_blockp)) {
             fprintf(stderr, "Ack, bad migration stream!\n");
             return NULL;
         }
 
-        return block->host + offset;
+        return (*last_blockp)->host + offset;
     }
 
     len = qemu_get_byte(f);
@@ -357,14 +358,24 @@ static inline void *host_from_stream_offset(QEMUFile *f,
     id[len] = 0;
 
     QLIST_FOREACH(block, &ram_list.blocks, next) {
-        if (!strncmp(id, block->idstr, sizeof(id)))
+        if (!strncmp(id, block->idstr, sizeof(id))) {
+            *last_blockp = block;
             return block->host + offset;
+        }
     }
 
     fprintf(stderr, "Can't find block %s!\n", id);
     return NULL;
 }
 
+static inline void *host_from_stream_offset(QEMUFile *f,
+                                            ram_addr_t offset,
+                                            int flags)
+{
+    static RAMBlock *block = NULL;
+    return ram_load_host_from_stream_offset(f, offset, flags, &block);
+}
+
 int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     ram_addr_t addr;
diff --git a/arch_init.h b/arch_init.h
index a3aa059..14d6644 100644
--- a/arch_init.h
+++ b/arch_init.h
@@ -1,6 +1,8 @@
 #ifndef QEMU_ARCH_INIT_H
 #define QEMU_ARCH_INIT_H
 
+#include "qemu-common.h"
+
 extern const char arch_config_name[];
 
 enum {
@@ -41,4 +43,11 @@ int xen_available(void);
 
 #define RAM_SAVE_VERSION_ID     4 /* currently version 4 */
 
+#ifdef NEED_CPU_H
+void *ram_load_host_from_stream_offset(QEMUFile *f,
+                                       ram_addr_t offset,
+                                       int flags,
+                                       RAMBlock **last_blockp);
+#endif
+
 #endif
-- 
1.7.1.1




* [PATCH 05/21] arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |   21 ++++++++++++++-------
 migration.h |    1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index d55e39c..9bc313e 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -243,6 +243,19 @@ void sort_ram_list(void)
     g_free(blocks);
 }
 
+void ram_save_live_mem_size(QEMUFile *f)
+{
+    RAMBlock *block;
+
+    qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        qemu_put_byte(f, strlen(block->idstr));
+        qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
+        qemu_put_be64(f, block->length);
+    }
+}
+
 int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
 {
     ram_addr_t addr;
@@ -282,13 +295,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         /* Enable dirty memory tracking */
         cpu_physical_memory_set_dirty_tracking(1);
 
-        qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
-
-        QLIST_FOREACH(block, &ram_list.blocks, next) {
-            qemu_put_byte(f, strlen(block->idstr));
-            qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
-            qemu_put_be64(f, block->length);
-        }
+        ram_save_live_mem_size(f);
     }
 
     bytes_transferred_last = bytes_transferred;
diff --git a/migration.h b/migration.h
index e79a69b..cb4a2d5 100644
--- a/migration.h
+++ b/migration.h
@@ -80,6 +80,7 @@ uint64_t ram_bytes_total(void);
 
 void sort_ram_list(void);
 int ram_save_block(QEMUFile *f);
+void ram_save_live_mem_size(QEMUFile *f);
 int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque);
 int ram_load(QEMUFile *f, void *opaque, int version_id);
 
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
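The factored-out `ram_save_live_mem_size()` writes one record per RAM block: an id-length byte, the id string (not NUL-terminated), and the block length as a 64-bit big-endian integer. A buffer-based sketch of that record layout, with `put_block_record` as an invented stand-in for the `qemu_put_*` calls:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize one RAM-block catalog record into buf; returns bytes written.
 * Layout: [len:1][id:len][block length:8, big-endian]. */
static size_t put_block_record(uint8_t *buf, const char *id, uint64_t len)
{
    size_t n = strlen(id);
    size_t pos = 0;
    int i;

    buf[pos++] = (uint8_t)n;             /* id-length byte */
    memcpy(buf + pos, id, n);            /* id bytes, no terminator */
    pos += n;
    for (i = 7; i >= 0; i--)             /* 64-bit big-endian length */
        buf[pos++] = (uint8_t)(len >> (i * 8));
    return pos;
}
```

The destination walks the same records in `ram_load` to verify its block list matches the source's.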


* [PATCH 06/21] arch_init: refactor ram_save_block()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |   82 +++++++++++++++++++++++++++++++---------------------------
 arch_init.h |    1 +
 2 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 9bc313e..982c846 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -102,6 +102,44 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
+static RAMBlock *last_block_sent = NULL;
+
+int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
+{
+    ram_addr_t current_addr = block->offset + offset;
+    uint8_t *p;
+    int cont;
+
+    if (!cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
+        return 0;
+    }
+    cpu_physical_memory_reset_dirty(current_addr,
+                                    current_addr + TARGET_PAGE_SIZE,
+                                    MIGRATION_DIRTY_FLAG);
+
+    p = block->host + offset;
+    cont = (block == last_block_sent) ? RAM_SAVE_FLAG_CONTINUE : 0;
+    last_block_sent = block;
+
+    if (is_dup_page(p, *p)) {
+        qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
+        if (!cont) {
+            qemu_put_byte(f, strlen(block->idstr));
+            qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
+        }
+        qemu_put_byte(f, *p);
+        return 1;
+    }
+
+    qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE);
+    if (!cont) {
+        qemu_put_byte(f, strlen(block->idstr));
+        qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
+    }
+    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+    return TARGET_PAGE_SIZE;
+}
+
 static RAMBlock *last_block;
 static ram_addr_t last_offset;
 
@@ -109,47 +147,13 @@ int ram_save_block(QEMUFile *f)
 {
     RAMBlock *block = last_block;
     ram_addr_t offset = last_offset;
-    ram_addr_t current_addr;
     int bytes_sent = 0;
 
     if (!block)
         block = QLIST_FIRST(&ram_list.blocks);
 
-    current_addr = block->offset + offset;
-
     do {
-        if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
-            uint8_t *p;
-            int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;
-
-            cpu_physical_memory_reset_dirty(current_addr,
-                                            current_addr + TARGET_PAGE_SIZE,
-                                            MIGRATION_DIRTY_FLAG);
-
-            p = block->host + offset;
-
-            if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
-                }
-                qemu_put_byte(f, *p);
-                bytes_sent = 1;
-            } else {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
-                }
-                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-                bytes_sent = TARGET_PAGE_SIZE;
-            }
-
-            break;
-        }
+        bytes_sent = ram_save_page(f, block, offset);
 
         offset += TARGET_PAGE_SIZE;
         if (offset >= block->length) {
@@ -159,9 +163,10 @@ int ram_save_block(QEMUFile *f)
                 block = QLIST_FIRST(&ram_list.blocks);
         }
 
-        current_addr = block->offset + offset;
-
-    } while (current_addr != last_block->offset + last_offset);
+        if (bytes_sent > 0) {
+            break;
+        }
+    } while (block->offset + offset != last_block->offset + last_offset);
 
     last_block = block;
     last_offset = offset;
@@ -277,6 +282,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     if (stage == 1) {
         RAMBlock *block;
         bytes_transferred = 0;
+        last_block_sent = NULL;
         last_block = NULL;
         last_offset = 0;
         sort_ram_list();
diff --git a/arch_init.h b/arch_init.h
index 14d6644..118461a 100644
--- a/arch_init.h
+++ b/arch_init.h
@@ -44,6 +44,7 @@ int xen_available(void);
 #define RAM_SAVE_VERSION_ID     4 /* currently version 4 */
 
 #ifdef NEED_CPU_H
+int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 void *ram_load_host_from_stream_offset(QEMUFile *f,
                                        ram_addr_t offset,
                                        int flags,
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
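The `is_dup_page()` check behind `RAM_SAVE_FLAG_COMPRESS` in the patch above decides whether a page can be sent as a single byte: if every byte of the page equals its first byte, `ram_save_page()` transmits only that byte and the receiver fills the page. A byte-at-a-time sketch (QEMU's actual version compares in long-sized chunks for speed):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Return 1 iff every byte of the page equals ch, so the page can be
 * migrated as one byte instead of PAGE_SIZE bytes. */
static int is_dup_page(const uint8_t *page, uint8_t ch)
{
    size_t i;

    for (i = 0; i < PAGE_SIZE; i++) {
        if (page[i] != ch)
            return 0;
    }
    return 1;
}
```

This is why `ram_save_page()` returns 1 for a compressed page and `TARGET_PAGE_SIZE` for a full page: the return value doubles as an approximate bytes-sent count.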


* [PATCH 07/21] arch_init/ram_save_live: factor out ram_save_limit
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |   28 +++++++++++++++++-----------
 migration.h |    1 +
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 982c846..249b440 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -261,9 +261,24 @@ void ram_save_live_mem_size(QEMUFile *f)
     }
 }
 
+void ram_save_memory_set_dirty(void)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        ram_addr_t addr;
+        for (addr = block->offset; addr < block->offset + block->length;
+             addr += TARGET_PAGE_SIZE) {
+            if (!cpu_physical_memory_get_dirty(addr,
+                                               MIGRATION_DIRTY_FLAG)) {
+                cpu_physical_memory_set_dirty(addr);
+            }
+        }
+    }
+}
+
 int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
 {
-    ram_addr_t addr;
     uint64_t bytes_transferred_last;
     double bwidth = 0;
     uint64_t expected_time = 0;
@@ -280,7 +295,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     }
 
     if (stage == 1) {
-        RAMBlock *block;
         bytes_transferred = 0;
         last_block_sent = NULL;
         last_block = NULL;
@@ -288,15 +302,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         sort_ram_list();
 
         /* Make sure all dirty bits are set */
-        QLIST_FOREACH(block, &ram_list.blocks, next) {
-            for (addr = block->offset; addr < block->offset + block->length;
-                 addr += TARGET_PAGE_SIZE) {
-                if (!cpu_physical_memory_get_dirty(addr,
-                                                   MIGRATION_DIRTY_FLAG)) {
-                    cpu_physical_memory_set_dirty(addr);
-                }
-            }
-        }
+        ram_save_memory_set_dirty();
 
         /* Enable dirty memory tracking */
         cpu_physical_memory_set_dirty_tracking(1);
diff --git a/migration.h b/migration.h
index cb4a2d5..6459457 100644
--- a/migration.h
+++ b/migration.h
@@ -80,6 +80,7 @@ uint64_t ram_bytes_total(void);
 
 void sort_ram_list(void);
 int ram_save_block(QEMUFile *f);
+void ram_save_memory_set_dirty(void);
 void ram_save_live_mem_size(QEMUFile *f);
 int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque);
 int ram_load(QEMUFile *f, void *opaque, int version_id);
-- 
1.7.1.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread
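The factored-out `ram_save_memory_set_dirty()` walks every block a page at a time and forces each page's migration dirty bit on, so the stage-1 pass transfers all RAM. A sketch of that walk against a plain byte-per-page bitmap (a simplification of QEMU's dirty-flag machinery):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Mark every page of a block dirty, one page per loop step, as the
 * stage-1 "make sure all dirty bits are set" pass does. */
static void set_all_dirty(unsigned char *dirty, size_t block_len)
{
    size_t addr;

    for (addr = 0; addr < block_len; addr += PAGE_SIZE)
        dirty[addr / PAGE_SIZE] = 1;     /* includes a trailing partial page */
}
```

Splitting this out lets a postcopy path reuse the same "dirty everything" pass without duplicating the nested loop inside `ram_save_live()`.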


* [PATCH 08/21] arch_init/ram_load: refactor ram_load
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 arch_init.c |   67 +++++++++++++++++++++++++++++++++-------------------------
 arch_init.h |    1 +
 2 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 249b440..bc53092 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -395,6 +395,41 @@ static inline void *host_from_stream_offset(QEMUFile *f,
     return ram_load_host_from_stream_offset(f, offset, flags, &block);
 }
 
+int ram_load_mem_size(QEMUFile *f, ram_addr_t total_ram_bytes)
+{
+    /* Synchronize RAM block list */
+    char id[256];
+    ram_addr_t length;
+
+    while (total_ram_bytes) {
+        RAMBlock *block;
+        uint8_t len;
+
+        len = qemu_get_byte(f);
+        qemu_get_buffer(f, (uint8_t *)id, len);
+        id[len] = 0;
+        length = qemu_get_be64(f);
+
+        QLIST_FOREACH(block, &ram_list.blocks, next) {
+            if (!strncmp(id, block->idstr, sizeof(id))) {
+                if (block->length != length)
+                    return -EINVAL;
+                break;
+            }
+        }
+
+        if (!block) {
+            fprintf(stderr, "Unknown ramblock \"%s\", cannot "
+                    "accept migration\n", id);
+            return -EINVAL;
+        }
+
+        total_ram_bytes -= length;
+    }
+
+    return 0;
+}
+
 int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     ram_addr_t addr;
@@ -417,35 +452,9 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
                     return -EINVAL;
                 }
             } else {
-                /* Synchronize RAM block list */
-                char id[256];
-                ram_addr_t length;
-                ram_addr_t total_ram_bytes = addr;
-
-                while (total_ram_bytes) {
-                    RAMBlock *block;
-                    uint8_t len;
-
-                    len = qemu_get_byte(f);
-                    qemu_get_buffer(f, (uint8_t *)id, len);
-                    id[len] = 0;
-                    length = qemu_get_be64(f);
-
-                    QLIST_FOREACH(block, &ram_list.blocks, next) {
-                        if (!strncmp(id, block->idstr, sizeof(id))) {
-                            if (block->length != length)
-                                return -EINVAL;
-                            break;
-                        }
-                    }
-
-                    if (!block) {
-                        fprintf(stderr, "Unknown ramblock \"%s\", cannot "
-                                "accept migration\n", id);
-                        return -EINVAL;
-                    }
-
-                    total_ram_bytes -= length;
+                error = ram_load_mem_size(f, addr);
+                if (error) {
+                    return error;
                 }
             }
         }
diff --git a/arch_init.h b/arch_init.h
index 118461a..72b906d 100644
--- a/arch_init.h
+++ b/arch_init.h
@@ -49,6 +49,7 @@ void *ram_load_host_from_stream_offset(QEMUFile *f,
                                        ram_addr_t offset,
                                        int flags,
                                        RAMBlock **last_blockp);
+int ram_load_mem_size(QEMUFile *f, ram_addr_t total_ram_bytes);
 #endif
 
 #endif
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
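The extracted `ram_load_mem_size()` is a sanity check: every (id, length) record the source sends must match a local RAM block with the same name and size, otherwise migration is refused with `-EINVAL`. A table-based sketch of that check, with `BlockDesc`/`check_mem_size` as invented stand-ins for the block list and the stream reader:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *id;
    uint64_t length;
} BlockDesc;

/* Every incoming record must name a local block of identical length. */
static int check_mem_size(const BlockDesc *local, int nlocal,
                          const BlockDesc *incoming, int nin)
{
    int i, j;

    for (i = 0; i < nin; i++) {
        for (j = 0; j < nlocal; j++) {
            if (strcmp(incoming[i].id, local[j].id) == 0) {
                if (local[j].length != incoming[i].length)
                    return -EINVAL;      /* same block, different size */
                break;
            }
        }
        if (j == nlocal)
            return -EINVAL;              /* unknown RAM block */
    }
    return 0;
}
```

Hoisting this out of the `version_id`/flag dispatch in `ram_load()` also removes two levels of nesting from the original loop.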


* [PATCH 09/21] exec.c: factor out qemu_get_ram_ptr()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 cpu-all.h |    2 ++
 exec.c    |   51 +++++++++++++++++++++++++++++----------------------
 2 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 9d78715..0244f7a 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -496,6 +496,8 @@ extern RAMList ram_list;
 extern const char *mem_path;
 extern int mem_prealloc;
 
+RAMBlock *qemu_get_ram_block(ram_addr_t addr);
+
 /* physical memory access */
 
 /* MMIO pages are identified by a combination of an IO device index and
diff --git a/exec.c b/exec.c
index 32782b4..51b8d15 100644
--- a/exec.c
+++ b/exec.c
@@ -3117,15 +3117,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
 }
 #endif /* !_WIN32 */
 
-/* Return a host pointer to ram allocated with qemu_ram_alloc.
-   With the exception of the softmmu code in this file, this should
-   only be used for local memory (e.g. video ram) that the device owns,
-   and knows it isn't going to access beyond the end of the block.
-
-   It should not be used for general purpose DMA.
-   Use cpu_physical_memory_map/cpu_physical_memory_rw instead.
- */
-void *qemu_get_ram_ptr(ram_addr_t addr)
+RAMBlock *qemu_get_ram_block(ram_addr_t addr)
 {
     RAMBlock *block;
 
@@ -3136,19 +3128,7 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
                 QLIST_REMOVE(block, next);
                 QLIST_INSERT_HEAD(&ram_list.blocks, block, next);
             }
-            if (xen_enabled()) {
-                /* We need to check if the requested address is in the RAM
-                 * because we don't want to map the entire memory in QEMU.
-                 * In that case just map until the end of the page.
-                 */
-                if (block->offset == 0) {
-                    return xen_map_cache(addr, 0, 0);
-                } else if (block->host == NULL) {
-                    block->host =
-                        xen_map_cache(block->offset, block->length, 1);
-                }
-            }
-            return block->host + (addr - block->offset);
+            return block;
         }
     }
 
@@ -3159,6 +3139,33 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
 }
 
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
+   With the exception of the softmmu code in this file, this should
+   only be used for local memory (e.g. video ram) that the device owns,
+   and knows it isn't going to access beyond the end of the block.
+
+   It should not be used for general purpose DMA.
+   Use cpu_physical_memory_map/cpu_physical_memory_rw instead.
+ */
+void *qemu_get_ram_ptr(ram_addr_t addr)
+{
+    RAMBlock *block = qemu_get_ram_block(addr);
+
+    if (xen_enabled()) {
+        /* We need to check if the requested address is in the RAM
+         * because we don't want to map the entire memory in QEMU.
+         * In that case just map until the end of the page.
+         */
+        if (block->offset == 0) {
+            return xen_map_cache(addr, 0, 0);
+        } else if (block->host == NULL) {
+            block->host =
+                xen_map_cache(block->offset, block->length, 1);
+        }
+    }
+    return block->host + (addr - block->offset);
+}
+
+/* Return a host pointer to ram allocated with qemu_ram_alloc.
  * Same as qemu_get_ram_ptr but avoid reordering ramblocks.
  */
 void *qemu_safe_ram_ptr(ram_addr_t addr)
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
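For reference, the split performed by this patch — block lookup (with its move-to-front caching) factored out of the host-pointer mapping — can be sketched outside QEMU as below. `RAMBlock`, the list layout, and the missing Xen path are simplified stand-ins for illustration, not the actual QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ram_addr_t;

typedef struct RAMBlock {
    uint8_t *host;
    ram_addr_t offset;
    ram_addr_t length;
    struct RAMBlock *next;
} RAMBlock;

static RAMBlock *ram_blocks;

/* The factored-out lookup: find the block covering addr and move it
 * to the list head so repeated lookups of the same block stay cheap
 * (this mirrors the QLIST_REMOVE/QLIST_INSERT_HEAD pair in exec.c). */
static RAMBlock *get_ram_block(ram_addr_t addr)
{
    RAMBlock **pp;
    for (pp = &ram_blocks; *pp != NULL; pp = &(*pp)->next) {
        RAMBlock *b = *pp;
        if (addr - b->offset < b->length) {
            *pp = b->next;          /* unlink */
            b->next = ram_blocks;   /* move to front */
            ram_blocks = b;
            return b;
        }
    }
    return NULL;
}

/* The mapping wrapper keeps only the address arithmetic; the real
 * qemu_get_ram_ptr() additionally keeps the Xen map-cache case here. */
static void *get_ram_ptr(ram_addr_t addr)
{
    RAMBlock *b = get_ram_block(addr);
    return b ? b->host + (addr - b->offset) : NULL;
}
```

The point of the split is that postcopy code can now reach the `RAMBlock` itself (offset, length, backing host memory) instead of only a mapped pointer.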

* [PATCH 10/21] exec.c: export last_ram_offset()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 exec-obsolete.h |    1 +
 exec.c          |    4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/exec-obsolete.h b/exec-obsolete.h
index 34b9fc5..8f69f1c 100644
--- a/exec-obsolete.h
+++ b/exec-obsolete.h
@@ -25,6 +25,7 @@
 
 #ifndef CONFIG_USER_ONLY
 
+ram_addr_t qemu_last_ram_offset(void);
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                    ram_addr_t size, void *host,
                                    MemoryRegion *mr);
diff --git a/exec.c b/exec.c
index 51b8d15..c8c6692 100644
--- a/exec.c
+++ b/exec.c
@@ -2907,7 +2907,7 @@ static ram_addr_t find_ram_offset(ram_addr_t size)
     return offset;
 }
 
-static ram_addr_t last_ram_offset(void)
+ram_addr_t qemu_last_ram_offset(void)
 {
     RAMBlock *block;
     ram_addr_t last = 0;
@@ -2989,7 +2989,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
     QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next);
 
     ram_list.phys_dirty = g_realloc(ram_list.phys_dirty,
-                                       last_ram_offset() >> TARGET_PAGE_BITS);
+                                    qemu_last_ram_offset() >> TARGET_PAGE_BITS);
     memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
            0xff, size >> TARGET_PAGE_BITS);
 
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 11/21] savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

These will be used by postcopy.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 hw/hw.h  |    3 +++
 savevm.c |    6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index efa04d1..0b481ba 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,9 @@ void qemu_put_be32(QEMUFile *f, unsigned int v);
 void qemu_put_be64(QEMUFile *f, uint64_t v);
 int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size);
 int qemu_get_byte(QEMUFile *f);
+int qemu_peek_byte(QEMUFile *f, int offset);
+int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset);
+void qemu_file_skip(QEMUFile *f, int size);
 
 static inline unsigned int qemu_get_ubyte(QEMUFile *f)
 {
diff --git a/savevm.c b/savevm.c
index f153c25..ff77846 100644
--- a/savevm.c
+++ b/savevm.c
@@ -586,14 +586,14 @@ void qemu_put_byte(QEMUFile *f, int v)
         qemu_fflush(f);
 }
 
-static void qemu_file_skip(QEMUFile *f, int size)
+void qemu_file_skip(QEMUFile *f, int size)
 {
     if (f->buf_index + size <= f->buf_size) {
         f->buf_index += size;
     }
 }
 
-static int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
+int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
 {
     int pending;
     int index;
@@ -641,7 +641,7 @@ int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size)
     return done;
 }
 
-static int qemu_peek_byte(QEMUFile *f, int offset)
+int qemu_peek_byte(QEMUFile *f, int offset)
 {
     int index = f->buf_index + offset;
 
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
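The peek/skip pattern these exports enable can be sketched as below: incoming postcopy code can inspect bytes already buffered without consuming them, then either skip past them or read them normally. `MiniFile` reduces QEMUFile's internals to a flat buffer for illustration; field names follow the `buf_index`/`buf_size` convention visible in the diff, but this is not the real structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t buf[64];
    int buf_index;   /* next byte to consume */
    int buf_size;    /* bytes valid in buf */
} MiniFile;

/* Look at the byte `offset` past the read position without
 * consuming it; -1 if it is not buffered. */
static int peek_byte(MiniFile *f, int offset)
{
    int index = f->buf_index + offset;
    return index < f->buf_size ? f->buf[index] : -1;
}

/* Advance the read position without copying data out. */
static void file_skip(MiniFile *f, int size)
{
    if (f->buf_index + size <= f->buf_size) {
        f->buf_index += size;
    }
}

/* Ordinary consuming read, built on peek + advance. */
static int get_byte(MiniFile *f)
{
    int v = peek_byte(f, 0);
    if (v >= 0) {
        f->buf_index++;
    }
    return v;
}
```

A typical use is peeking at a section header byte to decide how to dispatch, then skipping it only once the decision is made.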

* [PATCH 12/21] savevm: qemu_pending_size() to return pending buffered size
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This will be used later by postcopy migration.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 hw/hw.h  |    1 +
 savevm.c |    5 +++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 0b481ba..d508b4e 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -80,6 +80,7 @@ int qemu_get_byte(QEMUFile *f);
 int qemu_peek_byte(QEMUFile *f, int offset);
 int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset);
 void qemu_file_skip(QEMUFile *f, int size);
+int qemu_pending_size(const QEMUFile *f);
 
 static inline unsigned int qemu_get_ubyte(QEMUFile *f)
 {
diff --git a/savevm.c b/savevm.c
index ff77846..1d9e218 100644
--- a/savevm.c
+++ b/savevm.c
@@ -593,6 +593,11 @@ void qemu_file_skip(QEMUFile *f, int size)
     }
 }
 
+int qemu_pending_size(const QEMUFile *f)
+{
+    return f->buf_size - f->buf_index;
+}
+
 int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
 {
     int pending;
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 13/21] savevm, buffered_file: introduce method to drain buffer of buffered file
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Introduce a new method to drain the buffer of QEMUBufferedFile.
During postcopy migration, the buffer size can grow without bound.
To keep the buffer size reasonably small, introduce a method that
waits for the buffer to drain.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 buffered_file.c |   20 +++++++++++++++-----
 buffered_file.h |    1 +
 hw/hw.h         |    1 +
 savevm.c        |    6 ++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index fed9a22..be1a192 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -168,6 +168,15 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, in
     return offset;
 }
 
+static void buffered_drain(QEMUFileBuffered *s)
+{
+    while (!qemu_file_get_error(s->file) && s->buffer_size) {
+        buffered_flush(s);
+        if (s->freeze_output)
+            s->wait_for_unfreeze(s->opaque);
+    }
+}
+
 static int buffered_close(void *opaque)
 {
     QEMUFileBuffered *s = opaque;
@@ -175,11 +184,7 @@ static int buffered_close(void *opaque)
 
     DPRINTF("closing\n");
 
-    while (!qemu_file_get_error(s->file) && s->buffer_size) {
-        buffered_flush(s);
-        if (s->freeze_output)
-            s->wait_for_unfreeze(s->opaque);
-    }
+    buffered_drain(s);
 
     ret = s->close(s->opaque);
 
@@ -289,3 +294,8 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
 
     return s->file;
 }
+
+void qemu_buffered_file_drain_buffer(void *buffered_file)
+{
+    buffered_drain(buffered_file);
+}
diff --git a/buffered_file.h b/buffered_file.h
index 98d358b..cd8e1e8 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -26,5 +26,6 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
                                   BufferedPutReadyFunc *put_ready,
                                   BufferedWaitForUnfreezeFunc *wait_for_unfreeze,
                                   BufferedCloseFunc *close);
+void qemu_buffered_file_drain_buffer(void *buffered_file);
 
 #endif
diff --git a/hw/hw.h b/hw/hw.h
index d508b4e..a59b770 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -61,6 +61,7 @@ QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
+void qemu_buffered_file_drain(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
diff --git a/savevm.c b/savevm.c
index 1d9e218..891c4fd 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,7 @@
 #include "qemu-queue.h"
 #include "qemu-timer.h"
 #include "cpus.h"
+#include "buffered_file.h"
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
@@ -475,6 +476,11 @@ void qemu_fflush(QEMUFile *f)
     }
 }
 
+void qemu_buffered_file_drain(QEMUFile *f)
+{
+    qemu_buffered_file_drain_buffer(f->opaque);
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread
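The drain loop factored out here can be sketched as follows: flush the pending buffer repeatedly, waiting for the transport to unfreeze whenever a write would block, until nothing is queued or an error occurs. The buffered-file state and the flush/wait behaviour are mocked for illustration; in QEMU the real loop lives in buffered_file.c and calls back into the migration code.

```c
#include <assert.h>

typedef struct {
    int buffer_size;    /* bytes still queued */
    int freeze_output;  /* transport would block */
    int error;          /* sticky error flag */
} MiniBuffered;

/* Mock flush: pretend the transport accepts up to 4 bytes and then
 * would block, freezing further output. */
static void flush(MiniBuffered *s)
{
    int n = s->buffer_size < 4 ? s->buffer_size : 4;
    s->buffer_size -= n;
    if (s->buffer_size > 0) {
        s->freeze_output = 1;
    }
}

/* Mock wait: block until the transport is writable again. */
static void wait_for_unfreeze(MiniBuffered *s)
{
    s->freeze_output = 0;
}

/* The drain loop: same shape as buffered_drain() in the patch. */
static void drain(MiniBuffered *s)
{
    while (!s->error && s->buffer_size) {
        flush(s);
        if (s->freeze_output) {
            wait_for_unfreeze(s);
        }
    }
}
```

Note the error check: on a broken stream the loop gives up immediately instead of spinning, which matters once postcopy calls this on a hot path rather than only at close time.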

* [PATCH 14/21] migration: export migrate_fd_completed() and migrate_fd_cleanup()
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This will be used by postcopy migration.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 migration.c |    4 ++--
 migration.h |    2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/migration.c b/migration.c
index 412fdfe..057dde7 100644
--- a/migration.c
+++ b/migration.c
@@ -166,7 +166,7 @@ static void migrate_fd_monitor_suspend(MigrationState *s, Monitor *mon)
     }
 }
 
-static int migrate_fd_cleanup(MigrationState *s)
+int migrate_fd_cleanup(MigrationState *s)
 {
     int ret = 0;
 
@@ -198,7 +198,7 @@ void migrate_fd_error(MigrationState *s)
     migrate_fd_cleanup(s);
 }
 
-static void migrate_fd_completed(MigrationState *s)
+void migrate_fd_completed(MigrationState *s)
 {
     DPRINTF("setting completed state\n");
     if (migrate_fd_cleanup(s) < 0) {
diff --git a/migration.h b/migration.h
index 6459457..0a5e66f 100644
--- a/migration.h
+++ b/migration.h
@@ -64,7 +64,9 @@ int fd_start_incoming_migration(const char *path);
 
 int fd_start_outgoing_migration(MigrationState *s, const char *fdname);
 
+int migrate_fd_cleanup(MigrationState *s);
 void migrate_fd_error(MigrationState *s);
+void migrate_fd_completed(MigrationState *s);
 
 void migrate_fd_connect(MigrationState *s);
 
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 15/21] migration: factor out parameters into MigrationParams
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Introduce a MigrationParams structure to hold migration parameters.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 block-migration.c |    8 ++++----
 hw/hw.h           |    2 +-
 migration.c       |   16 +++++++++-------
 migration.h       |    8 ++++++--
 qemu-common.h     |    1 +
 savevm.c          |   12 ++++++++----
 sysemu.h          |    4 ++--
 7 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/block-migration.c b/block-migration.c
index 2b7edbc..c320913 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -706,13 +706,13 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }
 
-static void block_set_params(int blk_enable, int shared_base, void *opaque)
+static void block_set_params(const MigrationParams *params, void *opaque)
 {
-    block_mig_state.blk_enable = blk_enable;
-    block_mig_state.shared_base = shared_base;
+    block_mig_state.blk_enable = params->blk;
+    block_mig_state.shared_base = params->shared;
 
     /* shared base means that blk_enable = 1 */
-    block_mig_state.blk_enable |= shared_base;
+    block_mig_state.blk_enable |= params->shared;
 }
 
 void blk_mig_init(void)
diff --git a/hw/hw.h b/hw/hw.h
index a59b770..c17f837 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -250,7 +250,7 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 int64_t qemu_ftell(QEMUFile *f);
 int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);
 
-typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque);
+typedef void SaveSetParamsHandler(const MigrationParams *params, void * opaque);
 typedef void SaveStateHandler(QEMUFile *f, void *opaque);
 typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage,
                                  void *opaque);
diff --git a/migration.c b/migration.c
index 057dde7..2cef246 100644
--- a/migration.c
+++ b/migration.c
@@ -365,7 +365,7 @@ void migrate_fd_connect(MigrationState *s)
                                       migrate_fd_close);
 
     DPRINTF("beginning savevm\n");
-    ret = qemu_savevm_state_begin(s->mon, s->file, s->blk, s->shared);
+    ret = qemu_savevm_state_begin(s->mon, s->file, &s->params);
     if (ret < 0) {
         DPRINTF("failed, %d\n", ret);
         migrate_fd_error(s);
@@ -374,15 +374,15 @@ void migrate_fd_connect(MigrationState *s)
     migrate_fd_put_ready(s);
 }
 
-static MigrationState *migrate_init(Monitor *mon, int detach, int blk, int inc)
+static MigrationState *migrate_init(Monitor *mon, int detach,
+                                    const MigrationParams *params)
 {
     MigrationState *s = migrate_get_current();
     int64_t bandwidth_limit = s->bandwidth_limit;
 
     memset(s, 0, sizeof(*s));
     s->bandwidth_limit = bandwidth_limit;
-    s->blk = blk;
-    s->shared = inc;
+    s->params = *params;
 
     /* s->mon is used for two things:
        - pass fd in fd migration
@@ -414,13 +414,15 @@ void migrate_del_blocker(Error *reason)
 int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
     MigrationState *s = migrate_get_current();
+    MigrationParams params;
     const char *p;
     int detach = qdict_get_try_bool(qdict, "detach", 0);
-    int blk = qdict_get_try_bool(qdict, "blk", 0);
-    int inc = qdict_get_try_bool(qdict, "inc", 0);
     const char *uri = qdict_get_str(qdict, "uri");
     int ret;
 
+    params.blk = qdict_get_try_bool(qdict, "blk", 0);
+    params.shared = qdict_get_try_bool(qdict, "inc", 0);
+
     if (s->state == MIG_STATE_ACTIVE) {
         monitor_printf(mon, "migration already in progress\n");
         return -1;
@@ -436,7 +438,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
         return -1;
     }
 
-    s = migrate_init(mon, detach, blk, inc);
+    s = migrate_init(mon, detach, &params);
 
     if (strstart(uri, "tcp:", &p)) {
         ret = tcp_start_outgoing_migration(s, p);
diff --git a/migration.h b/migration.h
index 0a5e66f..2e79779 100644
--- a/migration.h
+++ b/migration.h
@@ -19,6 +19,11 @@
 #include "notify.h"
 #include "error.h"
 
+struct MigrationParams {
+    int blk;
+    int shared;
+};
+
 typedef struct MigrationState MigrationState;
 
 struct MigrationState
@@ -32,8 +37,7 @@ struct MigrationState
     int (*close)(MigrationState *s);
     int (*write)(MigrationState *s, const void *buff, size_t size);
     void *opaque;
-    int blk;
-    int shared;
+    MigrationParams params;
 };
 
 void process_incoming_migration(QEMUFile *f);
diff --git a/qemu-common.h b/qemu-common.h
index b2de015..725922b 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -240,6 +240,7 @@ typedef struct SSIBus SSIBus;
 typedef struct EventNotifier EventNotifier;
 typedef struct VirtIODevice VirtIODevice;
 typedef struct QEMUSGList QEMUSGList;
+typedef struct MigrationParams MigrationParams;
 
 typedef uint64_t pcibus_t;
 
diff --git a/savevm.c b/savevm.c
index 891c4fd..2d8e09f 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1564,8 +1564,8 @@ bool qemu_savevm_state_blocked(Monitor *mon)
     return false;
 }
 
-int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared)
+int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f,
+                            const MigrationParams *params)
 {
     SaveStateEntry *se;
     int ret;
@@ -1574,7 +1574,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
         if(se->set_params == NULL) {
             continue;
 	}
-	se->set_params(blk_enable, shared, se->opaque);
+	se->set_params(params, se->opaque);
     }
     
     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
@@ -1712,13 +1712,17 @@ void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
 static int qemu_savevm_state(Monitor *mon, QEMUFile *f)
 {
     int ret;
+    MigrationParams params = {
+        .blk = 0,
+        .shared = 0,
+    };
 
     if (qemu_savevm_state_blocked(mon)) {
         ret = -EINVAL;
         goto out;
     }
 
-    ret = qemu_savevm_state_begin(mon, f, 0, 0);
+    ret = qemu_savevm_state_begin(mon, f, &params);
     if (ret < 0)
         goto out;
 
diff --git a/sysemu.h b/sysemu.h
index 3806901..d606525 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -66,8 +66,8 @@ void do_info_snapshots(Monitor *mon);
 void qemu_announce_self(void);
 
 bool qemu_savevm_state_blocked(Monitor *mon);
-int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared);
+int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f,
+                            const MigrationParams *params);
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-- 
1.7.1.1



* [PATCH 16/21] umem.h: import Linux umem.h
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 linux-headers/linux/umem.h |   83 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 83 insertions(+), 0 deletions(-)
 create mode 100644 linux-headers/linux/umem.h

diff --git a/linux-headers/linux/umem.h b/linux-headers/linux/umem.h
new file mode 100644
index 0000000..e1a8633
--- /dev/null
+++ b/linux-headers/linux/umem.h
@@ -0,0 +1,83 @@
+/*
+ * User process backed memory.
+ * This is mainly for KVM post copy.
+ *
+ * Copyright (c) 2011,
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_UMEM_H
+#define __LINUX_UMEM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#ifdef __KERNEL__
+#include <linux/compiler.h>
+#else
+#define __user
+#endif
+
+#define UMEM_ID_MAX	256
+#define UMEM_NAME_MAX	256
+
+struct umem_name {
+	char id[UMEM_ID_MAX];		/* non-zero terminated */
+	char name[UMEM_NAME_MAX];	/* non-zero terminated */
+};
+
+struct umem_list {
+	__u32 nr;
+	__u32 padding;
+	struct umem_name names[0];
+};
+
+struct umem_create {
+	__u64 size;	/* in bytes */
+	__s32 umem_fd;
+	__s32 shmem_fd;
+	__u32 async_req_max;
+	__u32 sync_req_max;
+	struct umem_name name;
+};
+
+struct umem_page_request {
+	__u64 __user *pgoffs;
+	__u32 nr;
+	__u32 padding;
+};
+
+struct umem_page_cached {
+	__u64 __user *pgoffs;
+	__u32 nr;
+	__u32 padding;
+};
+
+#define UMEMIO	0x1E
+
+/* ioctl for umem_dev fd */
+#define UMEM_DEV_CREATE_UMEM	_IOWR(UMEMIO, 0x0, struct umem_create)
+#define UMEM_DEV_LIST		_IOWR(UMEMIO, 0x1, struct umem_list)
+#define UMEM_DEV_REATTACH	_IOWR(UMEMIO, 0x2, struct umem_create)
+
+/* ioctl for umem fd */
+#define UMEM_GET_PAGE_REQUEST	_IOWR(UMEMIO, 0x10, struct umem_page_request)
+#define UMEM_MARK_PAGE_CACHED	_IOW (UMEMIO, 0x11, struct umem_page_cached)
+#define UMEM_MAKE_VMA_ANONYMOUS	_IO  (UMEMIO, 0x12)
+
+#endif /* __LINUX_UMEM_H */
-- 
1.7.1.1



* [PATCH 17/21] update-linux-headers.sh: teach umem.h to update-linux-headers.sh
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 scripts/update-linux-headers.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 9d2a4bc..2afdd54 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -43,7 +43,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h kvm_para.h vhost.h virtio_config.h virtio_ring.h; do
+for header in kvm.h kvm_para.h vhost.h virtio_config.h virtio_ring.h umem.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
 if [ -L "$linux/source" ]; then
-- 
1.7.1.1



* [PATCH 18/21] configure: add CONFIG_POSTCOPY option
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Add configure options to enable/disable postcopy mode. There is no dynamic (runtime) test yet.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 configure |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 640e815..440fa9e 100755
--- a/configure
+++ b/configure
@@ -190,6 +190,7 @@ opengl=""
 zlib="yes"
 guest_agent="yes"
 libiscsi=""
+postcopy="yes"
 
 # parse CC options first
 for opt do
@@ -789,6 +790,10 @@ for opt do
   ;;
   --disable-guest-agent) guest_agent="no"
   ;;
+  --enable-postcopy) postcopy="yes"
+  ;;
+  --disable-postcopy) postcopy="no"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -1075,6 +1080,8 @@ echo "  --disable-usb-redir      disable usb network redirection support"
 echo "  --enable-usb-redir       enable usb network redirection support"
 echo "  --disable-guest-agent    disable building of the QEMU Guest Agent"
 echo "  --enable-guest-agent     enable building of the QEMU Guest Agent"
+echo "  --disable-postcopy       disable postcopy mode for live migration"
+echo "  --enable-postcopy        enable postcopy mode for live migration"
 echo ""
 echo "NOTE: The object files are built at the place where configure is launched"
 exit 1
@@ -2879,6 +2886,7 @@ echo "usb net redir     $usb_redir"
 echo "OpenGL support    $opengl"
 echo "libiscsi support  $libiscsi"
 echo "build guest agent $guest_agent"
+echo "postcopy support  $postcopy"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -3192,6 +3200,10 @@ if test "$libiscsi" = "yes" ; then
   echo "CONFIG_LIBISCSI=y" >> $config_host_mak
 fi
 
+if test "$postcopy" = "yes" ; then
+  echo "CONFIG_POSTCOPY=y" >> $config_host_mak
+fi
+
 # XXX: suppress that
 if [ "$bsd" = "yes" ] ; then
   echo "CONFIG_BSD=y" >> $config_host_mak
-- 
1.7.1.1



* [PATCH 19/21] postcopy: introduce -postcopy and -postcopy-flags option
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This patch prepares for postcopy live migration.
It introduces the -postcopy option and its internal flag, incoming_postcopy.
It also introduces -postcopy-flags for changing the behavior of incoming
postcopy, mainly for benchmark/debug purposes.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>

---
 migration.h     |    3 +++
 qemu-options.hx |   22 ++++++++++++++++++++++
 vl.c            |    8 ++++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/migration.h b/migration.h
index 2e79779..29f468c 100644
--- a/migration.h
+++ b/migration.h
@@ -105,4 +105,7 @@ void migrate_add_blocker(Error *reason);
  */
 void migrate_del_blocker(Error *reason);
 
+extern bool incoming_postcopy;
+extern unsigned long incoming_postcopy_flags;
+
 #endif
diff --git a/qemu-options.hx b/qemu-options.hx
index a60191f..5c5b8f3 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2497,6 +2497,28 @@ STEXI
 Prepare for incoming migration, listen on @var{port}.
 ETEXI
 
+DEF("postcopy", 0, QEMU_OPTION_postcopy,
+    "-postcopy	postcopy incoming migration when -incoming is specified\n",
+    QEMU_ARCH_ALL)
+STEXI
+@item -postcopy
+@findex -postcopy
+start incoming migration in postcopy mode.
+ETEXI
+
+DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
+    "-postcopy-flags unsigned-int(flags)\n"
+    "	                flags for postcopy incoming migration\n"
+    "                   when -incoming and -postcopy are specified.\n"
+    "                   This is for benchmark/debug purpose (default: 0)\n",
+    QEMU_ARCH_ALL)
+STEXI
+@item -postcopy-flags int
+@findex -postcopy-flags
+Specify flags for incoming postcopy migration when -incoming and -postcopy are
+specified. This is for benchamrk/debug purpose. (default: 0)
+ETEXI
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 STEXI
diff --git a/vl.c b/vl.c
index a4c9489..5430b8c 100644
--- a/vl.c
+++ b/vl.c
@@ -188,6 +188,8 @@ int mem_prealloc = 0; /* force preallocation of physical target memory */
 int nb_nics;
 NICInfo nd_table[MAX_NICS];
 int autostart;
+bool incoming_postcopy = false; /* When -incoming is specified, postcopy mode */
+unsigned long incoming_postcopy_flags = 0; /* flags for postcopy incoming mode */
 static int rtc_utc = 1;
 static int rtc_date_offset = -1; /* -1 means no change */
 QEMUClock *rtc_clock;
@@ -2969,6 +2971,12 @@ int main(int argc, char **argv, char **envp)
             case QEMU_OPTION_incoming:
                 incoming = optarg;
                 break;
+            case QEMU_OPTION_postcopy:
+                incoming_postcopy = true;
+                break;
+            case QEMU_OPTION_postcopy_flags:
+                incoming_postcopy_flags = strtoul(optarg, NULL, 0);
+                break;
             case QEMU_OPTION_nodefaults:
                 default_serial = 0;
                 default_parallel = 0;
-- 
1.7.1.1



* [PATCH 20/21] postcopy outgoing: add -p and -n option to migrate command
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29  1:25   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Add a -p option to the migrate command to select postcopy mode, and
introduce a postcopy migration parameter indicating that postcopy mode is
enabled. Also add a -n option, which disables background transfer of pages
during postcopy migration.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 hmp-commands.hx |   12 ++++++++----
 migration.c     |    2 ++
 migration.h     |    2 ++
 qmp-commands.hx |   10 +++++++---
 savevm.c        |    2 ++
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 14838b7..42a5f7e 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -746,24 +746,28 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,postcopy:-p,nobg:-n,uri:s",
+        .params     = "[-d] [-b] [-i] [-p [-n]] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t-p for migration with postcopy mode enabled"
+		      "\n\t\t\t-n for no background transfer of postcopy mode",
         .user_print = monitor_user_noop,	
 	.mhandler.cmd_new = do_migrate,
     },
 
 
 STEXI
-@item migrate [-d] [-b] [-i] @var{uri}
+@item migrate [-d] [-b] [-i] [-p [-n]] @var{uri}
 @findex migrate
 Migrate to @var{uri} (using -d to not wait for completion).
 	-b for migration with full copy of disk
 	-i for migration with incremental copy of disk (base image is shared)
+	-p for migration with postcopy mode enabled
+	-n for migration with postcopy mode enabled without background transfer
 ETEXI
 
     {
diff --git a/migration.c b/migration.c
index 2cef246..0149ab3 100644
--- a/migration.c
+++ b/migration.c
@@ -422,6 +422,8 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
 
     params.blk = qdict_get_try_bool(qdict, "blk", 0);
     params.shared = qdict_get_try_bool(qdict, "inc", 0);
+    params.postcopy = qdict_get_try_bool(qdict, "postcopy", 0);
+    params.nobg = qdict_get_try_bool(qdict, "nobg", 0);
 
     if (s->state == MIG_STATE_ACTIVE) {

* [Qemu-devel] [PATCH 20/21] postcopy outgoing: add -p and -n option to migrate command
@ 2011-12-29  1:25   ` Isaku Yamahata
From: Isaku Yamahata @ 2011-12-29  1:25 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Add a -p option to the migrate command to select postcopy mode, and
introduce a postcopy parameter in MigrationParams to record that postcopy
mode is enabled. Also add a -n option, valid only together with -p, which
disables background transfer of pages.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 hmp-commands.hx |   12 ++++++++----
 migration.c     |    2 ++
 migration.h     |    2 ++
 qmp-commands.hx |   10 +++++++---
 savevm.c        |    2 ++
 5 files changed, 21 insertions(+), 7 deletions(-)
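The new flags map onto MigrationParams: -p sets params.postcopy and -n sets
params.nobg, and -n is only meaningful together with -p (hence the
"[-p [-n]]" syntax). A minimal standalone sketch of that constraint follows;
the MigrationParams struct and migrate_flags_valid() here are local stand-ins
for illustration, not QEMU's actual header or API:

```c
#include <stdbool.h>

/* Minimal local copy of the struct this patch extends (not QEMU's header). */
struct MigrationParams {
    int blk;        /* -b: full disk copy */
    int shared;     /* -i: incremental disk copy */
    int postcopy;   /* -p: postcopy mode */
    int nobg;       /* -n: no background transfer (postcopy only) */
};

/* Hypothetical validity check mirroring the "[-p [-n]]" syntax:
 * -n without -p is rejected, everything else is accepted. */
bool migrate_flags_valid(const struct MigrationParams *p)
{
    return !(p->nobg && !p->postcopy);
}
```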

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 14838b7..42a5f7e 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -746,24 +746,28 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,postcopy:-p,nobg:-n,uri:s",
+        .params     = "[-d] [-b] [-i] [-p [-n]] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t-p for migration with postcopy mode enabled"
+		      "\n\t\t\t-n for no background transfer of postcopy mode",
         .user_print = monitor_user_noop,	
 	.mhandler.cmd_new = do_migrate,
     },
 
 
 STEXI
-@item migrate [-d] [-b] [-i] @var{uri}
+@item migrate [-d] [-b] [-i] [-p [-n]] @var{uri}
 @findex migrate
 Migrate to @var{uri} (using -d to not wait for completion).
 	-b for migration with full copy of disk
 	-i for migration with incremental copy of disk (base image is shared)
+	-p for migration with postcopy mode enabled
+	-n for migration with postcopy mode enabled without background transfer
 ETEXI
 
     {
diff --git a/migration.c b/migration.c
index 2cef246..0149ab3 100644
--- a/migration.c
+++ b/migration.c
@@ -422,6 +422,8 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
 
     params.blk = qdict_get_try_bool(qdict, "blk", 0);
     params.shared = qdict_get_try_bool(qdict, "inc", 0);
+    params.postcopy = qdict_get_try_bool(qdict, "postcopy", 0);
+    params.nobg = qdict_get_try_bool(qdict, "nobg", 0);
 
     if (s->state == MIG_STATE_ACTIVE) {
         monitor_printf(mon, "migration already in progress\n");
diff --git a/migration.h b/migration.h
index 29f468c..90ae362 100644
--- a/migration.h
+++ b/migration.h
@@ -22,6 +22,8 @@
 struct MigrationParams {
     int blk;
     int shared;
+    int postcopy;
+    int nobg;
 };
 
 typedef struct MigrationState MigrationState;
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 7e3f4b9..67c7df6 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -430,13 +430,15 @@ EQMP
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,postcopy:-p,nobg:-n,uri:s",
+        .params     = "[-d] [-b] [-i] [-p [-n]] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t-p for migration with postcopy mode enabled"
+		      "\n\t\t\t-n for no background transfer of postcopy mode",
         .user_print = monitor_user_noop,	
 	.mhandler.cmd_new = do_migrate,
     },
@@ -451,6 +453,8 @@ Arguments:
 
 - "blk": block migration, full disk copy (json-bool, optional)
 - "inc": incremental disk copy (json-bool, optional)
+- "postcopy": postcopy migration (json-bool, optional)
+- "nobg": postcopy without background transfer (json-bool, optional)
 - "uri": Destination URI (json-string)
 
 Example:
diff --git a/savevm.c b/savevm.c
index 2d8e09f..bafb706 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1715,6 +1715,8 @@ static int qemu_savevm_state(Monitor *mon, QEMUFile *f)
     MigrationParams params = {
         .blk = 0,
         .shared = 0,
+        .postcopy = 0,
+        .nobg = 0,
     };
 
     if (qemu_savevm_state_blocked(mon)) {
-- 
1.7.1.1


* [PATCH 21/21] postcopy: implement postcopy livemigration
@ 2011-12-29  1:26   ` Isaku Yamahata
From: Isaku Yamahata @ 2011-12-29  1:26 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This patch implements postcopy live migration.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 Makefile.target           |    4 +
 arch_init.c               |   26 +-
 cpu-all.h                 |    7 +
 exec.c                    |   20 +-
 migration-exec.c          |    8 +
 migration-fd.c            |   30 +
 migration-postcopy-stub.c |   77 ++
 migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
 migration-tcp.c           |   37 +-
 migration-unix.c          |   32 +-
 migration.c               |   31 +
 migration.h               |   30 +
 qemu-common.h             |    1 +
 qemu-options.hx           |    5 +-
 umem.c                    |  379 +++++++++
 umem.h                    |  105 +++
 vl.c                      |   14 +-
 17 files changed, 2677 insertions(+), 20 deletions(-)
 create mode 100644 migration-postcopy-stub.c
 create mode 100644 migration-postcopy.c
 create mode 100644 umem.c
 create mode 100644 umem.h
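Each umem request message added below must fit in QEMUFile's 32 KiB internal
buffer (IO_BUF_SIZE), so postcopy_incoming_send_req() splits larger requests
into *_CONT messages of at most MAX_PAGE_NR page offsets each. A standalone
sketch of that size budget and splitting follows (note the patch's size
comment reserves 2 bytes for nr even though qemu_put_be32() emits 4; the
256-byte idstr reservation leaves enough slack either way). The function name
umem_req_message_count() is illustrative, not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* QEMUFile's internal buffer size; one request message must fit in it. */
#define IO_BUF_SIZE     (32 * 1024)

/* Header reservation as in the patch: cmd (1) + idstr length (1) +
 * idstr (up to 256) + nr count (reserved as 2; qemu_put_be32() actually
 * writes 4 bytes, covered by the idstr slack). */
#define MAX_PAGE_NR     ((IO_BUF_SIZE - 1 - 1 - 256 - 2) / sizeof(uint64_t))

/* Number of wire messages needed for a request covering nr page offsets,
 * mirroring the splitting loop in postcopy_incoming_send_req(): the first
 * message carries up to MAX_PAGE_NR offsets, the rest go in CONT messages. */
unsigned umem_req_message_count(uint32_t nr)
{
    unsigned msgs = 0;
    do {
        uint32_t chunk = nr < MAX_PAGE_NR ? nr : (uint32_t)MAX_PAGE_NR;
        nr -= chunk;
        msgs++;            /* one message, even for an empty request */
    } while (nr > 0);
    return msgs;
}
```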

diff --git a/Makefile.target b/Makefile.target
index 3261383..d94c53f 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
 CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
 CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
 CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
+CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
 
 include ../config-host.mak
 include config-devices.mak
@@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
 obj-y += memory.o
 LIBS+=-lz
 
+common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
+common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
+
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
 QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
diff --git a/arch_init.c b/arch_init.c
index bc53092..8b3130d 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
+static bool outgoing_postcopy = false;
+
+void ram_save_set_params(const MigrationParams *params, void *opaque)
+{
+    outgoing_postcopy = params->postcopy;
+}
+
 static RAMBlock *last_block_sent = NULL;
 
 int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
@@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t expected_time = 0;
     int ret;
 
+    if (stage == 1) {
+        last_block_sent = NULL;
+
+        bytes_transferred = 0;
+        last_block = NULL;
+        last_offset = 0;
+    }
+    if (outgoing_postcopy) {
+        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
+    }
+
     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
         return 0;
@@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     }
 
     if (stage == 1) {
-        bytes_transferred = 0;
-        last_block_sent = NULL;
-        last_block = NULL;
-        last_offset = 0;
         sort_ram_list();
 
         /* Make sure all dirty bits are set */
@@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
     int flags;
     int error;
 
+    if (incoming_postcopy) {
+        return postcopy_incoming_ram_load(f, opaque, version_id);
+    }
+
     if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
         return -EINVAL;
     }
diff --git a/cpu-all.h b/cpu-all.h
index 0244f7a..2e9d8a7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
 /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
 #define RAM_PREALLOC_MASK   (1 << 0)
 
+/* RAM is allocated via umem for postcopy incoming mode */
+#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
+
 typedef struct RAMBlock {
     uint8_t *host;
     ram_addr_t offset;
@@ -485,6 +488,10 @@ typedef struct RAMBlock {
 #if defined(__linux__) && !defined(TARGET_S390X)
     int fd;
 #endif
+
+#ifdef CONFIG_POSTCOPY
+    UMem *umem;    /* for incoming postcopy mode */
+#endif
 } RAMBlock;
 
 typedef struct RAMList {
diff --git a/exec.c b/exec.c
index c8c6692..90b0491 100644
--- a/exec.c
+++ b/exec.c
@@ -35,6 +35,7 @@
 #include "qemu-timer.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "migration.h"
 #if defined(CONFIG_USER_ONLY)
 #include <qemu.h>
 #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
         new_block->host = host;
         new_block->flags |= RAM_PREALLOC_MASK;
     } else {
+#ifdef CONFIG_POSTCOPY
+        if (incoming_postcopy) {
+            postcopy_incoming_ram_alloc(name, size,
+                                        &new_block->host, &new_block->umem);
+            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
+        } else
+#endif
         if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
             new_block->host = file_ram_alloc(new_block, size, mem_path);
@@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
             QLIST_REMOVE(block, next);
             if (block->flags & RAM_PREALLOC_MASK) {
                 ;
-            } else if (mem_path) {
+            }
+#ifdef CONFIG_POSTCOPY
+            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
+                postcopy_incoming_ram_free(block->umem);
+            }
+#endif
+            else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
                 if (block->fd) {
                     munmap(block->host, block->length);
@@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
             } else {
                 flags = MAP_FIXED;
                 munmap(vaddr, length);
+                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
+                    postcopy_incoming_qemu_pages_unmapped(addr, length);
+                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
+                }
                 if (mem_path) {
 #if defined(__linux__) && !defined(TARGET_S390X)
                     if (block->fd) {
diff --git a/migration-exec.c b/migration-exec.c
index e14552e..2bd0c3b 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
 {
     FILE *f;
 
+    if (s->params.postcopy) {
+        return -ENOSYS;
+    }
+
     f = popen(command, "w");
     if (f == NULL) {
         DPRINTF("Unable to popen exec target\n");
@@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
 {
     QEMUFile *f;
 
+    if (incoming_postcopy) {
+        return -ENOSYS;
+    }
+
     DPRINTF("Attempting to start an incoming migration\n");
     f = qemu_popen_cmd(command, "r");
     if(f == NULL) {
diff --git a/migration-fd.c b/migration-fd.c
index 6211124..5a62ab9 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
     s->write = fd_write;
     s->close = fd_close;
 
+    if (s->params.postcopy) {
+        int flags = fcntl(s->fd, F_GETFL);
+        if ((flags & O_ACCMODE) != O_RDWR) {
+            goto err_after_open;
+        }
+
+        s->fd_read = dup(s->fd);
+        if (s->fd_read == -1) {
+            goto err_after_open;
+        }
+        s->file_read = qemu_fdopen(s->fd_read, "r");
+        if (s->file_read == NULL) {
+            close(s->fd_read);
+            goto err_after_open;
+        }
+    }
+
     migrate_fd_connect(s);
     return 0;
 
@@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
 
     process_incoming_migration(f);
     qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_ready();
+    }
+    return;
 }
 
 int fd_start_incoming_migration(const char *infd)
@@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
     DPRINTF("Attempting to start an incoming migration via fd\n");
 
     fd = strtol(infd, NULL, 0);
+    if (incoming_postcopy) {
+        int flags = fcntl(fd, F_GETFL);
+        if ((flags & O_ACCMODE) != O_RDWR) {
+            return -EINVAL;
+        }
+    }
     f = qemu_fdopen(fd, "rb");
     if(f == NULL) {
         DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
new file mode 100644
index 0000000..0b78de7
--- /dev/null
+++ b/migration-postcopy-stub.c
@@ -0,0 +1,77 @@
+/*
+ * migration-postcopy-stub.c: postcopy livemigration
+ *                            stub functions for non-supported hosts
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "sysemu.h"
+#include "migration.h"
+
+int postcopy_outgoing_create_read_socket(MigrationState *s)
+{
+    return -ENOSYS;
+}
+
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque)
+{
+    return -ENOSYS;
+}
+
+void *postcopy_outgoing_begin(MigrationState *ms)
+{
+    return NULL;
+}
+
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy)
+{
+    return -ENOSYS;
+}
+
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
+{
+    return -ENOSYS;
+}
+
+void postcopy_incoming_prepare(void)
+{
+}
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
+{
+    return -ENOSYS;
+}
+
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
+{
+}
+
+void postcopy_incoming_qemu_ready(void)
+{
+}
+
+void postcopy_incoming_qemu_cleanup(void)
+{
+}
+
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
+{
+}
diff --git a/migration-postcopy.c b/migration-postcopy.c
new file mode 100644
index 0000000..ed0d574
--- /dev/null
+++ b/migration-postcopy.c
@@ -0,0 +1,1891 @@
+/*
+ * migration-postcopy.c: postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "bitmap.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "arch_init.h"
+#include "migration.h"
+#include "umem.h"
+
+#include "memory.h"
+#define WANT_EXEC_OBSOLETE
+#include "exec-obsolete.h"
+
+//#define DEBUG_POSTCOPY
+#ifdef DEBUG_POSTCOPY
+#include <sys/syscall.h>
+#define DPRINTF(fmt, ...)                                               \
+    do {                                                                \
+        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
+               __func__, __LINE__, ## __VA_ARGS__);                     \
+    } while (0)
+#else
+#define DPRINTF(fmt, ...)       do { } while (0)
+#endif
+
+#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
+
+static void fd_close(int *fd)
+{
+    if (*fd >= 0) {
+        close(*fd);
+        *fd = -1;
+    }
+}
+
+/***************************************************************************
+ * QEMUFile for non blocking pipe
+ */
+
+/* read only */
+struct QEMUFilePipe {
+    int fd;
+    QEMUFile *file;
+};
+typedef struct QEMUFilePipe QEMUFilePipe;
+
+static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFilePipe *s = opaque;
+    ssize_t len = 0;
+
+    while (size > 0) {
+        ssize_t ret = read(s->fd, buf, size);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            }
+            if (len == 0) {
+                len = -errno;
+            }
+            break;
+        }
+
+        if (ret == 0) {
+            /* the write end of the pipe is closed */
+            break;
+        }
+        len += ret;
+        buf += ret;
+        size -= ret;
+    }
+
+    return len;
+}
+
+static int pipe_close(void *opaque)
+{
+    QEMUFilePipe *s = opaque;
+    g_free(s);
+    return 0;
+}
+
+static QEMUFile *qemu_fopen_pipe(int fd)
+{
+    QEMUFilePipe *s = g_malloc0(sizeof(*s));
+
+    s->fd = fd;
+    fcntl_setfl(fd, O_NONBLOCK);
+    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
+                             NULL, NULL, NULL);
+    return s->file;
+}
+
+/* write only */
+struct QEMUFileNonblock {
+    int fd;
+    QEMUFile *file;
+
+    /* for pipe-write nonblocking mode */
+#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
+    uint8_t *buffer;
+    size_t buffer_size;
+    size_t buffer_capacity;
+    bool freeze_output;
+};
+typedef struct QEMUFileNonblock QEMUFileNonblock;
+
+static void nonblock_flush_buffer(QEMUFileNonblock *s)
+{
+    size_t offset = 0;
+    ssize_t ret;
+
+    while (offset < s->buffer_size) {
+        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EAGAIN) {
+                s->freeze_output = true;
+            } else {
+                qemu_file_set_error(s->file, errno);
+            }
+            break;
+        }
+
+        if (ret == 0) {
+            DPRINTF("ret == 0\n");
+            break;
+        }
+
+        offset += ret;
+    }
+
+    if (offset > 0) {
+        assert(s->buffer_size >= offset);
+        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
+        s->buffer_size -= offset;
+    }
+    if (s->buffer_size > 0) {
+        s->freeze_output = true;
+    }
+}
+
+static int nonblock_put_buffer(void *opaque,
+                               const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileNonblock *s = opaque;
+    int error;
+    ssize_t len = 0;
+
+    error = qemu_file_get_error(s->file);
+    if (error) {
+        return error;
+    }
+
+    nonblock_flush_buffer(s);
+    error = qemu_file_get_error(s->file);
+    if (error) {
+        return error;
+    }
+
+    while (!s->freeze_output && size > 0) {
+        ssize_t ret;
+        assert(s->buffer_size == 0);
+
+        ret = write(s->fd, buf, size);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EAGAIN) {
+                s->freeze_output = true;
+            } else {
+                qemu_file_set_error(s->file, errno);
+            }
+            break;
+        }
+
+        len += ret;
+        buf += ret;
+        size -= ret;
+    }
+
+    if (size > 0) {
+        int inc = size - (s->buffer_capacity - s->buffer_size);
+        if (inc > 0) {
+            s->buffer_capacity +=
+                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
+            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
+        }
+        memcpy(s->buffer + s->buffer_size, buf, size);
+        s->buffer_size += size;
+
+        len += size;
+    }
+
+    return len;
+}
+
+static int nonblock_pending_size(QEMUFileNonblock *s)
+{
+    return qemu_pending_size(s->file) + s->buffer_size;
+}
+
+static void nonblock_fflush(QEMUFileNonblock *s)
+{
+    s->freeze_output = false;
+    nonblock_flush_buffer(s);
+    if (!s->freeze_output) {
+        qemu_fflush(s->file);
+    }
+}
+
+static void nonblock_wait_for_flush(QEMUFileNonblock *s)
+{
+    while (nonblock_pending_size(s) > 0) {
+        fd_set fds;
+        FD_ZERO(&fds);
+        FD_SET(s->fd, &fds);
+        select(s->fd + 1, NULL, &fds, NULL, NULL);
+
+        nonblock_fflush(s);
+    }
+}
+
+static int nonblock_close(void *opaque)
+{
+    QEMUFileNonblock *s = opaque;
+    nonblock_wait_for_flush(s);
+    g_free(s->buffer);
+    g_free(s);
+    return 0;
+}
+
+static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
+{
+    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
+
+    s->fd = fd;
+    fcntl_setfl(fd, O_NONBLOCK);
+    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
+                             NULL, NULL, NULL);
+    return s;
+}
+
+/***************************************************************************
+ * umem daemon on destination <-> qemu on source protocol
+ */
+
+#define QEMU_UMEM_REQ_INIT              0x00
+#define QEMU_UMEM_REQ_ON_DEMAND         0x01
+#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
+#define QEMU_UMEM_REQ_BACKGROUND        0x03
+#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
+#define QEMU_UMEM_REQ_REMOVE            0x05
+#define QEMU_UMEM_REQ_EOC               0x06
+
+struct qemu_umem_req {
+    int8_t cmd;
+    uint8_t len;
+    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
+    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
+                           BACKGROUND, BACKGROUND_CONT, REMOVE */
+
+    /* in target page size as qemu migration protocol */
+    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
+                           BACKGROUND, BACKGROUND_CONT, REMOVE */
+};
+
+static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
+{
+    qemu_put_byte(f, strlen(idstr));
+    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
+}
+
+static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
+                                              const uint64_t *pgoffs)
+{
+    uint32_t i;
+
+    qemu_put_be32(f, nr);
+    for (i = 0; i < nr; i++) {
+        qemu_put_be64(f, pgoffs[i]);
+    }
+}
+
+static void postcopy_incoming_send_req_one(QEMUFile *f,
+                                           const struct qemu_umem_req *req)
+{
+    DPRINTF("cmd %d\n", req->cmd);
+    qemu_put_byte(f, req->cmd);
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+    case QEMU_UMEM_REQ_REMOVE:
+        postcopy_incoming_send_req_idstr(f, req->idstr);
+        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
+        break;
+    default:
+        abort();
+        break;
+    }
+}
+
+/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024.
+ * So one message size must be <= IO_BUF_SIZE
+ * cmd: 1
+ * id len: 1
+ * id: 256
+ * nr: 2
+ */
+#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 2) / sizeof(uint64_t))
+static void postcopy_incoming_send_req(QEMUFile *f,
+                                       const struct qemu_umem_req *req)
+{
+    uint32_t nr = req->nr;
+    struct qemu_umem_req tmp = *req;
+
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        postcopy_incoming_send_req_one(f, &tmp);
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+        tmp.nr = MIN(nr, MAX_PAGE_NR);
+        postcopy_incoming_send_req_one(f, &tmp);
+
+        nr -= tmp.nr;
+        tmp.pgoffs += tmp.nr;
+        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
+            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
+        } else {
+            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
+        }
+        /* fall through */
+    case QEMU_UMEM_REQ_REMOVE:
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        while (nr > 0) {
+            tmp.nr = MIN(nr, MAX_PAGE_NR);
+            postcopy_incoming_send_req_one(f, &tmp);
+
+            nr -= tmp.nr;
+            tmp.pgoffs += tmp.nr;
+        }
+        break;
+    default:
+        abort();
+        break;
+    }
+}
+
+static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
+                                            struct qemu_umem_req *req,
+                                            size_t *offset)
+{
+    int ret;
+
+    req->len = qemu_peek_byte(f, *offset);
+    *offset += 1;
+    if (req->len == 0) {
+        return -EAGAIN;
+    }
+    req->idstr = g_malloc((int)req->len + 1);
+    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
+    *offset += ret;
+    if (ret != req->len) {
+        g_free(req->idstr);
+        req->idstr = NULL;
+        return -EAGAIN;
+    }
+    req->idstr[req->len] = 0;
+    return 0;
+}
+
+static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
+                                             struct qemu_umem_req *req,
+                                             size_t *offset)
+{
+    int ret;
+    uint32_t be32;
+    uint32_t i;
+
+    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
+    *offset += sizeof(be32);
+    if (ret != sizeof(be32)) {
+        return -EAGAIN;
+    }
+
+    req->nr = be32_to_cpu(be32);
+    req->pgoffs = g_new(uint64_t, req->nr);
+    for (i = 0; i < req->nr; i++) {
+        uint64_t be64;
+        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
+        *offset += sizeof(be64);
+        if (ret != sizeof(be64)) {
+            g_free(req->pgoffs);
+            req->pgoffs = NULL;
+            return -EAGAIN;
+        }
+        req->pgoffs[i] = be64_to_cpu(be64);
+    }
+    return 0;
+}
+
+static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
+{
+    int size;
+    int ret;
+    size_t offset = 0;
+
+    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
+    if (size <= 0) {
+        return -EAGAIN;
+    }
+    offset += 1;
+
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+    case QEMU_UMEM_REQ_REMOVE:
+        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        break;
+    default:
+        abort();
+        break;
+    }
+    qemu_file_skip(f, offset);
+    DPRINTF("cmd %d\n", req->cmd);
+    return 0;
+}
+
+static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
+{
+    g_free(req->idstr);
+    g_free(req->pgoffs);
+}
+
+/***************************************************************************
+ * outgoing part
+ */
+
+#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
+#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
+#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
+
+enum POState {
+    PO_STATE_ERROR_RECEIVE,
+    PO_STATE_ACTIVE,
+    PO_STATE_EOC_RECEIVED,
+    PO_STATE_ALL_PAGES_SENT,
+    PO_STATE_COMPLETED,
+};
+typedef enum POState POState;
+
+struct PostcopyOutgoingState {
+    POState state;
+    QEMUFile *mig_read;
+    int fd_read;
+    RAMBlock *last_block_read;
+
+    QEMUFile *mig_buffered_write;
+    MigrationState *ms;
+
+    /* For nobg mode. Check if all pages are sent */
+    RAMBlock *block;
+    ram_addr_t addr;
+};
+typedef struct PostcopyOutgoingState PostcopyOutgoingState;
+
+int postcopy_outgoing_create_read_socket(MigrationState *s)
+{
+    if (!s->params.postcopy) {
+        return 0;
+    }
+
+    s->fd_read = dup(s->fd);
+    if (s->fd_read == -1) {
+        int ret = -errno;
+        perror("dup");
+        return ret;
+    }
+    s->file_read = qemu_fopen_socket(s->fd_read);
+    if (s->file_read == NULL) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque)
+{
+    int ret = 0;
+    DPRINTF("stage %d\n", stage);
+    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
+        sort_ram_list();
+        ram_save_live_mem_size(f);
+    }
+    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
+        ret = 1;
+    }
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    return ret;
+}
+
+static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
+{
+    RAMBlock *block;
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (!strncmp(idstr, block->idstr, strlen(idstr))) {
+            return block;
+        }
+    }
+    return NULL;
+}
+
+/*
+ * return value
+ *   0: continue postcopy mode
+ * > 0: completed postcopy mode.
+ * < 0: error
+ */
+static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
+                                        const struct qemu_umem_req *req,
+                                        bool *written)
+{
+    int i;
+    RAMBlock *block;
+
+    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
+    switch(req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_EOC:
+        /* tell to finish migration. */
+        if (s->state == PO_STATE_ALL_PAGES_SENT) {
+            s->state = PO_STATE_COMPLETED;
+            DPRINTF("-> PO_STATE_COMPLETED\n");
+        } else {
+            s->state = PO_STATE_EOC_RECEIVED;
+            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
+        }
+        return 1;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+        DPRINTF("idstr: %s\n", req->idstr);
+        block = postcopy_outgoing_find_block(req->idstr);
+        if (block == NULL) {
+            return -EINVAL;
+        }
+        s->last_block_read = block;
+        /* fall through */
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        DPRINTF("nr %d\n", req->nr);
+        for (i = 0; i < req->nr; i++) {
+            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
+            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
+                                    req->pgoffs[i] << TARGET_PAGE_BITS);
+            if (ret > 0) {
+                *written = true;
+            }
+        }
+        break;
+    case QEMU_UMEM_REQ_REMOVE:
+        block = postcopy_outgoing_find_block(req->idstr);
+        if (block == NULL) {
+            return -EINVAL;
+        }
+        for (i = 0; i < req->nr; i++) {
+            ram_addr_t addr = block->offset +
+                (req->pgoffs[i] << TARGET_PAGE_BITS);
+            cpu_physical_memory_reset_dirty(addr,
+                                            addr + TARGET_PAGE_SIZE,
+                                            MIGRATION_DIRTY_FLAG);
+        }
+        break;
+    default:
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
+{
+    if (s->mig_read != NULL) {
+        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
+        qemu_fclose(s->mig_read);
+        s->mig_read = NULL;
+        fd_close(&s->fd_read);
+
+        s->ms->file_read = NULL;
+        s->ms->fd_read = -1;
+    }
+}
+
+static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
+{
+    postcopy_outgoing_close_mig_read(s);
+    s->ms->postcopy = NULL;
+    g_free(s);
+}
+
+static void postcopy_outgoing_recv_handler(void *opaque)
+{
+    PostcopyOutgoingState *s = opaque;
+    bool written = false;
+    int ret = 0;
+
+    assert(s->state == PO_STATE_ACTIVE ||
+           s->state == PO_STATE_ALL_PAGES_SENT);
+
+    do {
+        struct qemu_umem_req req = {.idstr = NULL,
+                                    .pgoffs = NULL};
+
+        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
+        if (ret < 0) {
+            if (ret == -EAGAIN) {
+                ret = 0;
+            }
+            break;
+        }
+        if (s->state == PO_STATE_ACTIVE) {
+            ret = postcopy_outgoing_handle_req(s, &req, &written);
+        }
+        postcopy_outgoing_free_req(&req);
+    } while (ret == 0);
+
+    /*
+     * Flush buffered_file.
+     * Although mig_buffered_write is a rate-limited buffered file, the
+     * written pages were requested on demand by the destination, so push
+     * them out immediately, ignoring the rate limit.
+     */
+    if (written) {
+        qemu_fflush(s->mig_buffered_write);
+        /* qemu_buffered_file_drain(s->mig_buffered_write); */
+    }
+
+    if (ret < 0) {
+        switch (s->state) {
+        case PO_STATE_ACTIVE:
+            s->state = PO_STATE_ERROR_RECEIVE;
+            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
+            break;
+        case PO_STATE_ALL_PAGES_SENT:
+            s->state = PO_STATE_COMPLETED;
+            DPRINTF("-> PO_STATE_COMPLETED\n");
+            break;
+        default:
+            abort();
+        }
+    }
+    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
+        postcopy_outgoing_close_mig_read(s);
+    }
+    if (s->state == PO_STATE_COMPLETED) {
+        DPRINTF("PO_STATE_COMPLETED\n");
+        MigrationState *ms = s->ms;
+        postcopy_outgoing_completed(s);
+        migrate_fd_completed(ms);
+    }
+}
+
+void *postcopy_outgoing_begin(MigrationState *ms)
+{
+    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
+    DPRINTF("outgoing begin\n");
+    qemu_fflush(ms->file);
+
+    s->ms = ms;
+    s->state = PO_STATE_ACTIVE;
+    s->fd_read = ms->fd_read;
+    s->mig_read = ms->file_read;
+    s->mig_buffered_write = ms->file;
+    s->block = NULL;
+    s->addr = 0;
+
+    /* Make sure all dirty bits are set */
+    ram_save_memory_set_dirty();
+
+    qemu_set_fd_handler(s->fd_read,
+                        &postcopy_outgoing_recv_handler, NULL, s);
+    return s;
+}
+
+static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
+                                           PostcopyOutgoingState *s)
+{
+    assert(s->state == PO_STATE_ACTIVE);
+
+    s->state = PO_STATE_ALL_PAGES_SENT;
+    /* tell incoming side that all pages are sent */
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(f);
+    qemu_buffered_file_drain(f);
+    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
+    migrate_fd_cleanup(s->ms);
+
+    /* migrate_fd_completed() will be called later and it calls
+     * migrate_fd_cleanup() again, so create a dummy file to keep
+     * the qemu monitor working.
+     */
+    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
+                                 NULL, NULL);
+}
+
+static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
+                                                RAMBlock *block,
+                                                ram_addr_t addr)
+{
+    if (block == NULL) {
+        block = QLIST_FIRST(&ram_list.blocks);
+        addr = block->offset;
+    }
+
+    for (; block != NULL;
+         block = QLIST_NEXT(block, next),
+             addr = (block != NULL) ? block->offset : 0) {
+        for (; addr < block->offset + block->length;
+             addr += TARGET_PAGE_SIZE) {
+            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+                s->block = block;
+                s->addr = addr;
+                return 0;
+            }
+        }
+    }
+
+    return 1;
+}
+
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy)
+{
+    PostcopyOutgoingState *s = postcopy;
+
+    assert(s->state == PO_STATE_ACTIVE ||
+           s->state == PO_STATE_EOC_RECEIVED ||
+           s->state == PO_STATE_ERROR_RECEIVE);
+
+    switch (s->state) {
+    case PO_STATE_ACTIVE:
+        /* nothing. processed below */
+        break;
+    case PO_STATE_EOC_RECEIVED:
+        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+        s->state = PO_STATE_COMPLETED;
+        postcopy_outgoing_completed(s);
+        DPRINTF("PO_STATE_COMPLETED\n");
+        return 1;
+    case PO_STATE_ERROR_RECEIVE:
+        postcopy_outgoing_completed(s);
+        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
+        return -1;
+    default:
+        abort();
+    }
+
+    if (s->ms->params.nobg) {
+        /* See if all pages are sent. */
+        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
+            return 0;
+        }
+        /* ram_list can be reordered (although this does not seem to happen
+           during migration), so the whole list needs to be checked again */
+        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
+            return 0;
+        }
+
+        postcopy_outgoing_ram_all_sent(f, s);
+        return 0;
+    }
+
+    DPRINTF("outgoing background state: %d\n", s->state);
+
+    while (qemu_file_rate_limit(f) == 0) {
+        if (ram_save_block(f) == 0) { /* no more blocks */
+            assert(s->state == PO_STATE_ACTIVE);
+            postcopy_outgoing_ram_all_sent(f, s);
+            return 0;
+        }
+    }
+
+    return 0;
+}
+
+/***************************************************************************
+ * incoming part
+ */
+
+/* flags that modify the behavior of incoming mode.
+   These are for benchmark/debug purposes */
+#define INCOMING_FLAGS_FAULT_REQUEST 0x01
+
+
+static void postcopy_incoming_umemd(void);
+
+#define PIS_STATE_QUIT_RECEIVED         0x01
+#define PIS_STATE_QUIT_QUEUED           0x02
+#define PIS_STATE_QUIT_SENT             0x04
+
+#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
+                                         PIS_STATE_QUIT_QUEUED | \
+                                         PIS_STATE_QUIT_SENT)
+
+struct PostcopyIncomingState {
+    /* dest qemu state */
+    uint32_t    state;
+
+    UMemDev *dev;
+    int host_page_size;
+    int host_page_shift;
+
+    /* qemu side */
+    int to_umemd_fd;
+    QEMUFileNonblock *to_umemd;
+#define MAX_FAULTED_PAGES       256
+    struct umem_pages *faulted_pages;
+
+    int from_umemd_fd;
+    QEMUFile *from_umemd;
+    int version_id;     /* save/load format version id */
+};
+typedef struct PostcopyIncomingState PostcopyIncomingState;
+
+
+#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
+#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
+#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
+#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
+#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
+
+#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
+                                         UMEM_STATE_QUIT_SENT | \
+                                         UMEM_STATE_QUIT_RECEIVED)
+#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
+                                         UMEM_STATE_EOC_SENT | \
+                                         UMEM_STATE_QUIT_MASK)
+
+struct PostcopyIncomingUMemDaemon {
+    /* umem daemon side */
+    uint32_t state;
+
+    int host_page_size;
+    int host_page_shift;
+    int nr_host_pages_per_target_page;
+    int host_to_target_page_shift;
+    int nr_target_pages_per_host_page;
+    int target_to_host_page_shift;
+    int version_id;     /* save/load format version id */
+
+    int to_qemu_fd;
+    QEMUFileNonblock *to_qemu;
+    int from_qemu_fd;
+    QEMUFile *from_qemu;
+
+    int mig_read_fd;
+    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
+
+    int mig_write_fd;
+    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
+
+    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPUS + 1) */
+#define MAX_REQUESTS    (512 * (64 + 1))
+
+    struct umem_page_request page_request;
+    struct umem_page_cached page_cached;
+
+#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
+    struct umem_pages *present_request;
+
+    uint64_t *target_pgoffs;
+
+    /* bitmap indexed by target page offset */
+    unsigned long *phys_requested;
+
+    /* bitmap indexed by target page offset */
+    unsigned long *phys_received;
+
+    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
+    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
+};
+typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
+
+static PostcopyIncomingState state = {
+    .state = 0,
+    .dev = NULL,
+    .to_umemd_fd = -1,
+    .to_umemd = NULL,
+    .from_umemd_fd = -1,
+    .from_umemd = NULL,
+};
+
+static PostcopyIncomingUMemDaemon umemd = {
+    .state = 0,
+    .to_qemu_fd = -1,
+    .to_qemu = NULL,
+    .from_qemu_fd = -1,
+    .from_qemu = NULL,
+    .mig_read_fd = -1,
+    .mig_read = NULL,
+    .mig_write_fd = -1,
+    .mig_write = NULL,
+};
+
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
+{
+    /* incoming_postcopy makes sense only in incoming migration mode */
+    if (!incoming && incoming_postcopy) {
+        return -EINVAL;
+    }
+
+    if (!incoming_postcopy) {
+        return 0;
+    }
+
+    state.state = 0;
+    state.dev = umem_dev_new();
+    state.host_page_size = getpagesize();
+    state.host_page_shift = ffs(state.host_page_size) - 1;
+    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
+                                               ram_save_live() */
+    return 0;
+}
+
+void postcopy_incoming_ram_alloc(const char *name,
+                                 size_t size, uint8_t **hostp, UMem **umemp)
+{
+    UMem *umem;
+    size = ALIGN_UP(size, state.host_page_size);
+    umem = umem_dev_create(state.dev, size, name);
+
+    *umemp = umem;
+    *hostp = umem->umem;
+}
+
+void postcopy_incoming_ram_free(UMem *umem)
+{
+    umem_unmap(umem);
+    umem_close(umem);
+    umem_destroy(umem);
+}
+
+void postcopy_incoming_prepare(void)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (block->umem != NULL) {
+            umem_mmap(block->umem);
+        }
+    }
+}
+
+static int postcopy_incoming_ram_load_get64(QEMUFile *f,
+                                             ram_addr_t *addr, int *flags)
+{
+    *addr = qemu_get_be64(f);
+    *flags = *addr & ~TARGET_PAGE_MASK;
+    *addr &= TARGET_PAGE_MASK;
+    return qemu_file_get_error(f);
+}
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
+{
+    ram_addr_t addr;
+    int flags;
+    int error;
+
+    DPRINTF("incoming ram load\n");
+    /*
+     * RAM_SAVE_FLAG_EOS or
+     * RAM_SAVE_FLAG_MEM_SIZE + mem size + RAM_SAVE_FLAG_EOS
+     * see postcopy_outgoing_ram_save_live()
+     */
+
+    if (version_id != RAM_SAVE_VERSION_ID) {
+        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
+                version_id, RAM_SAVE_VERSION_ID);
+        return -EINVAL;
+    }
+    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
+    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
+    if (error) {
+        DPRINTF("error %d\n", error);
+        return error;
+    }
+    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
+        DPRINTF("EOS\n");
+        return 0;
+    }
+
+    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
+        DPRINTF("-EINVAL flags 0x%x\n", flags);
+        return -EINVAL;
+    }
+    error = ram_load_mem_size(f, addr);
+    if (error) {
+        DPRINTF("addr 0x%lx error %d\n", addr, error);
+        return error;
+    }
+
+    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
+    if (error) {
+        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
+        return error;
+    }
+    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
+        DPRINTF("done\n");
+        return 0;
+    }
+    DPRINTF("-EINVAL\n");
+    return -EINVAL;
+}
+
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
+{
+    int fds[2];
+    RAMBlock *block;
+
+    DPRINTF("fork\n");
+
+    /* socketpair(AF_UNIX)? */
+
+    if (qemu_pipe(fds) == -1) {
+        perror("qemu_pipe");
+        abort();
+    }
+    state.from_umemd_fd = fds[0];
+    umemd.to_qemu_fd = fds[1];
+
+    if (qemu_pipe(fds) == -1) {
+        perror("qemu_pipe");
+        abort();
+    }
+    umemd.from_qemu_fd = fds[0];
+    state.to_umemd_fd = fds[1];
+
+    pid_t child = fork();
+    if (child < 0) {
+        perror("fork");
+        abort();
+    }
+
+    if (child == 0) {
+        int mig_write_fd;
+
+        fd_close(&state.to_umemd_fd);
+        fd_close(&state.from_umemd_fd);
+        umemd.host_page_size = state.host_page_size;
+        umemd.host_page_shift = state.host_page_shift;
+
+        umemd.nr_host_pages_per_target_page =
+            TARGET_PAGE_SIZE / umemd.host_page_size;
+        umemd.nr_target_pages_per_host_page =
+            umemd.host_page_size / TARGET_PAGE_SIZE;
+
+        umemd.target_to_host_page_shift =
+            ffs(umemd.nr_host_pages_per_target_page) - 1;
+        umemd.host_to_target_page_shift =
+            ffs(umemd.nr_target_pages_per_host_page) - 1;
+
+        umemd.state = 0;
+        umemd.version_id = state.version_id;
+        umemd.mig_read_fd = mig_read_fd;
+        umemd.mig_read = mig_read;
+
+        mig_write_fd = dup(mig_read_fd);
+        if (mig_write_fd < 0) {
+            perror("could not dup for writable socket");
+            abort();
+        }
+        umemd.mig_write_fd = mig_write_fd;
+        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
+
+        postcopy_incoming_umemd(); /* noreturn */
+    }
+
+    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
+    fd_close(&umemd.to_qemu_fd);
+    fd_close(&umemd.from_qemu_fd);
+    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
+    state.faulted_pages->nr = 0;
+
+    /* close all UMem.shmem_fd */
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        umem_close_shmem(block->umem);
+    }
+    umem_qemu_wait_for_daemon(state.from_umemd_fd);
+}
+
+static void postcopy_incoming_qemu_recv_quit(void)
+{
+    RAMBlock *block;
+    if (state.state & PIS_STATE_QUIT_RECEIVED) {
+        return;
+    }
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (block->umem != NULL) {
+            umem_destroy(block->umem);
+            block->umem = NULL;
+            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
+        }
+    }
+
+    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
+    state.state |= PIS_STATE_QUIT_RECEIVED;
+    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
+    qemu_fclose(state.from_umemd);
+    state.from_umemd = NULL;
+    fd_close(&state.from_umemd_fd);
+}
+
+static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
+{
+    assert(state.to_umemd != NULL);
+
+    nonblock_fflush(state.to_umemd);
+    if (nonblock_pending_size(state.to_umemd) > 0) {
+        return;
+    }
+
+    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
+    if (state.state & PIS_STATE_QUIT_QUEUED) {
+        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
+        state.state |= PIS_STATE_QUIT_SENT;
+        qemu_fclose(state.to_umemd->file);
+        state.to_umemd = NULL;
+        fd_close(&state.to_umemd_fd);
+        g_free(state.faulted_pages);
+        state.faulted_pages = NULL;
+    }
+}
+
+static void postcopy_incoming_qemu_fflush_to_umemd(void)
+{
+    qemu_set_fd_handler(state.to_umemd->fd, NULL,
+                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
+    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
+}
+
+static void postcopy_incoming_qemu_queue_quit(void)
+{
+    if (state.state & PIS_STATE_QUIT_QUEUED) {
+        return;
+    }
+
+    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
+    umem_qemu_quit(state.to_umemd->file);
+    state.state |= PIS_STATE_QUIT_QUEUED;
+}
+
+static void postcopy_incoming_qemu_send_pages_present(void)
+{
+    if (state.faulted_pages->nr > 0) {
+        umem_qemu_send_pages_present(state.to_umemd->file,
+                                     state.faulted_pages);
+        state.faulted_pages->nr = 0;
+    }
+}
+
+static void postcopy_incoming_qemu_faulted_pages(
+    const struct umem_pages *pages)
+{
+    assert(pages->nr <= MAX_FAULTED_PAGES);
+    assert(state.faulted_pages != NULL);
+
+    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
+        postcopy_incoming_qemu_send_pages_present();
+    }
+    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
+           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
+    state.faulted_pages->nr += pages->nr;
+}
+
+static void postcopy_incoming_qemu_cleanup_umem(void);
+
+static int postcopy_incoming_qemu_handle_req_one(void)
+{
+    int offset = 0;
+    int ret;
+    uint8_t cmd;
+
+    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
+    offset += sizeof(cmd);
+    if (ret != sizeof(cmd)) {
+        return -EAGAIN;
+    }
+    DPRINTF("cmd %c\n", cmd);
+
+    switch (cmd) {
+    case UMEM_DAEMON_QUIT:
+        postcopy_incoming_qemu_recv_quit();
+        postcopy_incoming_qemu_queue_quit();
+        postcopy_incoming_qemu_cleanup_umem();
+        break;
+    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
+        struct umem_pages *pages =
+            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
+            postcopy_incoming_qemu_faulted_pages(pages);
+        }
+        g_free(pages);
+        break;
+    }
+    case UMEM_DAEMON_ERROR:
+        /* the umem daemon hit trouble and warned us to stop VM execution */
+        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
+        break;
+    default:
+        abort();
+        break;
+    }
+
+    if (state.from_umemd != NULL) {
+        qemu_file_skip(state.from_umemd, offset);
+    }
+    return 0;
+}
+
+static void postcopy_incoming_qemu_handle_req(void *opaque)
+{
+    do {
+        int ret = postcopy_incoming_qemu_handle_req_one();
+        if (ret == -EAGAIN) {
+            break;
+        }
+    } while (state.from_umemd != NULL &&
+             qemu_pending_size(state.from_umemd) > 0);
+
+    if (state.to_umemd != NULL) {
+        if (state.faulted_pages->nr > 0) {
+            postcopy_incoming_qemu_send_pages_present();
+        }
+        postcopy_incoming_qemu_fflush_to_umemd();
+    }
+}
+
+void postcopy_incoming_qemu_ready(void)
+{
+    umem_qemu_ready(state.to_umemd_fd);
+
+    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
+    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
+    qemu_set_fd_handler(state.from_umemd_fd,
+                        postcopy_incoming_qemu_handle_req, NULL, NULL);
+}
+
+static void postcopy_incoming_qemu_cleanup_umem(void)
+{
+    /* If qemu quits before postcopy completes, tell the umem daemon
+       to tear down the umem device and exit. */
+    if (state.to_umemd_fd >= 0) {
+        postcopy_incoming_qemu_queue_quit();
+        postcopy_incoming_qemu_fflush_to_umemd();
+    }
+
+    if (state.dev) {
+        umem_dev_destroy(state.dev);
+        state.dev = NULL;
+    }
+}
+
+void postcopy_incoming_qemu_cleanup(void)
+{
+    postcopy_incoming_qemu_cleanup_umem();
+    if (state.to_umemd != NULL) {
+        nonblock_wait_for_flush(state.to_umemd);
+    }
+}
+
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
+{
+    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
+    size_t len = umem_pages_size(nr);
+    ram_addr_t end = addr + size;
+    struct umem_pages *pages;
+    int i;
+
+    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
+        return;
+    }
+    pages = g_malloc(len);
+    pages->nr = nr;
+    for (i = 0; addr < end; addr += state.host_page_size, i++) {
+        pages->pgoffs[i] = addr >> state.host_page_shift;
+    }
+    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
+    g_free(pages);
+    assert(state.to_umemd != NULL);
+    postcopy_incoming_qemu_fflush_to_umemd();
+}
+
+/**************************************************************************
+ * incoming umem daemon
+ */
+
+static void postcopy_incoming_umem_recv_quit(void)
+{
+    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
+        return;
+    }
+    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
+    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
+    qemu_fclose(umemd.from_qemu);
+    umemd.from_qemu = NULL;
+    fd_close(&umemd.from_qemu_fd);
+}
+
+static void postcopy_incoming_umem_queue_quit(void)
+{
+    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
+        return;
+    }
+    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
+    umem_daemon_quit(umemd.to_qemu->file);
+    umemd.state |= UMEM_STATE_QUIT_QUEUED;
+}
+
+static void postcopy_incoming_umem_send_eoc_req(void)
+{
+    struct qemu_umem_req req;
+
+    if (umemd.state & UMEM_STATE_EOC_SENT) {
+        return;
+    }
+
+    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
+    req.cmd = QEMU_UMEM_REQ_EOC;
+    postcopy_incoming_send_req(umemd.mig_write->file, &req);
+    umemd.state |= UMEM_STATE_EOC_SENT;
+    qemu_fclose(umemd.mig_write->file);
+    umemd.mig_write = NULL;
+    fd_close(&umemd.mig_write_fd);
+}
+
+static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
+{
+    struct qemu_umem_req req;
+    int bit;
+    uint64_t target_pgoff;
+    int i;
+
+    umemd.page_request.nr = MAX_REQUESTS;
+    umem_get_page_request(block->umem, &umemd.page_request);
+    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
+            block->idstr, umemd.page_request.nr,
+            (uint64_t)umemd.page_request.pgoffs[0],
+            (uint64_t)umemd.page_request.pgoffs[1]);
+
+    if (umemd.last_block_write != block) {
+        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
+        req.idstr = block->idstr;
+    } else {
+        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
+    }
+
+    req.nr = 0;
+    req.pgoffs = umemd.target_pgoffs;
+    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
+        for (i = 0; i < umemd.page_request.nr; i++) {
+            target_pgoff =
+                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
+            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
+
+            if (!test_and_set_bit(bit, umemd.phys_requested)) {
+                req.pgoffs[req.nr] = target_pgoff;
+                req.nr++;
+            }
+        }
+    } else {
+        for (i = 0; i < umemd.page_request.nr; i++) {
+            int j;
+            target_pgoff =
+                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
+            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
+
+            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
+                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
+                    req.pgoffs[req.nr] = target_pgoff + j;
+                    req.nr++;
+                }
+            }
+        }
+    }
+
+    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
+            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
+    if (req.nr > 0 && umemd.mig_write != NULL) {
+        postcopy_incoming_send_req(umemd.mig_write->file, &req);
+        umemd.last_block_write = block;
+    }
+}
+
+static void postcopy_incoming_umem_send_pages_present(void)
+{
+    if (umemd.present_request->nr > 0) {
+        umem_daemon_send_pages_present(umemd.to_qemu->file,
+                                       umemd.present_request);
+        umemd.present_request->nr = 0;
+    }
+}
+
+static void postcopy_incoming_umem_pages_present_one(
+    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
+{
+    uint32_t i;
+    assert(nr <= MAX_PRESENT_REQUESTS);
+
+    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
+        postcopy_incoming_umem_send_pages_present();
+    }
+
+    for (i = 0; i < nr; i++) {
+        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
+            pgoffs[i] + ramblock_pgoffset;
+    }
+    umemd.present_request->nr += nr;
+}
+
+static void postcopy_incoming_umem_pages_present(
+    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
+{
+    uint32_t left = page_cached->nr;
+    uint32_t offset = 0;
+
+    while (left > 0) {
+        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
+        postcopy_incoming_umem_pages_present_one(
+            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
+
+        left -= nr;
+        offset += nr;
+    }
+}
+
+static int postcopy_incoming_umem_ram_load(void)
+{
+    ram_addr_t offset;
+    int flags;
+    int error;
+    void *shmem;
+    int i;
+    int bit;
+
+    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
+        return -EINVAL;
+    }
+
+    offset = qemu_get_be64(umemd.mig_read);
+
+    flags = offset & ~TARGET_PAGE_MASK;
+    offset &= TARGET_PAGE_MASK;
+
+    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
+
+    if (flags & RAM_SAVE_FLAG_EOS) {
+        DPRINTF("RAM_SAVE_FLAG_EOS\n");
+        postcopy_incoming_umem_send_eoc_req();
+
+        qemu_fclose(umemd.mig_read);
+        umemd.mig_read = NULL;
+        fd_close(&umemd.mig_read_fd);
+        umemd.state |= UMEM_STATE_EOS_RECEIVED;
+
+        postcopy_incoming_umem_queue_quit();
+        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
+        return 0;
+    }
+
+    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
+                                             &umemd.last_block_read);
+    if (!shmem) {
+        DPRINTF("shmem == NULL\n");
+        return -EINVAL;
+    }
+
+    if (flags & RAM_SAVE_FLAG_COMPRESS) {
+        uint8_t ch = qemu_get_byte(umemd.mig_read);
+        memset(shmem, ch, TARGET_PAGE_SIZE);
+    } else if (flags & RAM_SAVE_FLAG_PAGE) {
+        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
+    }
+
+    error = qemu_file_get_error(umemd.mig_read);
+    if (error) {
+        DPRINTF("error %d\n", error);
+        return error;
+    }
+
+    umemd.page_cached.nr = 0;
+    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
+    if (!test_and_set_bit(bit, umemd.phys_received)) {
+        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
+            __u64 pgoff = offset >> umemd.host_page_shift;
+            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
+                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
+                umemd.page_cached.nr++;
+            }
+        } else {
+            bool mark_cache = true;
+            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
+                if (!test_bit(bit + i, umemd.phys_received)) {
+                    mark_cache = false;
+                    break;
+                }
+            }
+            if (mark_cache) {
+                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
+                umemd.page_cached.nr = 1;
+            }
+        }
+    }
+
+    if (umemd.page_cached.nr > 0) {
+        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
+
+        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
+            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
+            uint64_t ramblock_pgoffset;
+
+            ramblock_pgoffset =
+                umemd.last_block_read->offset >> umemd.host_page_shift;
+            postcopy_incoming_umem_pages_present(&umemd.page_cached,
+                                                 ramblock_pgoffset);
+        }
+    }
+
+    return 0;
+}
+
+static bool postcopy_incoming_umem_check_umem_done(void)
+{
+    bool all_done = true;
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        UMem *umem = block->umem;
+        if (umem != NULL && umem->nsets == umem->nbits) {
+            umem_unmap_shmem(umem);
+            umem_destroy(umem);
+            block->umem = NULL;
+        }
+        if (block->umem != NULL) {
+            all_done = false;
+        }
+    }
+    return all_done;
+}
+
+static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
+{
+    int i;
+
+    for (i = 0; i < pages->nr; i++) {
+        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
+        RAMBlock *block = qemu_get_ram_block(addr);
+        addr -= block->offset;
+        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
+    }
+    return postcopy_incoming_umem_check_umem_done();
+}
+
+static bool
+postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
+{
+    RAMBlock *block;
+    ram_addr_t addr;
+    int i;
+
+    struct qemu_umem_req req = {
+        .cmd = QEMU_UMEM_REQ_REMOVE,
+        .nr = 0,
+        .pgoffs = (uint64_t *)pages->pgoffs,
+    };
+
+    addr = pages->pgoffs[0] << umemd.host_page_shift;
+    block = qemu_get_ram_block(addr);
+
+    for (i = 0; i < pages->nr; i++)  {
+        int pgoff;
+
+        addr = pages->pgoffs[i] << umemd.host_page_shift;
+        pgoff = addr >> TARGET_PAGE_BITS;
+        if (!test_bit(pgoff, umemd.phys_received) &&
+            !test_bit(pgoff, umemd.phys_requested)) {
+            req.pgoffs[req.nr] = pgoff;
+            req.nr++;
+        }
+        set_bit(pgoff, umemd.phys_received);
+        set_bit(pgoff, umemd.phys_requested);
+
+        umem_remove_shmem(block->umem,
+                          addr - block->offset, umemd.host_page_size);
+    }
+    if (req.nr > 0 && umemd.mig_write != NULL) {
+        req.idstr = block->idstr;
+        postcopy_incoming_send_req(umemd.mig_write->file, &req);
+    }
+
+    return postcopy_incoming_umem_check_umem_done();
+}
+
+static void postcopy_incoming_umem_done(void)
+{
+    postcopy_incoming_umem_send_eoc_req();
+    postcopy_incoming_umem_queue_quit();
+}
+
+static int postcopy_incoming_umem_handle_qemu(void)
+{
+    int ret;
+    int offset = 0;
+    uint8_t cmd;
+
+    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
+    offset += sizeof(cmd);
+    if (ret != sizeof(cmd)) {
+        return -EAGAIN;
+    }
+    DPRINTF("cmd %c\n", cmd);
+    switch (cmd) {
+    case UMEM_QEMU_QUIT:
+        postcopy_incoming_umem_recv_quit();
+        postcopy_incoming_umem_done();
+        break;
+    case UMEM_QEMU_PAGE_FAULTED: {
+        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
+                                                   &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (postcopy_incoming_umem_page_faulted(pages)) {
+            postcopy_incoming_umem_done();
+        }
+        g_free(pages);
+        break;
+    }
+    case UMEM_QEMU_PAGE_UNMAPPED: {
+        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
+                                                   &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (postcopy_incoming_umem_page_unmapped(pages)) {
+            postcopy_incoming_umem_done();
+        }
+        g_free(pages);
+        break;
+    }
+    default:
+        abort();
+        break;
+    }
+    if (umemd.from_qemu != NULL) {
+        qemu_file_skip(umemd.from_qemu, offset);
+    }
+    return 0;
+}
+
+static void set_fd(int fd, fd_set *fds, int *nfds)
+{
+    FD_SET(fd, fds);
+    if (fd > *nfds) {
+        *nfds = fd;
+    }
+}
+
+static int postcopy_incoming_umemd_main_loop(void)
+{
+    fd_set writefds;
+    fd_set readfds;
+    int nfds;
+    RAMBlock *block;
+    int ret;
+
+    int pending_size;
+    bool get_page_request;
+
+    nfds = -1;
+    FD_ZERO(&writefds);
+    FD_ZERO(&readfds);
+
+    if (umemd.mig_write != NULL) {
+        pending_size = nonblock_pending_size(umemd.mig_write);
+        if (pending_size > 0) {
+            set_fd(umemd.mig_write_fd, &writefds, &nfds);
+        }
+    } else {
+        pending_size = 0;
+    }
+
+#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
+    /* If page requests to the migration source have accumulated,
+       suspend taking new page fault requests. */
+    get_page_request = (pending_size <= PENDING_SIZE_MAX);
+
+    if (get_page_request) {
+        QLIST_FOREACH(block, &ram_list.blocks, next) {
+            if (block->umem != NULL) {
+                set_fd(block->umem->fd, &readfds, &nfds);
+            }
+        }
+    }
+
+    if (umemd.mig_read_fd >= 0) {
+        set_fd(umemd.mig_read_fd, &readfds, &nfds);
+    }
+
+    if (umemd.to_qemu != NULL &&
+        nonblock_pending_size(umemd.to_qemu) > 0) {
+        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
+    }
+    if (umemd.from_qemu_fd >= 0) {
+        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
+    }
+
+    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
+    if (ret == -1) {
+        if (errno == EINTR) {
+            return 0;
+        }
+        return ret;
+    }
+
+    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
+        nonblock_fflush(umemd.mig_write);
+    }
+    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
+        nonblock_fflush(umemd.to_qemu);
+    }
+    if (get_page_request) {
+        QLIST_FOREACH(block, &ram_list.blocks, next) {
+            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
+                postcopy_incoming_umem_send_page_req(block);
+            }
+        }
+    }
+    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
+        do {
+            ret = postcopy_incoming_umem_ram_load();
+            if (ret < 0) {
+                return ret;
+            }
+        } while (umemd.mig_read != NULL &&
+                 qemu_pending_size(umemd.mig_read) > 0);
+    }
+    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
+        do {
+            ret = postcopy_incoming_umem_handle_qemu();
+            if (ret == -EAGAIN) {
+                break;
+            }
+        } while (umemd.from_qemu != NULL &&
+                 qemu_pending_size(umemd.from_qemu) > 0);
+    }
+
+    if (umemd.mig_write != NULL) {
+        nonblock_fflush(umemd.mig_write);
+    }
+    if (umemd.to_qemu != NULL) {
+        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
+            postcopy_incoming_umem_send_pages_present();
+        }
+        nonblock_fflush(umemd.to_qemu);
+        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
+            nonblock_pending_size(umemd.to_qemu) == 0) {
+            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
+            qemu_fclose(umemd.to_qemu->file);
+            umemd.to_qemu = NULL;
+            fd_close(&umemd.to_qemu_fd);
+            umemd.state |= UMEM_STATE_QUIT_SENT;
+        }
+    }
+
+    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
+}
+
+static void postcopy_incoming_umemd(void)
+{
+    ram_addr_t last_ram_offset;
+    int nbits;
+    RAMBlock *block;
+    int ret;
+
+    qemu_daemon(1, 1);
+    signal(SIGPIPE, SIG_IGN);
+    DPRINTF("daemon pid: %d\n", getpid());
+
+    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
+    umemd.page_cached.pgoffs =
+        g_new(__u64, MAX_REQUESTS *
+              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
+               1: umemd.nr_host_pages_per_target_page));
+    umemd.target_pgoffs =
+        g_new(uint64_t, MAX_REQUESTS *
+              MAX(umemd.nr_host_pages_per_target_page,
+                  umemd.nr_target_pages_per_host_page));
+    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
+    umemd.present_request->nr = 0;
+
+    last_ram_offset = qemu_last_ram_offset();
+    nbits = last_ram_offset >> TARGET_PAGE_BITS;
+    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
+    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
+    umemd.last_block_read = NULL;
+    umemd.last_block_write = NULL;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        UMem *umem = block->umem;
+        umem->umem = NULL;      /* the umem mapping area has the VM_DONTCOPY
+                                   flag, so those mappings were lost on fork */
+        block->host = umem_map_shmem(umem);
+        umem_close_shmem(umem);
+    }
+    umem_daemon_ready(umemd.to_qemu_fd);
+    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
+
+    /* wait for qemu to disown migration_fd */
+    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
+    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
+
+    DPRINTF("entering umemd main loop\n");
+    for (;;) {
+        ret = postcopy_incoming_umemd_main_loop();
+        if (ret != 0) {
+            break;
+        }
+    }
+    DPRINTF("exiting umemd main loop\n");
+
+    /* This daemon was forked from qemu, and the parent qemu is still running.
+     * Cleanup of linked libraries such as SDL must not be triggered here;
+     * otherwise the parent qemu may use resources that were already freed.
+     */
+    fflush(stdout);
+    fflush(stderr);
+    _exit(ret < 0? EXIT_FAILURE: 0);
+}
diff --git a/migration-tcp.c b/migration-tcp.c
index cf6a9b8..aa35050 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
     } while (ret == -1 && (socket_error()) == EINTR);
 
     if (ret < 0) {
-        migrate_fd_error(s);
-        return;
+        goto error_out;
     }
 
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
 
-    if (val == 0)
+    if (val == 0) {
+        ret = postcopy_outgoing_create_read_socket(s);
+        if (ret < 0) {
+            goto error_out;
+        }
         migrate_fd_connect(s);
-    else {
+    } else {
         DPRINTF("error connecting %d\n", val);
-        migrate_fd_error(s);
+        goto error_out;
     }
+    return;
+
+error_out:
+    migrate_fd_error(s);
 }
 
 int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
@@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
 
     if (ret < 0) {
         DPRINTF("connect failed\n");
-        migrate_fd_error(s);
-        return ret;
+        goto error_out;
+    }
+
+    ret = postcopy_outgoing_create_read_socket(s);
+    if (ret < 0) {
+        goto error_out;
     }
     migrate_fd_connect(s);
     return 0;
+
+error_out:
+    migrate_fd_error(s);
+    return ret;
 }
 
 static void tcp_accept_incoming_migration(void *opaque)
@@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
     }
 
     process_incoming_migration(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(c, f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        /* the socket is now disowned,
+           so tell the umem daemon that it is safe to use it */
+        postcopy_incoming_qemu_ready();
+    }
 out:
     close(c);
 out2:
diff --git a/migration-unix.c b/migration-unix.c
index dfcf203..3707505 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
 
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
 
-    if (val == 0)
+    if (val == 0) {
+        ret = postcopy_outgoing_create_read_socket(s);
+        if (ret < 0) {
+            goto error_out;
+        }
         migrate_fd_connect(s);
-    else {
+    } else {
         DPRINTF("error connecting %d\n", val);
-        migrate_fd_error(s);
+        goto error_out;
     }
+    return;
+
+error_out:
+    migrate_fd_error(s);
 }
 
 int unix_start_outgoing_migration(MigrationState *s, const char *path)
@@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
 
     if (ret < 0) {
         DPRINTF("connect failed\n");
-        migrate_fd_error(s);
-        return ret;
+        goto error_out;
+    }
+
+    ret = postcopy_outgoing_create_read_socket(s);
+    if (ret < 0) {
+        goto error_out;
     }
     migrate_fd_connect(s);
     return 0;
+
+error_out:
+    migrate_fd_error(s);
+    return ret;
 }
 
 static void unix_accept_incoming_migration(void *opaque)
@@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
     }
 
     process_incoming_migration(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(c, f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_ready();
+    }
 out:
     close(c);
 out2:
diff --git a/migration.c b/migration.c
index 0149ab3..51efe44 100644
--- a/migration.c
+++ b/migration.c
@@ -39,6 +39,11 @@ enum {
     MIG_STATE_COMPLETED,
 };
 
+enum {
+    MIG_SUBSTATE_PRECOPY,
+    MIG_SUBSTATE_POSTCOPY,
+};
+
 #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
 
 static NotifierList migration_state_notifiers =
@@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
         return;
     }
 
+    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
+        /* PRINTF("postcopy background\n"); */
+        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
+                                                    s->postcopy);
+        if (ret > 0) {
+            migrate_fd_completed(s);
+        } else if (ret < 0) {
+            migrate_fd_error(s);
+        }
+        return;
+    }
+
     DPRINTF("iterate\n");
     ret = qemu_savevm_state_iterate(s->mon, s->file);
     if (ret < 0) {
@@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
         DPRINTF("done iterating\n");
         vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
 
+        if (s->params.postcopy) {
+            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
+                migrate_fd_error(s);
+                if (old_vm_running) {
+                    vm_start();
+                }
+                return;
+            }
+            s->substate = MIG_SUBSTATE_POSTCOPY;
+            s->postcopy = postcopy_outgoing_begin(s);
+            return;
+        }
+
         if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
             migrate_fd_error(s);
         } else {
@@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
     int ret;
 
     s->state = MIG_STATE_ACTIVE;
+    s->substate = MIG_SUBSTATE_PRECOPY;
     s->file = qemu_fopen_ops_buffered(s,
                                       s->bandwidth_limit,
                                       migrate_fd_put_buffer,
diff --git a/migration.h b/migration.h
index 90ae362..2809e99 100644
--- a/migration.h
+++ b/migration.h
@@ -40,6 +40,12 @@ struct MigrationState
     int (*write)(MigrationState *s, const void *buff, size_t size);
     void *opaque;
     MigrationParams params;
+
+    /* for postcopy */
+    int substate;              /* precopy or postcopy */
+    int fd_read;
+    QEMUFile *file_read;        /* connection from the destination */
+    void *postcopy;
 };
 
 void process_incoming_migration(QEMUFile *f);
@@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
 uint64_t ram_bytes_total(void);
 
+void ram_save_set_params(const MigrationParams *params, void *opaque);
 void sort_ram_list(void);
 int ram_save_block(QEMUFile *f);
 void ram_save_memory_set_dirty(void);
@@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
  */
 void migrate_del_blocker(Error *reason);
 
+/* For outgoing postcopy */
+int postcopy_outgoing_create_read_socket(MigrationState *s);
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque);
+void *postcopy_outgoing_begin(MigrationState *s);
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy);
+
+/* For incoming postcopy */
 extern bool incoming_postcopy;
 extern unsigned long incoming_postcopy_flags;
 
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
+void postcopy_incoming_ram_alloc(const char *name,
+                                 size_t size, uint8_t **hostp, UMem **umemp);
+void postcopy_incoming_ram_free(UMem *umem);
+void postcopy_incoming_prepare(void);
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
+void postcopy_incoming_qemu_ready(void);
+void postcopy_incoming_qemu_cleanup(void);
+#ifdef NEED_CPU_H
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
+#endif
+
 #endif
diff --git a/qemu-common.h b/qemu-common.h
index 725922b..d74a8c9 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
 
 struct Monitor;
 typedef struct Monitor Monitor;
+typedef struct UMem UMem;
 
 /* we put basic includes here to avoid repeating them in device drivers */
 #include <stdlib.h>
diff --git a/qemu-options.hx b/qemu-options.hx
index 5c5b8f3..19e20f9 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
     "-postcopy-flags unsigned-int(flags)\n"
     "	                flags for postcopy incoming migration\n"
     "                   when -incoming and -postcopy are specified.\n"
-    "                   This is for benchmark/debug purpose (default: 0)\n",
+    "                   This is for benchmark/debug purpose (default: 0)\n"
+    "                   Currently supported flags are\n"
+    "                   1: enable fault request from umemd to qemu\n"
+    "                      (default: disabled)\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -postcopy-flags int
diff --git a/umem.c b/umem.c
new file mode 100644
index 0000000..b7be006
--- /dev/null
+++ b/umem.c
@@ -0,0 +1,379 @@
+/*
+ * umem.c: user process backed memory module for postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/umem.h>
+
+#include "bitops.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "umem.h"
+
+//#define DEBUG_UMEM
+#ifdef DEBUG_UMEM
+#include <sys/syscall.h>
+#define DPRINTF(format, ...)                                            \
+    do {                                                                \
+        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
+               __func__, __LINE__, ## __VA_ARGS__);                     \
+    } while (0)
+#else
+#define DPRINTF(format, ...)    do { } while (0)
+#endif
+
+#define DEV_UMEM        "/dev/umem"
+
+struct UMemDev {
+    int fd;
+    int page_shift;
+};
+
+UMemDev *umem_dev_new(void)
+{
+    UMemDev *umem_dev;
+    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
+    if (umem_dev_fd < 0) {
+        perror("can't open "DEV_UMEM);
+        abort();
+    }
+
+    umem_dev = g_new(UMemDev, 1);
+    umem_dev->fd = umem_dev_fd;
+    umem_dev->page_shift = ffs(getpagesize()) - 1;
+    return umem_dev;
+}
+
+void umem_dev_destroy(UMemDev *dev)
+{
+    close(dev->fd);
+    g_free(dev);
+}
+
+UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
+{
+    struct umem_create create = {
+        .size = size,
+        .async_req_max = 0,
+        .sync_req_max = 0,
+    };
+    UMem *umem;
+
+    snprintf(create.name.id, sizeof(create.name.id),
+             "pid-%"PRIu64, (uint64_t)getpid());
+    create.name.id[UMEM_ID_MAX - 1] = 0;
+    strncpy(create.name.name, name, sizeof(create.name.name));
+    create.name.name[UMEM_NAME_MAX - 1] = 0;
+
+    assert((size % getpagesize()) == 0);
+    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
+        perror("UMEM_DEV_CREATE_UMEM");
+        abort();
+    }
+    if (ftruncate(create.shmem_fd, create.size) < 0) {
+        perror("ftruncate(\"shmem_fd\")");
+        abort();
+    }
+
+    umem = g_new(UMem, 1);
+    umem->nbits = 0;
+    umem->nsets = 0;
+    umem->faulted = NULL;
+    umem->page_shift = dev->page_shift;
+    umem->fd = create.umem_fd;
+    umem->shmem_fd = create.shmem_fd;
+    umem->size = create.size;
+    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
+                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (umem->umem == MAP_FAILED) {
+        perror("mmap(UMem) failed");
+        abort();
+    }
+    return umem;
+}
+
+void umem_mmap(UMem *umem)
+{
+    void *ret = mmap(umem->umem, umem->size,
+                     PROT_EXEC | PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
+    if (ret == MAP_FAILED) {
+        perror("umem_mmap(UMem) failed");
+        abort();
+    }
+}
+
+void umem_destroy(UMem *umem)
+{
+    if (umem->fd != -1) {
+        close(umem->fd);
+    }
+    if (umem->shmem_fd != -1) {
+        close(umem->shmem_fd);
+    }
+    g_free(umem->faulted);
+    g_free(umem);
+}
+
+void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
+{
+    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
+        perror("daemon: UMEM_GET_PAGE_REQUEST");
+        abort();
+    }
+}
+
+void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
+{
+    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
+        perror("daemon: UMEM_MARK_PAGE_CACHED");
+        abort();
+    }
+}
+
+void umem_unmap(UMem *umem)
+{
+    munmap(umem->umem, umem->size);
+    umem->umem = NULL;
+}
+
+void umem_close(UMem *umem)
+{
+    close(umem->fd);
+    umem->fd = -1;
+}
+
+void *umem_map_shmem(UMem *umem)
+{
+    umem->nbits = umem->size >> umem->page_shift;
+    umem->nsets = 0;
+    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
+
+    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
+                       umem->shmem_fd, 0);
+    if (umem->shmem == MAP_FAILED) {
+        perror("daemon: mmap(\"shmem\")");
+        abort();
+    }
+    return umem->shmem;
+}
+
+void umem_unmap_shmem(UMem *umem)
+{
+    munmap(umem->shmem, umem->size);
+    umem->shmem = NULL;
+}
+
+void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
+{
+    int s = offset >> umem->page_shift;
+    int e = (offset + size) >> umem->page_shift;
+    int i;
+
+    for (i = s; i < e; i++) {
+        if (!test_and_set_bit(i, umem->faulted)) {
+            umem->nsets++;
+#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
+            madvise(umem->shmem + offset, size, MADV_REMOVE);
+#endif
+        }
+    }
+}
+
+void umem_close_shmem(UMem *umem)
+{
+    close(umem->shmem_fd);
+    umem->shmem_fd = -1;
+}
+
+/***************************************************************************/
+/* qemu <-> umem daemon communication */
+
+size_t umem_pages_size(uint64_t nr)
+{
+    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
+}
+
+static void umem_write_cmd(int fd, uint8_t cmd)
+{
+    DPRINTF("write cmd %c\n", cmd);
+
+    for (;;) {
+        ssize_t ret = write(fd, &cmd, 1);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EPIPE) {
+                perror("pipe");
+                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
+                        cmd, ret, errno);
+                break;
+            }
+
+            perror("pipe");
+            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
+            abort();
+        }
+
+        break;
+    }
+}
+
+static void umem_read_cmd(int fd, uint8_t expect)
+{
+    uint8_t cmd;
+    for (;;) {
+        ssize_t ret = read(fd, &cmd, 1);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            }
+            perror("pipe");
+            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
+            abort();
+        }
+
+        if (ret == 0) {
+            DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
+            abort();
+        }
+
+        break;
+    }
+
+    DPRINTF("read cmd %c\n", cmd);
+    if (cmd != expect) {
+        DPRINTF("cmd %c expect %d\n", cmd, expect);
+        abort();
+    }
+}
+
+struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
+{
+    int ret;
+    uint64_t nr;
+    size_t size;
+    struct umem_pages *pages;
+
+    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
+    *offset += sizeof(nr);
+    DPRINTF("ret %d nr %"PRIu64"\n", ret, nr);
+    if (ret != sizeof(nr) || nr == 0) {
+        return NULL;
+    }
+
+    size = umem_pages_size(nr);
+    pages = g_malloc(size);
+    pages->nr = nr;
+    size -= sizeof(pages->nr);
+
+    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
+    *offset += size;
+    if (ret != size) {
+        g_free(pages);
+        return NULL;
+    }
+    return pages;
+}
+
+static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
+{
+    size_t len = umem_pages_size(pages->nr);
+    qemu_put_buffer(f, (const uint8_t*)pages, len);
+}
+
+/* umem daemon -> qemu */
+void umem_daemon_ready(int to_qemu_fd)
+{
+    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
+}
+
+void umem_daemon_quit(QEMUFile *to_qemu)
+{
+    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
+}
+
+void umem_daemon_send_pages_present(QEMUFile *to_qemu,
+                                    struct umem_pages *pages)
+{
+    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
+    umem_send_pages(to_qemu, pages);
+}
+
+void umem_daemon_wait_for_qemu(int from_qemu_fd)
+{
+    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
+}
+
+/* qemu -> umem daemon */
+void umem_qemu_wait_for_daemon(int from_umemd_fd)
+{
+    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
+}
+
+void umem_qemu_ready(int to_umemd_fd)
+{
+    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
+}
+
+void umem_qemu_quit(QEMUFile *to_umemd)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
+}
+
+/* qemu side handler */
+struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
+                                                int *offset)
+{
+    uint64_t i;
+    int page_shift = ffs(getpagesize()) - 1;
+    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
+    if (pages == NULL) {
+        return NULL;
+    }
+
+    for (i = 0; i < pages->nr; i++) {
+        ram_addr_t addr = pages->pgoffs[i] << page_shift;
+
+        /* make pages present by forcibly triggering page fault. */
+        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
+        uint8_t dummy_read = ram[0];
+        (void)dummy_read;   /* suppress unused variable warning */
+    }
+
+    return pages;
+}
+
+void umem_qemu_send_pages_present(QEMUFile *to_umemd,
+                                  const struct umem_pages *pages)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
+    umem_send_pages(to_umemd, pages);
+}
+
+void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
+                                   const struct umem_pages *pages)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
+    umem_send_pages(to_umemd, pages);
+}
diff --git a/umem.h b/umem.h
new file mode 100644
index 0000000..5ca19ef
--- /dev/null
+++ b/umem.h
@@ -0,0 +1,105 @@
+/*
+ * umem.h: user process backed memory module for postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef QEMU_UMEM_H
+#define QEMU_UMEM_H
+
+#include <linux/umem.h>
+
+#include "qemu-common.h"
+
+typedef struct UMemDev UMemDev;
+
+struct UMem {
+    void *umem;
+    int fd;
+    void *shmem;
+    int shmem_fd;
+    uint64_t size;
+
+    /* indexed by host page size */
+    int page_shift;
+    int nbits;
+    int nsets;
+    unsigned long *faulted;
+};
+
+UMemDev *umem_dev_new(void);
+void umem_dev_destroy(UMemDev *dev);
+UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
+void umem_mmap(UMem *umem);
+
+void umem_destroy(UMem *umem);
+
+/* umem device operations */
+void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
+void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
+void umem_unmap(UMem *umem);
+void umem_close(UMem *umem);
+
+/* umem shmem operations */
+void *umem_map_shmem(UMem *umem);
+void umem_unmap_shmem(UMem *umem);
+void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
+void umem_close_shmem(UMem *umem);
+
+/* qemu on source <-> umem daemon communication */
+
+struct umem_pages {
+    uint64_t nr;        /* nr = 0 means completed */
+    uint64_t pgoffs[0];
+};
+
+/* daemon -> qemu */
+#define UMEM_DAEMON_READY               'R'
+#define UMEM_DAEMON_QUIT                'Q'
+#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
+#define UMEM_DAEMON_ERROR               'E'
+
+/* qemu -> daemon */
+#define UMEM_QEMU_READY                 'r'
+#define UMEM_QEMU_QUIT                  'q'
+#define UMEM_QEMU_PAGE_FAULTED          't'
+#define UMEM_QEMU_PAGE_UNMAPPED         'u'
+
+struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
+size_t umem_pages_size(uint64_t nr);
+
+/* for umem daemon */
+void umem_daemon_ready(int to_qemu_fd);
+void umem_daemon_wait_for_qemu(int from_qemu_fd);
+void umem_daemon_quit(QEMUFile *to_qemu);
+void umem_daemon_send_pages_present(QEMUFile *to_qemu,
+                                    struct umem_pages *pages);
+
+/* for qemu */
+void umem_qemu_wait_for_daemon(int from_umemd_fd);
+void umem_qemu_ready(int to_umemd_fd);
+void umem_qemu_quit(QEMUFile *to_umemd);
+struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
+                                                int *offset);
+void umem_qemu_send_pages_present(QEMUFile *to_umemd,
+                                  const struct umem_pages *pages);
+void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
+                                   const struct umem_pages *pages);
+
+#endif /* QEMU_UMEM_H */
diff --git a/vl.c b/vl.c
index 5430b8c..17427a0 100644
--- a/vl.c
+++ b/vl.c
@@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
     default_drive(default_sdcard, snapshot, machine->use_scsi,
                   IF_SD, 0, SD_OPTS);
 
-    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
-                         ram_save_live, NULL, ram_load, NULL);
+    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
+        exit(1);
+    }
+    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
+                         ram_save_set_params, ram_save_live, NULL,
+                         ram_load, NULL);
 
     if (nb_numa_nodes > 0) {
         int i;
@@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
 
     if (incoming) {
         runstate_set(RUN_STATE_INMIGRATE);
+        if (incoming_postcopy) {
+            postcopy_incoming_prepare();
+        }
         int ret = qemu_start_incoming_migration(incoming);
         if (ret < 0) {
             fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
@@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
     bdrv_close_all();
     pause_all_vcpus();
     net_cleanup();
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_cleanup();
+    }
     res_free();
 
     return 0;
-- 
1.7.1.1



* [Qemu-devel] [PATCH 21/21] postcopy: implement postcopy livemigration
@ 2011-12-29  1:26   ` Isaku Yamahata
From: Isaku Yamahata @ 2011-12-29  1:26 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This patch implements postcopy livemigration.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 Makefile.target           |    4 +
 arch_init.c               |   26 +-
 cpu-all.h                 |    7 +
 exec.c                    |   20 +-
 migration-exec.c          |    8 +
 migration-fd.c            |   30 +
 migration-postcopy-stub.c |   77 ++
 migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
 migration-tcp.c           |   37 +-
 migration-unix.c          |   32 +-
 migration.c               |   31 +
 migration.h               |   30 +
 qemu-common.h             |    1 +
 qemu-options.hx           |    5 +-
 umem.c                    |  379 +++++++++
 umem.h                    |  105 +++
 vl.c                      |   14 +-
 17 files changed, 2677 insertions(+), 20 deletions(-)
 create mode 100644 migration-postcopy-stub.c
 create mode 100644 migration-postcopy.c
 create mode 100644 umem.c
 create mode 100644 umem.h

diff --git a/Makefile.target b/Makefile.target
index 3261383..d94c53f 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
 CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
 CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
 CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
+CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
 
 include ../config-host.mak
 include config-devices.mak
@@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
 obj-y += memory.o
 LIBS+=-lz
 
+common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
+common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
+
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
 QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
diff --git a/arch_init.c b/arch_init.c
index bc53092..8b3130d 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
+static bool outgoing_postcopy = false;
+
+void ram_save_set_params(const MigrationParams *params, void *opaque)
+{
+    outgoing_postcopy = params->postcopy;
+}
+
 static RAMBlock *last_block_sent = NULL;
 
 int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
@@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t expected_time = 0;
     int ret;
 
+    if (stage == 1) {
+        last_block_sent = NULL;
+
+        bytes_transferred = 0;
+        last_block = NULL;
+        last_offset = 0;
+    }
+    if (outgoing_postcopy) {
+        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
+    }
+
     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
         return 0;
@@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     }
 
     if (stage == 1) {
-        bytes_transferred = 0;
-        last_block_sent = NULL;
-        last_block = NULL;
-        last_offset = 0;
         sort_ram_list();
 
         /* Make sure all dirty bits are set */
@@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
     int flags;
     int error;
 
+    if (incoming_postcopy) {
+        return postcopy_incoming_ram_load(f, opaque, version_id);
+    }
+
     if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
         return -EINVAL;
     }
diff --git a/cpu-all.h b/cpu-all.h
index 0244f7a..2e9d8a7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
 /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
 #define RAM_PREALLOC_MASK   (1 << 0)
 
+/* RAM is allocated via umem for postcopy incoming mode */
+#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
+
 typedef struct RAMBlock {
     uint8_t *host;
     ram_addr_t offset;
@@ -485,6 +488,10 @@ typedef struct RAMBlock {
 #if defined(__linux__) && !defined(TARGET_S390X)
     int fd;
 #endif
+
+#ifdef CONFIG_POSTCOPY
+    UMem *umem;    /* for incoming postcopy mode */
+#endif
 } RAMBlock;
 
 typedef struct RAMList {
diff --git a/exec.c b/exec.c
index c8c6692..90b0491 100644
--- a/exec.c
+++ b/exec.c
@@ -35,6 +35,7 @@
 #include "qemu-timer.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "migration.h"
 #if defined(CONFIG_USER_ONLY)
 #include <qemu.h>
 #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
         new_block->host = host;
         new_block->flags |= RAM_PREALLOC_MASK;
     } else {
+#ifdef CONFIG_POSTCOPY
+        if (incoming_postcopy) {
+            postcopy_incoming_ram_alloc(name, size,
+                                        &new_block->host, &new_block->umem);
+            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
+        } else
+#endif
         if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
             new_block->host = file_ram_alloc(new_block, size, mem_path);
@@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
             QLIST_REMOVE(block, next);
             if (block->flags & RAM_PREALLOC_MASK) {
                 ;
-            } else if (mem_path) {
+            }
+#ifdef CONFIG_POSTCOPY
+            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
+                postcopy_incoming_ram_free(block->umem);
+            }
+#endif
+            else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
                 if (block->fd) {
                     munmap(block->host, block->length);
@@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
             } else {
                 flags = MAP_FIXED;
                 munmap(vaddr, length);
+                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
+                    postcopy_incoming_qemu_pages_unmapped(addr, length);
+                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
+                }
                 if (mem_path) {
 #if defined(__linux__) && !defined(TARGET_S390X)
                     if (block->fd) {
diff --git a/migration-exec.c b/migration-exec.c
index e14552e..2bd0c3b 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
 {
     FILE *f;
 
+    if (s->params.postcopy) {
+        return -ENOSYS;
+    }
+
     f = popen(command, "w");
     if (f == NULL) {
         DPRINTF("Unable to popen exec target\n");
@@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
 {
     QEMUFile *f;
 
+    if (incoming_postcopy) {
+        return -ENOSYS;
+    }
+
     DPRINTF("Attempting to start an incoming migration\n");
     f = qemu_popen_cmd(command, "r");
     if(f == NULL) {
diff --git a/migration-fd.c b/migration-fd.c
index 6211124..5a62ab9 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
     s->write = fd_write;
     s->close = fd_close;
 
+    if (s->params.postcopy) {
+        int flags = fcntl(s->fd, F_GETFL);
+        if ((flags & O_ACCMODE) != O_RDWR) {
+            goto err_after_open;
+        }
+
+        s->fd_read = dup(s->fd);
+        if (s->fd_read == -1) {
+            goto err_after_open;
+        }
+        s->file_read = qemu_fdopen(s->fd_read, "r");
+        if (s->file_read == NULL) {
+            close(s->fd_read);
+            goto err_after_open;
+        }
+    }
+
     migrate_fd_connect(s);
     return 0;
 
@@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
 
     process_incoming_migration(f);
     qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_ready();
+    }
+    return;
 }
 
 int fd_start_incoming_migration(const char *infd)
@@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
     DPRINTF("Attempting to start an incoming migration via fd\n");
 
     fd = strtol(infd, NULL, 0);
+    if (incoming_postcopy) {
+        int flags = fcntl(fd, F_GETFL);
+        if ((flags & O_ACCMODE) != O_RDWR) {
+            return -EINVAL;
+        }
+    }
     f = qemu_fdopen(fd, "rb");
     if(f == NULL) {
         DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
new file mode 100644
index 0000000..0b78de7
--- /dev/null
+++ b/migration-postcopy-stub.c
@@ -0,0 +1,77 @@
+/*
+ * migration-postcopy-stub.c: postcopy livemigration
+ *                            stub functions for non-supported hosts
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "sysemu.h"
+#include "migration.h"
+
+int postcopy_outgoing_create_read_socket(MigrationState *s)
+{
+    return -ENOSYS;
+}
+
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque)
+{
+    return -ENOSYS;
+}
+
+void *postcopy_outgoing_begin(MigrationState *ms)
+{
+    return NULL;
+}
+
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy)
+{
+    return -ENOSYS;
+}
+
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
+{
+    return -ENOSYS;
+}
+
+void postcopy_incoming_prepare(void)
+{
+}
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
+{
+    return -ENOSYS;
+}
+
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
+{
+}
+
+void postcopy_incoming_qemu_ready(void)
+{
+}
+
+void postcopy_incoming_qemu_cleanup(void)
+{
+}
+
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
+{
+}
diff --git a/migration-postcopy.c b/migration-postcopy.c
new file mode 100644
index 0000000..ed0d574
--- /dev/null
+++ b/migration-postcopy.c
@@ -0,0 +1,1891 @@
+/*
+ * migration-postcopy.c: postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "bitmap.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "arch_init.h"
+#include "migration.h"
+#include "umem.h"
+
+#include "memory.h"
+#define WANT_EXEC_OBSOLETE
+#include "exec-obsolete.h"
+
+//#define DEBUG_POSTCOPY
+#ifdef DEBUG_POSTCOPY
+#include <sys/syscall.h>
+#define DPRINTF(fmt, ...)                                               \
+    do {                                                                \
+        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
+               __func__, __LINE__, ## __VA_ARGS__);                     \
+    } while (0)
+#else
+#define DPRINTF(fmt, ...)       do { } while (0)
+#endif
+
+#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
+
+static void fd_close(int *fd)
+{
+    if (*fd >= 0) {
+        close(*fd);
+        *fd = -1;
+    }
+}
+
+/***************************************************************************
+ * QEMUFile for non blocking pipe
+ */
+
+/* read only */
+struct QEMUFilePipe {
+    int fd;
+    QEMUFile *file;
+};
+typedef struct QEMUFilePipe QEMUFilePipe;
+
+static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFilePipe *s = opaque;
+    ssize_t len = 0;
+
+    while (size > 0) {
+        ssize_t ret = read(s->fd, buf, size);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            }
+            if (len == 0) {
+                len = -errno;
+            }
+            break;
+        }
+
+        if (ret == 0) {
+            /* the write end of the pipe is closed */
+            break;
+        }
+        len += ret;
+        buf += ret;
+        size -= ret;
+    }
+
+    return len;
+}
+
+static int pipe_close(void *opaque)
+{
+    QEMUFilePipe *s = opaque;
+    g_free(s);
+    return 0;
+}
+
+static QEMUFile *qemu_fopen_pipe(int fd)
+{
+    QEMUFilePipe *s = g_malloc0(sizeof(*s));
+
+    s->fd = fd;
+    fcntl_setfl(fd, O_NONBLOCK);
+    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
+                             NULL, NULL, NULL);
+    return s->file;
+}
+
+/* write only */
+struct QEMUFileNonblock {
+    int fd;
+    QEMUFile *file;
+
+    /* for pipe-write nonblocking mode */
+#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
+    uint8_t *buffer;
+    size_t buffer_size;
+    size_t buffer_capacity;
+    bool freeze_output;
+};
+typedef struct QEMUFileNonblock QEMUFileNonblock;
+
+static void nonblock_flush_buffer(QEMUFileNonblock *s)
+{
+    size_t offset = 0;
+    ssize_t ret;
+
+    while (offset < s->buffer_size) {
+        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EAGAIN) {
+                s->freeze_output = true;
+            } else {
+                qemu_file_set_error(s->file, errno);
+            }
+            break;
+        }
+
+        if (ret == 0) {
+            DPRINTF("ret == 0\n");
+            break;
+        }
+
+        offset += ret;
+    }
+
+    if (offset > 0) {
+        assert(s->buffer_size >= offset);
+        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
+        s->buffer_size -= offset;
+    }
+    if (s->buffer_size > 0) {
+        s->freeze_output = true;
+    }
+}
+
+static int nonblock_put_buffer(void *opaque,
+                               const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileNonblock *s = opaque;
+    int error;
+    ssize_t len = 0;
+
+    error = qemu_file_get_error(s->file);
+    if (error) {
+        return error;
+    }
+
+    nonblock_flush_buffer(s);
+    error = qemu_file_get_error(s->file);
+    if (error) {
+        return error;
+    }
+
+    while (!s->freeze_output && size > 0) {
+        ssize_t ret;
+        assert(s->buffer_size == 0);
+
+        ret = write(s->fd, buf, size);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EAGAIN) {
+                s->freeze_output = true;
+            } else {
+                qemu_file_set_error(s->file, errno);
+            }
+            break;
+        }
+
+        len += ret;
+        buf += ret;
+        size -= ret;
+    }
+
+    if (size > 0) {
+        int inc = size - (s->buffer_capacity - s->buffer_size);
+        if (inc > 0) {
+            s->buffer_capacity +=
+                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
+            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
+        }
+        memcpy(s->buffer + s->buffer_size, buf, size);
+        s->buffer_size += size;
+
+        len += size;
+    }
+
+    return len;
+}
+
+static int nonblock_pending_size(QEMUFileNonblock *s)
+{
+    return qemu_pending_size(s->file) + s->buffer_size;
+}
+
+static void nonblock_fflush(QEMUFileNonblock *s)
+{
+    s->freeze_output = false;
+    nonblock_flush_buffer(s);
+    if (!s->freeze_output) {
+        qemu_fflush(s->file);
+    }
+}
+
+static void nonblock_wait_for_flush(QEMUFileNonblock *s)
+{
+    while (nonblock_pending_size(s) > 0) {
+        fd_set fds;
+        FD_ZERO(&fds);
+        FD_SET(s->fd, &fds);
+        select(s->fd + 1, NULL, &fds, NULL, NULL);
+
+        nonblock_fflush(s);
+    }
+}
+
+static int nonblock_close(void *opaque)
+{
+    QEMUFileNonblock *s = opaque;
+    nonblock_wait_for_flush(s);
+    g_free(s->buffer);
+    g_free(s);
+    return 0;
+}
+
+static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
+{
+    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
+
+    s->fd = fd;
+    fcntl_setfl(fd, O_NONBLOCK);
+    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
+                             NULL, NULL, NULL);
+    return s;
+}
+
+/***************************************************************************
+ * umem daemon on destination <-> qemu on source protocol
+ */
+
+#define QEMU_UMEM_REQ_INIT              0x00
+#define QEMU_UMEM_REQ_ON_DEMAND         0x01
+#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
+#define QEMU_UMEM_REQ_BACKGROUND        0x03
+#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
+#define QEMU_UMEM_REQ_REMOVE            0x05
+#define QEMU_UMEM_REQ_EOC               0x06
+
+struct qemu_umem_req {
+    int8_t cmd;
+    uint8_t len;
+    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
+    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
+                           BACKGROUND, BACKGROUND_CONT, REMOVE */
+
+    /* in target page size as qemu migration protocol */
+    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
+                           BACKGROUND, BACKGROUND_CONT, REMOVE */
+};
+
+static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
+{
+    qemu_put_byte(f, strlen(idstr));
+    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
+}
+
+static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
+                                              const uint64_t *pgoffs)
+{
+    uint32_t i;
+
+    qemu_put_be32(f, nr);
+    for (i = 0; i < nr; i++) {
+        qemu_put_be64(f, pgoffs[i]);
+    }
+}
+
+static void postcopy_incoming_send_req_one(QEMUFile *f,
+                                           const struct qemu_umem_req *req)
+{
+    DPRINTF("cmd %d\n", req->cmd);
+    qemu_put_byte(f, req->cmd);
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+    case QEMU_UMEM_REQ_REMOVE:
+        postcopy_incoming_send_req_idstr(f, req->idstr);
+        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
+        break;
+    default:
+        abort();
+        break;
+    }
+}
+
+/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024 bytes.
+ * So one message must be <= IO_BUF_SIZE
+ * cmd: 1
+ * id len: 1
+ * id: 256
+ * nr: 4 (sent as be32)
+ */
+#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 4) / sizeof(uint64_t))
+static void postcopy_incoming_send_req(QEMUFile *f,
+                                       const struct qemu_umem_req *req)
+{
+    uint32_t nr = req->nr;
+    struct qemu_umem_req tmp = *req;
+
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        postcopy_incoming_send_req_one(f, &tmp);
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+        tmp.nr = MIN(nr, MAX_PAGE_NR);
+        postcopy_incoming_send_req_one(f, &tmp);
+
+        nr -= tmp.nr;
+        tmp.pgoffs += tmp.nr;
+        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
+            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
+        } else {
+            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
+        }
+        /* fall through */
+    case QEMU_UMEM_REQ_REMOVE:
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        while (nr > 0) {
+            tmp.nr = MIN(nr, MAX_PAGE_NR);
+            postcopy_incoming_send_req_one(f, &tmp);
+
+            nr -= tmp.nr;
+            tmp.pgoffs += tmp.nr;
+        }
+        break;
+    default:
+        abort();
+        break;
+    }
+}
+
+static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
+                                            struct qemu_umem_req *req,
+                                            size_t *offset)
+{
+    int ret;
+
+    req->len = qemu_peek_byte(f, *offset);
+    *offset += 1;
+    if (req->len == 0) {
+        return -EAGAIN;
+    }
+    req->idstr = g_malloc((int)req->len + 1);
+    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
+    *offset += ret;
+    if (ret != req->len) {
+        g_free(req->idstr);
+        req->idstr = NULL;
+        return -EAGAIN;
+    }
+    req->idstr[req->len] = 0;
+    return 0;
+}
+
+static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
+                                             struct qemu_umem_req *req,
+                                             size_t *offset)
+{
+    int ret;
+    uint32_t be32;
+    uint32_t i;
+
+    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
+    *offset += sizeof(be32);
+    if (ret != sizeof(be32)) {
+        return -EAGAIN;
+    }
+
+    req->nr = be32_to_cpu(be32);
+    req->pgoffs = g_new(uint64_t, req->nr);
+    for (i = 0; i < req->nr; i++) {
+        uint64_t be64;
+        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
+        *offset += sizeof(be64);
+        if (ret != sizeof(be64)) {
+            g_free(req->pgoffs);
+            req->pgoffs = NULL;
+            return -EAGAIN;
+        }
+        req->pgoffs[i] = be64_to_cpu(be64);
+    }
+    return 0;
+}
+
+static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
+{
+    int size;
+    int ret;
+    size_t offset = 0;
+
+    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
+    if (size <= 0) {
+        return -EAGAIN;
+    }
+    offset += 1;
+
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+    case QEMU_UMEM_REQ_EOC:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+    case QEMU_UMEM_REQ_REMOVE:
+        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        break;
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
+        if (ret < 0) {
+            return ret;
+        }
+        break;
+    default:
+        abort();
+        break;
+    }
+    qemu_file_skip(f, offset);
+    DPRINTF("cmd %d\n", req->cmd);
+    return 0;
+}
+
+static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
+{
+    g_free(req->idstr);
+    g_free(req->pgoffs);
+}
+
+/***************************************************************************
+ * outgoing part
+ */
+
+#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
+#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
+#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
+
+enum POState {
+    PO_STATE_ERROR_RECEIVE,
+    PO_STATE_ACTIVE,
+    PO_STATE_EOC_RECEIVED,
+    PO_STATE_ALL_PAGES_SENT,
+    PO_STATE_COMPLETED,
+};
+typedef enum POState POState;
+
+struct PostcopyOutgoingState {
+    POState state;
+    QEMUFile *mig_read;
+    int fd_read;
+    RAMBlock *last_block_read;
+
+    QEMUFile *mig_buffered_write;
+    MigrationState *ms;
+
+    /* For nobg mode. Check if all pages are sent */
+    RAMBlock *block;
+    ram_addr_t addr;
+};
+typedef struct PostcopyOutgoingState PostcopyOutgoingState;
+
+int postcopy_outgoing_create_read_socket(MigrationState *s)
+{
+    if (!s->params.postcopy) {
+        return 0;
+    }
+
+    s->fd_read = dup(s->fd);
+    if (s->fd_read == -1) {
+        int ret = -errno;
+        perror("dup");
+        return ret;
+    }
+    s->file_read = qemu_fopen_socket(s->fd_read);
+    if (s->file_read == NULL) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque)
+{
+    int ret = 0;
+    DPRINTF("stage %d\n", stage);
+    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
+        sort_ram_list();
+        ram_save_live_mem_size(f);
+    }
+    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
+        ret = 1;
+    }
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    return ret;
+}
+
+static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
+{
+    RAMBlock *block;
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (!strncmp(idstr, block->idstr, strlen(idstr))) {
+            return block;
+        }
+    }
+    return NULL;
+}
+
+/*
+ * return value
+ *   0: continue postcopy mode
+ * > 0: completed postcopy mode.
+ * < 0: error
+ */
+static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
+                                        const struct qemu_umem_req *req,
+                                        bool *written)
+{
+    int i;
+    RAMBlock *block;
+
+    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
+    switch (req->cmd) {
+    case QEMU_UMEM_REQ_INIT:
+        /* nothing */
+        break;
+    case QEMU_UMEM_REQ_EOC:
+        /* tell to finish migration. */
+        if (s->state == PO_STATE_ALL_PAGES_SENT) {
+            s->state = PO_STATE_COMPLETED;
+            DPRINTF("-> PO_STATE_COMPLETED\n");
+        } else {
+            s->state = PO_STATE_EOC_RECEIVED;
+            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
+        }
+        return 1;
+    case QEMU_UMEM_REQ_ON_DEMAND:
+    case QEMU_UMEM_REQ_BACKGROUND:
+        DPRINTF("idstr: %s\n", req->idstr);
+        block = postcopy_outgoing_find_block(req->idstr);
+        if (block == NULL) {
+            return -EINVAL;
+        }
+        s->last_block_read = block;
+        /* fall through */
+    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
+    case QEMU_UMEM_REQ_BACKGROUND_CONT:
+        DPRINTF("nr %d\n", req->nr);
+        for (i = 0; i < req->nr; i++) {
+            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
+            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
+                                    req->pgoffs[i] << TARGET_PAGE_BITS);
+            if (ret > 0) {
+                *written = true;
+            }
+        }
+        break;
+    case QEMU_UMEM_REQ_REMOVE:
+        block = postcopy_outgoing_find_block(req->idstr);
+        if (block == NULL) {
+            return -EINVAL;
+        }
+        for (i = 0; i < req->nr; i++) {
+            ram_addr_t addr = block->offset +
+                (req->pgoffs[i] << TARGET_PAGE_BITS);
+            cpu_physical_memory_reset_dirty(addr,
+                                            addr + TARGET_PAGE_SIZE,
+                                            MIGRATION_DIRTY_FLAG);
+        }
+        break;
+    default:
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
+{
+    if (s->mig_read != NULL) {
+        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
+        qemu_fclose(s->mig_read);
+        s->mig_read = NULL;
+        fd_close(&s->fd_read);
+
+        s->ms->file_read = NULL;
+        s->ms->fd_read = -1;
+    }
+}
+
+static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
+{
+    postcopy_outgoing_close_mig_read(s);
+    s->ms->postcopy = NULL;
+    g_free(s);
+}
+
+static void postcopy_outgoing_recv_handler(void *opaque)
+{
+    PostcopyOutgoingState *s = opaque;
+    bool written = false;
+    int ret = 0;
+
+    assert(s->state == PO_STATE_ACTIVE ||
+           s->state == PO_STATE_ALL_PAGES_SENT);
+
+    do {
+        struct qemu_umem_req req = {.idstr = NULL,
+                                    .pgoffs = NULL};
+
+        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
+        if (ret < 0) {
+            if (ret == -EAGAIN) {
+                ret = 0;
+            }
+            break;
+        }
+        if (s->state == PO_STATE_ACTIVE) {
+            ret = postcopy_outgoing_handle_req(s, &req, &written);
+        }
+        postcopy_outgoing_free_req(&req);
+    } while (ret == 0);
+
+    /*
+     * Flush the buffered file.
+     * Although mig_buffered_write is a rate-limited buffered file, the
+     * pages written above were requested on demand by the destination,
+     * so push them out immediately, ignoring the rate limit.
+     */
+    if (written) {
+        qemu_fflush(s->mig_buffered_write);
+        /* qemu_buffered_file_drain(s->mig_buffered_write); */
+    }
+
+    if (ret < 0) {
+        switch (s->state) {
+        case PO_STATE_ACTIVE:
+            s->state = PO_STATE_ERROR_RECEIVE;
+            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
+            break;
+        case PO_STATE_ALL_PAGES_SENT:
+            s->state = PO_STATE_COMPLETED;
+            DPRINTF("-> PO_STATE_COMPLETED\n");
+            break;
+        default:
+            abort();
+        }
+    }
+    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
+        postcopy_outgoing_close_mig_read(s);
+    }
+    if (s->state == PO_STATE_COMPLETED) {
+        DPRINTF("PO_STATE_COMPLETED\n");
+        MigrationState *ms = s->ms;
+        postcopy_outgoing_completed(s);
+        migrate_fd_completed(ms);
+    }
+}
+
+void *postcopy_outgoing_begin(MigrationState *ms)
+{
+    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
+    DPRINTF("outgoing begin\n");
+    qemu_fflush(ms->file);
+
+    s->ms = ms;
+    s->state = PO_STATE_ACTIVE;
+    s->fd_read = ms->fd_read;
+    s->mig_read = ms->file_read;
+    s->mig_buffered_write = ms->file;
+    s->block = NULL;
+    s->addr = 0;
+
+    /* Make sure all dirty bits are set */
+    ram_save_memory_set_dirty();
+
+    qemu_set_fd_handler(s->fd_read,
+                        &postcopy_outgoing_recv_handler, NULL, s);
+    return s;
+}
+
+static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
+                                           PostcopyOutgoingState *s)
+{
+    assert(s->state == PO_STATE_ACTIVE);
+
+    s->state = PO_STATE_ALL_PAGES_SENT;
+    /* tell incoming side that all pages are sent */
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(f);
+    qemu_buffered_file_drain(f);
+    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
+    migrate_fd_cleanup(s->ms);
+
+    /* migrate_fd_completed() will be called later, and it calls
+     * migrate_fd_cleanup() again. So a dummy file is created to
+     * keep the qemu monitor working.
+     */
+    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
+                                 NULL, NULL);
+}
+
+static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
+                                                RAMBlock *block,
+                                                ram_addr_t addr)
+{
+    if (block == NULL) {
+        block = QLIST_FIRST(&ram_list.blocks);
+        addr = block->offset;
+    }
+
+    for (; block != NULL;
+         addr = (block = QLIST_NEXT(block, next)) ? block->offset : 0) {
+        for (; addr < block->offset + block->length;
+             addr += TARGET_PAGE_SIZE) {
+            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+                s->block = block;
+                s->addr = addr;
+                return 0;
+            }
+        }
+    }
+
+    return 1;
+}
+
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy)
+{
+    PostcopyOutgoingState *s = postcopy;
+
+    assert(s->state == PO_STATE_ACTIVE ||
+           s->state == PO_STATE_EOC_RECEIVED ||
+           s->state == PO_STATE_ERROR_RECEIVE);
+
+    switch (s->state) {
+    case PO_STATE_ACTIVE:
+        /* nothing. processed below */
+        break;
+    case PO_STATE_EOC_RECEIVED:
+        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+        s->state = PO_STATE_COMPLETED;
+        postcopy_outgoing_completed(s);
+        DPRINTF("PO_STATE_COMPLETED\n");
+        return 1;
+    case PO_STATE_ERROR_RECEIVE:
+        postcopy_outgoing_completed(s);
+        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
+        return -1;
+    default:
+        abort();
+    }
+
+    if (s->ms->params.nobg) {
+        /* See if all pages are sent. */
+        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
+            return 0;
+        }
+        /* ram_list could be reordered (it doesn't seem to happen during
+           migration, though), so the whole list needs to be checked again */
+        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
+            return 0;
+        }
+
+        postcopy_outgoing_ram_all_sent(f, s);
+        return 0;
+    }
+
+    DPRINTF("outgoing background state: %d\n", s->state);
+
+    while (qemu_file_rate_limit(f) == 0) {
+        if (ram_save_block(f) == 0) { /* no more blocks */
+            assert(s->state == PO_STATE_ACTIVE);
+            postcopy_outgoing_ram_all_sent(f, s);
+            return 0;
+        }
+    }
+
+    return 0;
+}
+
+/***************************************************************************
+ * incoming part
+ */
+
+/* Flags that modify the behavior of the incoming side.
+   These are for benchmark/debug purposes. */
+#define INCOMING_FLAGS_FAULT_REQUEST 0x01
+
+
+static void postcopy_incoming_umemd(void);
+
+#define PIS_STATE_QUIT_RECEIVED         0x01
+#define PIS_STATE_QUIT_QUEUED           0x02
+#define PIS_STATE_QUIT_SENT             0x04
+
+#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
+                                         PIS_STATE_QUIT_QUEUED | \
+                                         PIS_STATE_QUIT_SENT)
+
+struct PostcopyIncomingState {
+    /* dest qemu state */
+    uint32_t    state;
+
+    UMemDev *dev;
+    int host_page_size;
+    int host_page_shift;
+
+    /* qemu side */
+    int to_umemd_fd;
+    QEMUFileNonblock *to_umemd;
+#define MAX_FAULTED_PAGES       256
+    struct umem_pages *faulted_pages;
+
+    int from_umemd_fd;
+    QEMUFile *from_umemd;
+    int version_id;     /* save/load format version id */
+};
+typedef struct PostcopyIncomingState PostcopyIncomingState;
+
+
+#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
+#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
+#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
+#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
+#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
+
+#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
+                                         UMEM_STATE_QUIT_SENT | \
+                                         UMEM_STATE_QUIT_RECEIVED)
+#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
+                                         UMEM_STATE_EOC_SENT | \
+                                         UMEM_STATE_QUIT_MASK)
+
+struct PostcopyIncomingUMemDaemon {
+    /* umem daemon side */
+    uint32_t state;
+
+    int host_page_size;
+    int host_page_shift;
+    int nr_host_pages_per_target_page;
+    int host_to_target_page_shift;
+    int nr_target_pages_per_host_page;
+    int target_to_host_page_shift;
+    int version_id;     /* save/load format version id */
+
+    int to_qemu_fd;
+    QEMUFileNonblock *to_qemu;
+    int from_qemu_fd;
+    QEMUFile *from_qemu;
+
+    int mig_read_fd;
+    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
+
+    int mig_write_fd;
+    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
+
+    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPU + 1) */
+#define MAX_REQUESTS    (512 * (64 + 1))
+
+    struct umem_page_request page_request;
+    struct umem_page_cached page_cached;
+
+#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
+    struct umem_pages *present_request;
+
+    uint64_t *target_pgoffs;
+
+    /* bitmap indexed by target page offset */
+    unsigned long *phys_requested;
+
+    /* bitmap indexed by target page offset */
+    unsigned long *phys_received;
+
+    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
+    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
+};
+typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
+
+static PostcopyIncomingState state = {
+    .state = 0,
+    .dev = NULL,
+    .to_umemd_fd = -1,
+    .to_umemd = NULL,
+    .from_umemd_fd = -1,
+    .from_umemd = NULL,
+};
+
+static PostcopyIncomingUMemDaemon umemd = {
+    .state = 0,
+    .to_qemu_fd = -1,
+    .to_qemu = NULL,
+    .from_qemu_fd = -1,
+    .from_qemu = NULL,
+    .mig_read_fd = -1,
+    .mig_read = NULL,
+    .mig_write_fd = -1,
+    .mig_write = NULL,
+};
+
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
+{
+    /* incoming_postcopy makes sense only in incoming migration mode */
+    if (!incoming && incoming_postcopy) {
+        return -EINVAL;
+    }
+
+    if (!incoming_postcopy) {
+        return 0;
+    }
+
+    state.state = 0;
+    state.dev = umem_dev_new();
+    state.host_page_size = getpagesize();
+    state.host_page_shift = ffs(state.host_page_size) - 1;
+    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
+                                               ram_save_live() */
+    return 0;
+}
+
+void postcopy_incoming_ram_alloc(const char *name,
+                                 size_t size, uint8_t **hostp, UMem **umemp)
+{
+    UMem *umem;
+    size = ALIGN_UP(size, state.host_page_size);
+    umem = umem_dev_create(state.dev, size, name);
+
+    *umemp = umem;
+    *hostp = umem->umem;
+}
+
+void postcopy_incoming_ram_free(UMem *umem)
+{
+    umem_unmap(umem);
+    umem_close(umem);
+    umem_destroy(umem);
+}
+
+void postcopy_incoming_prepare(void)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (block->umem != NULL) {
+            umem_mmap(block->umem);
+        }
+    }
+}
+
+static int postcopy_incoming_ram_load_get64(QEMUFile *f,
+                                             ram_addr_t *addr, int *flags)
+{
+    *addr = qemu_get_be64(f);
+    *flags = *addr & ~TARGET_PAGE_MASK;
+    *addr &= TARGET_PAGE_MASK;
+    return qemu_file_get_error(f);
+}
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
+{
+    ram_addr_t addr;
+    int flags;
+    int error;
+
+    DPRINTF("incoming ram load\n");
+    /*
+     * RAM_SAVE_FLAGS_EOS or
+     * RAM_SAVE_FLAGS_MEM_SIZE + mem size + RAM_SAVE_FLAGS_EOS
+     * see postcopy_outgoing_ram_save_live()
+     */
+
+    if (version_id != RAM_SAVE_VERSION_ID) {
+        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
+                version_id, RAM_SAVE_VERSION_ID);
+        return -EINVAL;
+    }
+    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
+    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
+    if (error) {
+        DPRINTF("error %d\n", error);
+        return error;
+    }
+    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
+        DPRINTF("EOS\n");
+        return 0;
+    }
+
+    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
+        DPRINTF("-EINVAL flags 0x%x\n", flags);
+        return -EINVAL;
+    }
+    error = ram_load_mem_size(f, addr);
+    if (error) {
+        DPRINTF("addr 0x%lx error %d\n", addr, error);
+        return error;
+    }
+
+    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
+    if (error) {
+        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
+        return error;
+    }
+    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
+        DPRINTF("done\n");
+        return 0;
+    }
+    DPRINTF("-EINVAL\n");
+    return -EINVAL;
+}
+
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
+{
+    int fds[2];
+    RAMBlock *block;
+
+    DPRINTF("fork\n");
+
+    /* TODO: use socketpair(AF_UNIX) instead of two pipes? */
+
+    if (qemu_pipe(fds) == -1) {
+        perror("qemu_pipe");
+        abort();
+    }
+    state.from_umemd_fd = fds[0];
+    umemd.to_qemu_fd = fds[1];
+
+    if (qemu_pipe(fds) == -1) {
+        perror("qemu_pipe");
+        abort();
+    }
+    umemd.from_qemu_fd = fds[0];
+    state.to_umemd_fd = fds[1];
+
+    pid_t child = fork();
+    if (child < 0) {
+        perror("fork");
+        abort();
+    }
+
+    if (child == 0) {
+        int mig_write_fd;
+
+        fd_close(&state.to_umemd_fd);
+        fd_close(&state.from_umemd_fd);
+        umemd.host_page_size = state.host_page_size;
+        umemd.host_page_shift = state.host_page_shift;
+
+        umemd.nr_host_pages_per_target_page =
+            TARGET_PAGE_SIZE / umemd.host_page_size;
+        umemd.nr_target_pages_per_host_page =
+            umemd.host_page_size / TARGET_PAGE_SIZE;
+
+        umemd.target_to_host_page_shift =
+            ffs(umemd.nr_host_pages_per_target_page) - 1;
+        umemd.host_to_target_page_shift =
+            ffs(umemd.nr_target_pages_per_host_page) - 1;
+
+        umemd.state = 0;
+        umemd.version_id = state.version_id;
+        umemd.mig_read_fd = mig_read_fd;
+        umemd.mig_read = mig_read;
+
+        mig_write_fd = dup(mig_read_fd);
+        if (mig_write_fd < 0) {
+            perror("could not dup for writable socket");
+            abort();
+        }
+        umemd.mig_write_fd = mig_write_fd;
+        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
+
+        postcopy_incoming_umemd(); /* noreturn */
+    }
+
+    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
+    fd_close(&umemd.to_qemu_fd);
+    fd_close(&umemd.from_qemu_fd);
+    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
+    state.faulted_pages->nr = 0;
+
+    /* close all UMem.shmem_fd */
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        umem_close_shmem(block->umem);
+    }
+    umem_qemu_wait_for_daemon(state.from_umemd_fd);
+}
+
+static void postcopy_incoming_qemu_recv_quit(void)
+{
+    RAMBlock *block;
+    if (state.state & PIS_STATE_QUIT_RECEIVED) {
+        return;
+    }
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (block->umem != NULL) {
+            umem_destroy(block->umem);
+            block->umem = NULL;
+            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
+        }
+    }
+
+    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
+    state.state |= PIS_STATE_QUIT_RECEIVED;
+    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
+    qemu_fclose(state.from_umemd);
+    state.from_umemd = NULL;
+    fd_close(&state.from_umemd_fd);
+}
+
+static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
+{
+    assert(state.to_umemd != NULL);
+
+    nonblock_fflush(state.to_umemd);
+    if (nonblock_pending_size(state.to_umemd) > 0) {
+        return;
+    }
+
+    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
+    if (state.state & PIS_STATE_QUIT_QUEUED) {
+        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
+        state.state |= PIS_STATE_QUIT_SENT;
+        qemu_fclose(state.to_umemd->file);
+        state.to_umemd = NULL;
+        fd_close(&state.to_umemd_fd);
+        g_free(state.faulted_pages);
+        state.faulted_pages = NULL;
+    }
+}
+
+static void postcopy_incoming_qemu_fflush_to_umemd(void)
+{
+    qemu_set_fd_handler(state.to_umemd->fd, NULL,
+                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
+    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
+}
+
+static void postcopy_incoming_qemu_queue_quit(void)
+{
+    if (state.state & PIS_STATE_QUIT_QUEUED) {
+        return;
+    }
+
+    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
+    umem_qemu_quit(state.to_umemd->file);
+    state.state |= PIS_STATE_QUIT_QUEUED;
+}
+
+static void postcopy_incoming_qemu_send_pages_present(void)
+{
+    if (state.faulted_pages->nr > 0) {
+        umem_qemu_send_pages_present(state.to_umemd->file,
+                                     state.faulted_pages);
+        state.faulted_pages->nr = 0;
+    }
+}
+
+static void postcopy_incoming_qemu_faulted_pages(
+    const struct umem_pages *pages)
+{
+    assert(pages->nr <= MAX_FAULTED_PAGES);
+    assert(state.faulted_pages != NULL);
+
+    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
+        postcopy_incoming_qemu_send_pages_present();
+    }
+    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
+           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
+    state.faulted_pages->nr += pages->nr;
+}
+
+static void postcopy_incoming_qemu_cleanup_umem(void);
+
+static int postcopy_incoming_qemu_handle_req_one(void)
+{
+    int offset = 0;
+    int ret;
+    uint8_t cmd;
+
+    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
+    offset += sizeof(cmd);
+    if (ret != sizeof(cmd)) {
+        return -EAGAIN;
+    }
+    DPRINTF("cmd %c\n", cmd);
+
+    switch (cmd) {
+    case UMEM_DAEMON_QUIT:
+        postcopy_incoming_qemu_recv_quit();
+        postcopy_incoming_qemu_queue_quit();
+        postcopy_incoming_qemu_cleanup_umem();
+        break;
+    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
+        struct umem_pages *pages =
+            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
+            postcopy_incoming_qemu_faulted_pages(pages);
+            g_free(pages);
+        }
+        break;
+    }
+    case UMEM_DAEMON_ERROR:
+        /* the umem daemon hit trouble and warned us to stop VM execution */
+        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
+        break;
+    default:
+        abort();
+        break;
+    }
+
+    if (state.from_umemd != NULL) {
+        qemu_file_skip(state.from_umemd, offset);
+    }
+    return 0;
+}
+
+static void postcopy_incoming_qemu_handle_req(void *opaque)
+{
+    do {
+        int ret = postcopy_incoming_qemu_handle_req_one();
+        if (ret == -EAGAIN) {
+            break;
+        }
+    } while (state.from_umemd != NULL &&
+             qemu_pending_size(state.from_umemd) > 0);
+
+    if (state.to_umemd != NULL) {
+        if (state.faulted_pages->nr > 0) {
+            postcopy_incoming_qemu_send_pages_present();
+        }
+        postcopy_incoming_qemu_fflush_to_umemd();
+    }
+}
+
+void postcopy_incoming_qemu_ready(void)
+{
+    umem_qemu_ready(state.to_umemd_fd);
+
+    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
+    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
+    qemu_set_fd_handler(state.from_umemd_fd,
+                        postcopy_incoming_qemu_handle_req, NULL, NULL);
+}
+
+static void postcopy_incoming_qemu_cleanup_umem(void)
+{
+    /* If qemu quits before postcopy completes, tell the umem daemon
+       to tear down the umem device and exit. */
+    if (state.to_umemd_fd >= 0) {
+        postcopy_incoming_qemu_queue_quit();
+        postcopy_incoming_qemu_fflush_to_umemd();
+    }
+
+    if (state.dev) {
+        umem_dev_destroy(state.dev);
+        state.dev = NULL;
+    }
+}
+
+void postcopy_incoming_qemu_cleanup(void)
+{
+    postcopy_incoming_qemu_cleanup_umem();
+    if (state.to_umemd != NULL) {
+        nonblock_wait_for_flush(state.to_umemd);
+    }
+}
+
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
+{
+    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
+    size_t len = umem_pages_size(nr);
+    ram_addr_t end = addr + size;
+    struct umem_pages *pages;
+    int i;
+
+    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
+        return;
+    }
+    pages = g_malloc(len);
+    pages->nr = nr;
+    for (i = 0; addr < end; addr += state.host_page_size, i++) {
+        pages->pgoffs[i] = addr >> state.host_page_shift;
+    }
+    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
+    g_free(pages);
+    assert(state.to_umemd != NULL);
+    postcopy_incoming_qemu_fflush_to_umemd();
+}
+
+/**************************************************************************
+ * incoming umem daemon
+ */
+
+static void postcopy_incoming_umem_recv_quit(void)
+{
+    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
+        return;
+    }
+    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
+    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
+    qemu_fclose(umemd.from_qemu);
+    umemd.from_qemu = NULL;
+    fd_close(&umemd.from_qemu_fd);
+}
+
+static void postcopy_incoming_umem_queue_quit(void)
+{
+    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
+        return;
+    }
+    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
+    umem_daemon_quit(umemd.to_qemu->file);
+    umemd.state |= UMEM_STATE_QUIT_QUEUED;
+}
+
+static void postcopy_incoming_umem_send_eoc_req(void)
+{
+    struct qemu_umem_req req;
+
+    if (umemd.state & UMEM_STATE_EOC_SENT) {
+        return;
+    }
+
+    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
+    req.cmd = QEMU_UMEM_REQ_EOC;
+    postcopy_incoming_send_req(umemd.mig_write->file, &req);
+    umemd.state |= UMEM_STATE_EOC_SENT;
+    qemu_fclose(umemd.mig_write->file);
+    umemd.mig_write = NULL;
+    fd_close(&umemd.mig_write_fd);
+}
+
+static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
+{
+    struct qemu_umem_req req;
+    int bit;
+    uint64_t target_pgoff;
+    int i;
+
+    umemd.page_request.nr = MAX_REQUESTS;
+    umem_get_page_request(block->umem, &umemd.page_request);
+    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
+            block->idstr, umemd.page_request.nr,
+            (uint64_t)umemd.page_request.pgoffs[0],
+            (uint64_t)umemd.page_request.pgoffs[1]);
+
+    if (umemd.last_block_write != block) {
+        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
+        req.idstr = block->idstr;
+    } else {
+        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
+    }
+
+    req.nr = 0;
+    req.pgoffs = umemd.target_pgoffs;
+    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
+        for (i = 0; i < umemd.page_request.nr; i++) {
+            target_pgoff =
+                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
+            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
+
+            if (!test_and_set_bit(bit, umemd.phys_requested)) {
+                req.pgoffs[req.nr] = target_pgoff;
+                req.nr++;
+            }
+        }
+    } else {
+        for (i = 0; i < umemd.page_request.nr; i++) {
+            int j;
+            target_pgoff =
+                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
+            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
+
+            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
+                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
+                    req.pgoffs[req.nr] = target_pgoff + j;
+                    req.nr++;
+                }
+            }
+        }
+    }
+
+    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
+            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
+    if (req.nr > 0 && umemd.mig_write != NULL) {
+        postcopy_incoming_send_req(umemd.mig_write->file, &req);
+        umemd.last_block_write = block;
+    }
+}
+
+static void postcopy_incoming_umem_send_pages_present(void)
+{
+    if (umemd.present_request->nr > 0) {
+        umem_daemon_send_pages_present(umemd.to_qemu->file,
+                                       umemd.present_request);
+        umemd.present_request->nr = 0;
+    }
+}
+
+static void postcopy_incoming_umem_pages_present_one(
+    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
+{
+    uint32_t i;
+    assert(nr <= MAX_PRESENT_REQUESTS);
+
+    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
+        postcopy_incoming_umem_send_pages_present();
+    }
+
+    for (i = 0; i < nr; i++) {
+        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
+            pgoffs[i] + ramblock_pgoffset;
+    }
+    umemd.present_request->nr += nr;
+}
+
+static void postcopy_incoming_umem_pages_present(
+    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
+{
+    uint32_t left = page_cached->nr;
+    uint32_t offset = 0;
+
+    while (left > 0) {
+        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
+        postcopy_incoming_umem_pages_present_one(
+            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
+
+        left -= nr;
+        offset += nr;
+    }
+}
+
+static int postcopy_incoming_umem_ram_load(void)
+{
+    ram_addr_t offset;
+    int flags;
+    int error;
+    void *shmem;
+    int i;
+    int bit;
+
+    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
+        return -EINVAL;
+    }
+
+    offset = qemu_get_be64(umemd.mig_read);
+
+    flags = offset & ~TARGET_PAGE_MASK;
+    offset &= TARGET_PAGE_MASK;
+
+    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
+
+    if (flags & RAM_SAVE_FLAG_EOS) {
+        DPRINTF("RAM_SAVE_FLAG_EOS\n");
+        postcopy_incoming_umem_send_eoc_req();
+
+        qemu_fclose(umemd.mig_read);
+        umemd.mig_read = NULL;
+        fd_close(&umemd.mig_read_fd);
+        umemd.state |= UMEM_STATE_EOS_RECEIVED;
+
+        postcopy_incoming_umem_queue_quit();
+        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
+        return 0;
+    }
+
+    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
+                                             &umemd.last_block_read);
+    if (!shmem) {
+        DPRINTF("shmem == NULL\n");
+        return -EINVAL;
+    }
+
+    if (flags & RAM_SAVE_FLAG_COMPRESS) {
+        uint8_t ch = qemu_get_byte(umemd.mig_read);
+        memset(shmem, ch, TARGET_PAGE_SIZE);
+    } else if (flags & RAM_SAVE_FLAG_PAGE) {
+        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
+    }
+
+    error = qemu_file_get_error(umemd.mig_read);
+    if (error) {
+        DPRINTF("error %d\n", error);
+        return error;
+    }
+
+    umemd.page_cached.nr = 0;
+    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
+    if (!test_and_set_bit(bit, umemd.phys_received)) {
+        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
+            __u64 pgoff = offset >> umemd.host_page_shift;
+            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
+                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
+                umemd.page_cached.nr++;
+            }
+        } else {
+            bool mark_cache = true;
+            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
+                if (!test_bit(bit + i, umemd.phys_received)) {
+                    mark_cache = false;
+                    break;
+                }
+            }
+            if (mark_cache) {
+                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
+                umemd.page_cached.nr = 1;
+            }
+        }
+    }
+
+    if (umemd.page_cached.nr > 0) {
+        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
+
+        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
+            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
+            uint64_t ramblock_pgoffset;
+
+            ramblock_pgoffset =
+                umemd.last_block_read->offset >> umemd.host_page_shift;
+            postcopy_incoming_umem_pages_present(&umemd.page_cached,
+                                                 ramblock_pgoffset);
+        }
+    }
+
+    return 0;
+}
+
+static bool postcopy_incoming_umem_check_umem_done(void)
+{
+    bool all_done = true;
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        UMem *umem = block->umem;
+        if (umem != NULL && umem->nsets == umem->nbits) {
+            umem_unmap_shmem(umem);
+            umem_destroy(umem);
+            block->umem = NULL;
+        }
+        if (block->umem != NULL) {
+            all_done = false;
+        }
+    }
+    return all_done;
+}
+
+static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
+{
+    int i;
+
+    for (i = 0; i < pages->nr; i++) {
+        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
+        RAMBlock *block = qemu_get_ram_block(addr);
+        addr -= block->offset;
+        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
+    }
+    return postcopy_incoming_umem_check_umem_done();
+}
+
+static bool
+postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
+{
+    RAMBlock *block;
+    ram_addr_t addr;
+    int i;
+
+    struct qemu_umem_req req = {
+        .cmd = QEMU_UMEM_REQ_REMOVE,
+        .nr = 0,
+        .pgoffs = (uint64_t *)pages->pgoffs,
+    };
+
+    addr = pages->pgoffs[0] << umemd.host_page_shift;
+    block = qemu_get_ram_block(addr);
+
+    for (i = 0; i < pages->nr; i++) {
+        int pgoff;
+
+        addr = pages->pgoffs[i] << umemd.host_page_shift;
+        pgoff = addr >> TARGET_PAGE_BITS;
+        if (!test_bit(pgoff, umemd.phys_received) &&
+            !test_bit(pgoff, umemd.phys_requested)) {
+            req.pgoffs[req.nr] = pgoff;
+            req.nr++;
+        }
+        set_bit(pgoff, umemd.phys_received);
+        set_bit(pgoff, umemd.phys_requested);
+
+        umem_remove_shmem(block->umem,
+                          addr - block->offset, umemd.host_page_size);
+    }
+    if (req.nr > 0 && umemd.mig_write != NULL) {
+        req.idstr = block->idstr;
+        postcopy_incoming_send_req(umemd.mig_write->file, &req);
+    }
+
+    return postcopy_incoming_umem_check_umem_done();
+}
+
+static void postcopy_incoming_umem_done(void)
+{
+    postcopy_incoming_umem_send_eoc_req();
+    postcopy_incoming_umem_queue_quit();
+}
+
+static int postcopy_incoming_umem_handle_qemu(void)
+{
+    int ret;
+    int offset = 0;
+    uint8_t cmd;
+
+    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
+    offset += sizeof(cmd);
+    if (ret != sizeof(cmd)) {
+        return -EAGAIN;
+    }
+    DPRINTF("cmd %c\n", cmd);
+    switch (cmd) {
+    case UMEM_QEMU_QUIT:
+        postcopy_incoming_umem_recv_quit();
+        postcopy_incoming_umem_done();
+        break;
+    case UMEM_QEMU_PAGE_FAULTED: {
+        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
+                                                   &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (postcopy_incoming_umem_page_faulted(pages)) {
+            postcopy_incoming_umem_done();
+        }
+        g_free(pages);
+        break;
+    }
+    case UMEM_QEMU_PAGE_UNMAPPED: {
+        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
+                                                   &offset);
+        if (pages == NULL) {
+            return -EAGAIN;
+        }
+        if (postcopy_incoming_umem_page_unmapped(pages)) {
+            postcopy_incoming_umem_done();
+        }
+        g_free(pages);
+        break;
+    }
+    default:
+        abort();
+        break;
+    }
+    if (umemd.from_qemu != NULL) {
+        qemu_file_skip(umemd.from_qemu, offset);
+    }
+    return 0;
+}
+
+static void set_fd(int fd, fd_set *fds, int *nfds)
+{
+    FD_SET(fd, fds);
+    if (fd > *nfds) {
+        *nfds = fd;
+    }
+}
+
+static int postcopy_incoming_umemd_main_loop(void)
+{
+    fd_set writefds;
+    fd_set readfds;
+    int nfds;
+    RAMBlock *block;
+    int ret;
+
+    int pending_size;
+    bool get_page_request;
+
+    nfds = -1;
+    FD_ZERO(&writefds);
+    FD_ZERO(&readfds);
+
+    if (umemd.mig_write != NULL) {
+        pending_size = nonblock_pending_size(umemd.mig_write);
+        if (pending_size > 0) {
+            set_fd(umemd.mig_write_fd, &writefds, &nfds);
+        }
+    } else {
+        pending_size = 0;
+    }
+
+#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
+    /* If page requests to the migration source have accumulated,
+       suspend accepting new page fault requests. */
+    get_page_request = (pending_size <= PENDING_SIZE_MAX);
+
+    if (get_page_request) {
+        QLIST_FOREACH(block, &ram_list.blocks, next) {
+            if (block->umem != NULL) {
+                set_fd(block->umem->fd, &readfds, &nfds);
+            }
+        }
+    }
+
+    if (umemd.mig_read_fd >= 0) {
+        set_fd(umemd.mig_read_fd, &readfds, &nfds);
+    }
+
+    if (umemd.to_qemu != NULL &&
+        nonblock_pending_size(umemd.to_qemu) > 0) {
+        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
+    }
+    if (umemd.from_qemu_fd >= 0) {
+        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
+    }
+
+    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
+    if (ret == -1) {
+        if (errno == EINTR) {
+            return 0;
+        }
+        return ret;
+    }
+
+    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
+        nonblock_fflush(umemd.mig_write);
+    }
+    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
+        nonblock_fflush(umemd.to_qemu);
+    }
+    if (get_page_request) {
+        QLIST_FOREACH(block, &ram_list.blocks, next) {
+            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
+                postcopy_incoming_umem_send_page_req(block);
+            }
+        }
+    }
+    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
+        do {
+            ret = postcopy_incoming_umem_ram_load();
+            if (ret < 0) {
+                return ret;
+            }
+        } while (umemd.mig_read != NULL &&
+                 qemu_pending_size(umemd.mig_read) > 0);
+    }
+    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
+        do {
+            ret = postcopy_incoming_umem_handle_qemu();
+            if (ret == -EAGAIN) {
+                break;
+            }
+        } while (umemd.from_qemu != NULL &&
+                 qemu_pending_size(umemd.from_qemu) > 0);
+    }
+
+    if (umemd.mig_write != NULL) {
+        nonblock_fflush(umemd.mig_write);
+    }
+    if (umemd.to_qemu != NULL) {
+        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
+            postcopy_incoming_umem_send_pages_present();
+        }
+        nonblock_fflush(umemd.to_qemu);
+        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
+            nonblock_pending_size(umemd.to_qemu) == 0) {
+            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
+            qemu_fclose(umemd.to_qemu->file);
+            umemd.to_qemu = NULL;
+            fd_close(&umemd.to_qemu_fd);
+            umemd.state |= UMEM_STATE_QUIT_SENT;
+        }
+    }
+
+    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
+}
+
+static void postcopy_incoming_umemd(void)
+{
+    ram_addr_t last_ram_offset;
+    int nbits;
+    RAMBlock *block;
+    int ret;
+
+    qemu_daemon(1, 1);
+    signal(SIGPIPE, SIG_IGN);
+    DPRINTF("daemon pid: %d\n", getpid());
+
+    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
+    umemd.page_cached.pgoffs =
+        g_new(__u64, MAX_REQUESTS *
+              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
+               1: umemd.nr_host_pages_per_target_page));
+    umemd.target_pgoffs =
+        g_new(uint64_t, MAX_REQUESTS *
+              MAX(umemd.nr_host_pages_per_target_page,
+                  umemd.nr_target_pages_per_host_page));
+    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
+    umemd.present_request->nr = 0;
+
+    last_ram_offset = qemu_last_ram_offset();
+    nbits = last_ram_offset >> TARGET_PAGE_BITS;
+    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
+    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
+    umemd.last_block_read = NULL;
+    umemd.last_block_write = NULL;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        UMem *umem = block->umem;
+        umem->umem = NULL;      /* the umem mapping area has VM_DONTCOPY set,
+                                   so the mappings were lost across fork() */
+        block->host = umem_map_shmem(umem);
+        umem_close_shmem(umem);
+    }
+    umem_daemon_ready(umemd.to_qemu_fd);
+    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
+
+    /* wait for qemu to disown migration_fd */
+    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
+    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
+
+    DPRINTF("entering umemd main loop\n");
+    for (;;) {
+        ret = postcopy_incoming_umemd_main_loop();
+        if (ret != 0) {
+            break;
+        }
+    }
+    DPRINTF("exiting umemd main loop\n");
+
+    /* This daemon was forked from qemu, and the parent qemu is still running.
+     * Cleanups of linked libraries like SDL must not be triggered here,
+     * otherwise the parent qemu may use resources which were already freed.
+     */
+    fflush(stdout);
+    fflush(stderr);
+    _exit(ret < 0 ? EXIT_FAILURE : 0);
+}
diff --git a/migration-tcp.c b/migration-tcp.c
index cf6a9b8..aa35050 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
     } while (ret == -1 && (socket_error()) == EINTR);
 
     if (ret < 0) {
-        migrate_fd_error(s);
-        return;
+        goto error_out;
     }
 
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
 
-    if (val == 0)
+    if (val == 0) {
+        ret = postcopy_outgoing_create_read_socket(s);
+        if (ret < 0) {
+            goto error_out;
+        }
         migrate_fd_connect(s);
-    else {
+    } else {
         DPRINTF("error connecting %d\n", val);
-        migrate_fd_error(s);
+        goto error_out;
     }
+    return;
+
+error_out:
+    migrate_fd_error(s);
 }
 
 int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
@@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
 
     if (ret < 0) {
         DPRINTF("connect failed\n");
-        migrate_fd_error(s);
-        return ret;
+        goto error_out;
+    }
+
+    ret = postcopy_outgoing_create_read_socket(s);
+    if (ret < 0) {
+        goto error_out;
     }
     migrate_fd_connect(s);
     return 0;
+
+error_out:
+    migrate_fd_error(s);
+    return ret;
 }
 
 static void tcp_accept_incoming_migration(void *opaque)
@@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
     }
 
     process_incoming_migration(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(c, f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        /* the socket is now disowned,
+           so tell the umem server that it's safe to use it */
+        postcopy_incoming_qemu_ready();
+    }
 out:
     close(c);
 out2:
diff --git a/migration-unix.c b/migration-unix.c
index dfcf203..3707505 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
 
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
 
-    if (val == 0)
+    if (val == 0) {
+        ret = postcopy_outgoing_create_read_socket(s);
+        if (ret < 0) {
+            goto error_out;
+        }
         migrate_fd_connect(s);
-    else {
+    } else {
         DPRINTF("error connecting %d\n", val);
-        migrate_fd_error(s);
+        goto error_out;
     }
+    return;
+
+error_out:
+    migrate_fd_error(s);
 }
 
 int unix_start_outgoing_migration(MigrationState *s, const char *path)
@@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
 
     if (ret < 0) {
         DPRINTF("connect failed\n");
-        migrate_fd_error(s);
-        return ret;
+        goto error_out;
+    }
+
+    ret = postcopy_outgoing_create_read_socket(s);
+    if (ret < 0) {
+        goto error_out;
     }
     migrate_fd_connect(s);
     return 0;
+
+error_out:
+    migrate_fd_error(s);
+    return ret;
 }
 
 static void unix_accept_incoming_migration(void *opaque)
@@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
     }
 
     process_incoming_migration(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_fork_umemd(c, f);
+    }
     qemu_fclose(f);
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_ready();
+    }
 out:
     close(c);
 out2:
diff --git a/migration.c b/migration.c
index 0149ab3..51efe44 100644
--- a/migration.c
+++ b/migration.c
@@ -39,6 +39,11 @@ enum {
     MIG_STATE_COMPLETED,
 };
 
+enum {
+    MIG_SUBSTATE_PRECOPY,
+    MIG_SUBSTATE_POSTCOPY,
+};
+
 #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
 
 static NotifierList migration_state_notifiers =
@@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
         return;
     }
 
+    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
+        /* PRINTF("postcopy background\n"); */
+        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
+                                                    s->postcopy);
+        if (ret > 0) {
+            migrate_fd_completed(s);
+        } else if (ret < 0) {
+            migrate_fd_error(s);
+        }
+        return;
+    }
+
     DPRINTF("iterate\n");
     ret = qemu_savevm_state_iterate(s->mon, s->file);
     if (ret < 0) {
@@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
         DPRINTF("done iterating\n");
         vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
 
+        if (s->params.postcopy) {
+            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
+                migrate_fd_error(s);
+                if (old_vm_running) {
+                    vm_start();
+                }
+                return;
+            }
+            s->substate = MIG_SUBSTATE_POSTCOPY;
+            s->postcopy = postcopy_outgoing_begin(s);
+            return;
+        }
+
         if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
             migrate_fd_error(s);
         } else {
@@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
     int ret;
 
     s->state = MIG_STATE_ACTIVE;
+    s->substate = MIG_SUBSTATE_PRECOPY;
     s->file = qemu_fopen_ops_buffered(s,
                                       s->bandwidth_limit,
                                       migrate_fd_put_buffer,
diff --git a/migration.h b/migration.h
index 90ae362..2809e99 100644
--- a/migration.h
+++ b/migration.h
@@ -40,6 +40,12 @@ struct MigrationState
     int (*write)(MigrationState *s, const void *buff, size_t size);
     void *opaque;
     MigrationParams params;
+
+    /* for postcopy */
+    int substate;              /* precopy or postcopy */
+    int fd_read;
+    QEMUFile *file_read;        /* connection from the destination */
+    void *postcopy;
 };
 
 void process_incoming_migration(QEMUFile *f);
@@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
 uint64_t ram_bytes_total(void);
 
+void ram_save_set_params(const MigrationParams *params, void *opaque);
 void sort_ram_list(void);
 int ram_save_block(QEMUFile *f);
 void ram_save_memory_set_dirty(void);
@@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
  */
 void migrate_del_blocker(Error *reason);
 
+/* For outgoing postcopy */
+int postcopy_outgoing_create_read_socket(MigrationState *s);
+int postcopy_outgoing_ram_save_live(Monitor *mon,
+                                    QEMUFile *f, int stage, void *opaque);
+void *postcopy_outgoing_begin(MigrationState *s);
+int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
+                                          void *postcopy);
+
+/* For incoming postcopy */
 extern bool incoming_postcopy;
 extern unsigned long incoming_postcopy_flags;
 
+int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
+void postcopy_incoming_ram_alloc(const char *name,
+                                 size_t size, uint8_t **hostp, UMem **umemp);
+void postcopy_incoming_ram_free(UMem *umem);
+void postcopy_incoming_prepare(void);
+
+int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
+void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
+void postcopy_incoming_qemu_ready(void);
+void postcopy_incoming_qemu_cleanup(void);
+#ifdef NEED_CPU_H
+void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
+#endif
+
 #endif
diff --git a/qemu-common.h b/qemu-common.h
index 725922b..d74a8c9 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
 
 struct Monitor;
 typedef struct Monitor Monitor;
+typedef struct UMem UMem;
 
 /* we put basic includes here to avoid repeating them in device drivers */
 #include <stdlib.h>
diff --git a/qemu-options.hx b/qemu-options.hx
index 5c5b8f3..19e20f9 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
     "-postcopy-flags unsigned-int(flags)\n"
     "	                flags for postcopy incoming migration\n"
     "                   when -incoming and -postcopy are specified.\n"
-    "                   This is for benchmark/debug purpose (default: 0)\n",
+    "                   This is for benchmark/debug purpose (default: 0)\n"
+    "                   Currently supprted flags are\n"
+    "                   1: enable fault request from umemd to qemu\n"
+    "                      (default: disabled)\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -postcopy-flags int
diff --git a/umem.c b/umem.c
new file mode 100644
index 0000000..b7be006
--- /dev/null
+++ b/umem.c
@@ -0,0 +1,379 @@
+/*
+ * umem.c: user process backed memory module for postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/umem.h>
+
+#include "bitops.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "umem.h"
+
+//#define DEBUG_UMEM
+#ifdef DEBUG_UMEM
+#include <sys/syscall.h>
+#define DPRINTF(format, ...)                                            \
+    do {                                                                \
+        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
+               __func__, __LINE__, ## __VA_ARGS__);                     \
+    } while (0)
+#else
+#define DPRINTF(format, ...)    do { } while (0)
+#endif
+
+#define DEV_UMEM        "/dev/umem"
+
+struct UMemDev {
+    int fd;
+    int page_shift;
+};
+
+UMemDev *umem_dev_new(void)
+{
+    UMemDev *umem_dev;
+    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
+    if (umem_dev_fd < 0) {
+        perror("can't open "DEV_UMEM);
+        abort();
+    }
+
+    umem_dev = g_new(UMemDev, 1);
+    umem_dev->fd = umem_dev_fd;
+    umem_dev->page_shift = ffs(getpagesize()) - 1;
+    return umem_dev;
+}
+
+void umem_dev_destroy(UMemDev *dev)
+{
+    close(dev->fd);
+    g_free(dev);
+}
+
+UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
+{
+    struct umem_create create = {
+        .size = size,
+        .async_req_max = 0,
+        .sync_req_max = 0,
+    };
+    UMem *umem;
+
+    snprintf(create.name.id, sizeof(create.name.id),
+             "pid-%"PRId64, (uint64_t)getpid());
+    create.name.id[UMEM_ID_MAX - 1] = 0;
+    strncpy(create.name.name, name, sizeof(create.name.name));
+    create.name.name[UMEM_NAME_MAX - 1] = 0;
+
+    assert((size % getpagesize()) == 0);
+    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
+        perror("UMEM_DEV_CREATE_UMEM");
+        abort();
+    }
+    if (ftruncate(create.shmem_fd, create.size) < 0) {
+        perror("truncate(\"shmem_fd\")");
+        abort();
+    }
+
+    umem = g_new(UMem, 1);
+    umem->nbits = 0;
+    umem->nsets = 0;
+    umem->faulted = NULL;
+    umem->page_shift = dev->page_shift;
+    umem->fd = create.umem_fd;
+    umem->shmem_fd = create.shmem_fd;
+    umem->size = create.size;
+    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
+                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (umem->umem == MAP_FAILED) {
+        perror("mmap(UMem) failed");
+        abort();
+    }
+    return umem;
+}
+
+void umem_mmap(UMem *umem)
+{
+    void *ret = mmap(umem->umem, umem->size,
+                     PROT_EXEC | PROT_READ | PROT_WRITE,
+                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
+    if (ret == MAP_FAILED) {
+        perror("umem_mmap(UMem) failed");
+        abort();
+    }
+}
+
+void umem_destroy(UMem *umem)
+{
+    if (umem->fd != -1) {
+        close(umem->fd);
+    }
+    if (umem->shmem_fd != -1) {
+        close(umem->shmem_fd);
+    }
+    g_free(umem->faulted);
+    g_free(umem);
+}
+
+void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
+{
+    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
+        perror("daemon: UMEM_GET_PAGE_REQUEST");
+        abort();
+    }
+}
+
+void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
+{
+    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
+        perror("daemon: UMEM_MARK_PAGE_CACHED");
+        abort();
+    }
+}
+
+void umem_unmap(UMem *umem)
+{
+    munmap(umem->umem, umem->size);
+    umem->umem = NULL;
+}
+
+void umem_close(UMem *umem)
+{
+    close(umem->fd);
+    umem->fd = -1;
+}
+
+void *umem_map_shmem(UMem *umem)
+{
+    umem->nbits = umem->size >> umem->page_shift;
+    umem->nsets = 0;
+    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
+
+    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
+                       umem->shmem_fd, 0);
+    if (umem->shmem == MAP_FAILED) {
+        perror("daemon: mmap(\"shmem\")");
+        abort();
+    }
+    return umem->shmem;
+}
+
+void umem_unmap_shmem(UMem *umem)
+{
+    munmap(umem->shmem, umem->size);
+    umem->shmem = NULL;
+}
+
+void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
+{
+    int s = offset >> umem->page_shift;
+    int e = (offset + size) >> umem->page_shift;
+    int i;
+
+    for (i = s; i < e; i++) {
+        if (!test_and_set_bit(i, umem->faulted)) {
+            umem->nsets++;
+#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
+            madvise(umem->shmem + offset, size, MADV_REMOVE);
+#endif
+        }
+    }
+}
+
+void umem_close_shmem(UMem *umem)
+{
+    close(umem->shmem_fd);
+    umem->shmem_fd = -1;
+}
+
+/***************************************************************************/
+/* qemu <-> umem daemon communication */
+
+size_t umem_pages_size(uint64_t nr)
+{
+    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
+}
+
+static void umem_write_cmd(int fd, uint8_t cmd)
+{
+    DPRINTF("write cmd %c\n", cmd);
+
+    for (;;) {
+        ssize_t ret = write(fd, &cmd, 1);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == EPIPE) {
+                perror("pipe");
+                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
+                        cmd, ret, errno);
+                break;
+            }
+
+            perror("pipe");
+            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
+            abort();
+        }
+
+        break;
+    }
+}
+
+static void umem_read_cmd(int fd, uint8_t expect)
+{
+    uint8_t cmd;
+    for (;;) {
+        ssize_t ret = read(fd, &cmd, 1);
+        if (ret == -1) {
+            if (errno == EINTR) {
+                continue;
+            }
+            perror("pipe");
+            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
+            abort();
+        }
+
+        if (ret == 0) {
+            DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
+            abort();
+        }
+
+        break;
+    }
+
+    DPRINTF("read cmd %c\n", cmd);
+    if (cmd != expect) {
+        DPRINTF("cmd %c expect %d\n", cmd, expect);
+        abort();
+    }
+}
+
+struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
+{
+    int ret;
+    uint64_t nr;
+    size_t size;
+    struct umem_pages *pages;
+
+    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
+    *offset += sizeof(nr);
+    DPRINTF("ret %d nr %ld\n", ret, nr);
+    if (ret != sizeof(nr) || nr == 0) {
+        return NULL;
+    }
+
+    size = umem_pages_size(nr);
+    pages = g_malloc(size);
+    pages->nr = nr;
+    size -= sizeof(pages->nr);
+
+    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
+    *offset += size;
+    if (ret != size) {
+        g_free(pages);
+        return NULL;
+    }
+    return pages;
+}
+
+static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
+{
+    size_t len = umem_pages_size(pages->nr);
+    qemu_put_buffer(f, (const uint8_t*)pages, len);
+}
+
+/* umem daemon -> qemu */
+void umem_daemon_ready(int to_qemu_fd)
+{
+    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
+}
+
+void umem_daemon_quit(QEMUFile *to_qemu)
+{
+    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
+}
+
+void umem_daemon_send_pages_present(QEMUFile *to_qemu,
+                                    struct umem_pages *pages)
+{
+    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
+    umem_send_pages(to_qemu, pages);
+}
+
+void umem_daemon_wait_for_qemu(int from_qemu_fd)
+{
+    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
+}
+
+/* qemu -> umem daemon */
+void umem_qemu_wait_for_daemon(int from_umemd_fd)
+{
+    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
+}
+
+void umem_qemu_ready(int to_umemd_fd)
+{
+    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
+}
+
+void umem_qemu_quit(QEMUFile *to_umemd)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
+}
+
+/* qemu side handler */
+struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
+                                                int *offset)
+{
+    uint64_t i;
+    int page_shift = ffs(getpagesize()) - 1;
+    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
+    if (pages == NULL) {
+        return NULL;
+    }
+
+    for (i = 0; i < pages->nr; i++) {
+        ram_addr_t addr = pages->pgoffs[i] << page_shift;
+
+        /* make pages present by forcibly triggering page fault. */
+        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
+        uint8_t dummy_read = ram[0];
+        (void)dummy_read;   /* suppress unused variable warning */
+    }
+
+    return pages;
+}
+
+void umem_qemu_send_pages_present(QEMUFile *to_umemd,
+                                  const struct umem_pages *pages)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
+    umem_send_pages(to_umemd, pages);
+}
+
+void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
+                                   const struct umem_pages *pages)
+{
+    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
+    umem_send_pages(to_umemd, pages);
+}
diff --git a/umem.h b/umem.h
new file mode 100644
index 0000000..5ca19ef
--- /dev/null
+++ b/umem.h
@@ -0,0 +1,105 @@
+/*
+ * umem.h: user process backed memory module for postcopy livemigration
+ *
+ * Copyright (c) 2011
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef QEMU_UMEM_H
+#define QEMU_UMEM_H
+
+#include <linux/umem.h>
+
+#include "qemu-common.h"
+
+typedef struct UMemDev UMemDev;
+
+struct UMem {
+    void *umem;
+    int fd;
+    void *shmem;
+    int shmem_fd;
+    uint64_t size;
+
+    /* indexed by host page size */
+    int page_shift;
+    int nbits;
+    int nsets;
+    unsigned long *faulted;
+};
+
+UMemDev *umem_dev_new(void);
+void umem_dev_destroy(UMemDev *dev);
+UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
+void umem_mmap(UMem *umem);
+
+void umem_destroy(UMem *umem);
+
+/* umem device operations */
+void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
+void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
+void umem_unmap(UMem *umem);
+void umem_close(UMem *umem);
+
+/* umem shmem operations */
+void *umem_map_shmem(UMem *umem);
+void umem_unmap_shmem(UMem *umem);
+void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
+void umem_close_shmem(UMem *umem);
+
+/* qemu on source <-> umem daemon communication */
+
+struct umem_pages {
+    uint64_t nr;        /* nr = 0 means completed */
+    uint64_t pgoffs[0];
+};
+
+/* daemon -> qemu */
+#define UMEM_DAEMON_READY               'R'
+#define UMEM_DAEMON_QUIT                'Q'
+#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
+#define UMEM_DAEMON_ERROR               'E'
+
+/* qemu -> daemon */
+#define UMEM_QEMU_READY                 'r'
+#define UMEM_QEMU_QUIT                  'q'
+#define UMEM_QEMU_PAGE_FAULTED          't'
+#define UMEM_QEMU_PAGE_UNMAPPED         'u'
+
+struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
+size_t umem_pages_size(uint64_t nr);
+
+/* for umem daemon */
+void umem_daemon_ready(int to_qemu_fd);
+void umem_daemon_wait_for_qemu(int from_qemu_fd);
+void umem_daemon_quit(QEMUFile *to_qemu);
+void umem_daemon_send_pages_present(QEMUFile *to_qemu,
+                                    struct umem_pages *pages);
+
+/* for qemu */
+void umem_qemu_wait_for_daemon(int from_umemd_fd);
+void umem_qemu_ready(int to_umemd_fd);
+void umem_qemu_quit(QEMUFile *to_umemd);
+struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
+                                                int *offset);
+void umem_qemu_send_pages_present(QEMUFile *to_umemd,
+                                  const struct umem_pages *pages);
+void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
+                                   const struct umem_pages *pages);
+
+#endif /* QEMU_UMEM_H */
diff --git a/vl.c b/vl.c
index 5430b8c..17427a0 100644
--- a/vl.c
+++ b/vl.c
@@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
     default_drive(default_sdcard, snapshot, machine->use_scsi,
                   IF_SD, 0, SD_OPTS);
 
-    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
-                         ram_save_live, NULL, ram_load, NULL);
+    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
+        exit(1);
+    }
+    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
+                         ram_save_set_params, ram_save_live, NULL,
+                         ram_load, NULL);
 
     if (nb_numa_nodes > 0) {
         int i;
@@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
 
     if (incoming) {
         runstate_set(RUN_STATE_INMIGRATE);
+        if (incoming_postcopy) {
+            postcopy_incoming_prepare();
+        }
         int ret = qemu_start_incoming_migration(incoming);
         if (ret < 0) {
             fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
@@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
     bdrv_close_all();
     pause_all_vcpus();
     net_cleanup();
+    if (incoming_postcopy) {
+        postcopy_incoming_qemu_cleanup();
+    }
     res_free();
 
     return 0;
-- 
1.7.1.1


* Re: [Qemu-devel] [PATCH 21/21] postcopy: implement postcopy livemigration
  2011-12-29  1:26   ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29 15:51     ` Orit Wasserman
  -1 siblings, 0 replies; 88+ messages in thread
From: Orit Wasserman @ 2011-12-29 15:51 UTC (permalink / raw)
  To: Isaku Yamahata, satoshi.itoh; +Cc: kvm, qemu-devel, t.hirofuchi

Hi,
A general comment: this patch is a bit too long, which makes it hard to review.
Can you split it please?

On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This patch implements postcopy livemigration.
> 
> Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
> ---
>  Makefile.target           |    4 +
>  arch_init.c               |   26 +-
>  cpu-all.h                 |    7 +
>  exec.c                    |   20 +-
>  migration-exec.c          |    8 +
>  migration-fd.c            |   30 +
>  migration-postcopy-stub.c |   77 ++
>  migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
>  migration-tcp.c           |   37 +-
>  migration-unix.c          |   32 +-
>  migration.c               |   31 +
>  migration.h               |   30 +
>  qemu-common.h             |    1 +
>  qemu-options.hx           |    5 +-
>  umem.c                    |  379 +++++++++
>  umem.h                    |  105 +++
>  vl.c                      |   14 +-
>  17 files changed, 2677 insertions(+), 20 deletions(-)
>  create mode 100644 migration-postcopy-stub.c
>  create mode 100644 migration-postcopy.c
>  create mode 100644 umem.c
>  create mode 100644 umem.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index 3261383..d94c53f 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
>  CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
>  CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
>  CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
> +CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
>  
>  include ../config-host.mak
>  include config-devices.mak
> @@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
>  obj-y += memory.o
>  LIBS+=-lz
>  
> +common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
> +common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
> +
>  QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
>  QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
>  QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
> diff --git a/arch_init.c b/arch_init.c
> index bc53092..8b3130d 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
>      return 1;
>  }
>  
> +static bool outgoing_postcopy = false;
> +
> +void ram_save_set_params(const MigrationParams *params, void *opaque)
> +{
> +    outgoing_postcopy = params->postcopy;
> +}
> +
>  static RAMBlock *last_block_sent = NULL;
>  
>  int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
> @@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      uint64_t expected_time = 0;
>      int ret;
>  
> +    if (stage == 1) {
> +        last_block_sent = NULL;
> +
> +        bytes_transferred = 0;
> +        last_block = NULL;
> +        last_offset = 0;

This changes the line order and adds a new empty line.

> +    }
> +    if (outgoing_postcopy) {
> +        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
> +    }
> +

I would just do:

unregister_savevm_live() and then register_savevm_live(..., postcopy_outgoing_ram_save_live, ...)
when starting outgoing postcopy migration.

>      if (stage < 0) {
>          cpu_physical_memory_set_dirty_tracking(0);
>          return 0;
> @@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      }
>  
>      if (stage == 1) {
> -        bytes_transferred = 0;
> -        last_block_sent = NULL;
> -        last_block = NULL;
> -        last_offset = 0;
>          sort_ram_list();
>  
>          /* Make sure all dirty bits are set */
> @@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>      int flags;
>      int error;
>  
> +    if (incoming_postcopy) {
> +        return postcopy_incoming_ram_load(f, opaque, version_id);
> +    }
> +
Why not call register_savevm_live(..., postcopy_incoming_ram_load, ...) when starting the guest with incoming postcopy?

>      if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
>          return -EINVAL;
>      }
> diff --git a/cpu-all.h b/cpu-all.h
> index 0244f7a..2e9d8a7 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
>  /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
>  #define RAM_PREALLOC_MASK   (1 << 0)
>  
> +/* RAM is allocated via umem for postcopy incoming mode */
> +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> +
>  typedef struct RAMBlock {
>      uint8_t *host;
>      ram_addr_t offset;
> @@ -485,6 +488,10 @@ typedef struct RAMBlock {
>  #if defined(__linux__) && !defined(TARGET_S390X)
>      int fd;
>  #endif
> +
> +#ifdef CONFIG_POSTCOPY
> +    UMem *umem;    /* for incoming postcopy mode */
> +#endif
>  } RAMBlock;
>  
>  typedef struct RAMList {
> diff --git a/exec.c b/exec.c
> index c8c6692..90b0491 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -35,6 +35,7 @@
>  #include "qemu-timer.h"
>  #include "memory.h"
>  #include "exec-memory.h"
> +#include "migration.h"
>  #if defined(CONFIG_USER_ONLY)
>  #include <qemu.h>
>  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> @@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
>          new_block->host = host;
>          new_block->flags |= RAM_PREALLOC_MASK;
>      } else {
> +#ifdef CONFIG_POSTCOPY
> +        if (incoming_postcopy) {
> +            postcopy_incoming_ram_alloc(name, size,
> +                                        &new_block->host, &new_block->umem);
> +            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
> +        } else
> +#endif
>          if (mem_path) {
>  #if defined (__linux__) && !defined(TARGET_S390X)
>              new_block->host = file_ram_alloc(new_block, size, mem_path);
> @@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
>              QLIST_REMOVE(block, next);
>              if (block->flags & RAM_PREALLOC_MASK) {
>                  ;
> -            } else if (mem_path) {
> +            }
> +#ifdef CONFIG_POSTCOPY
> +            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> +                postcopy_incoming_ram_free(block->umem);
> +            }
> +#endif
> +            else if (mem_path) {
>  #if defined (__linux__) && !defined(TARGET_S390X)
>                  if (block->fd) {
>                      munmap(block->host, block->length);
> @@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>              } else {
>                  flags = MAP_FIXED;
>                  munmap(vaddr, length);
> +                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> +                    postcopy_incoming_qemu_pages_unmapped(addr, length);
> +                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> +                }
>                  if (mem_path) {
>  #if defined(__linux__) && !defined(TARGET_S390X)
>                      if (block->fd) {
> diff --git a/migration-exec.c b/migration-exec.c
> index e14552e..2bd0c3b 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
>  {
>      FILE *f;
>  
> +    if (s->params.postcopy) {
> +        return -ENOSYS;
> +    }
> +
>      f = popen(command, "w");
>      if (f == NULL) {
>          DPRINTF("Unable to popen exec target\n");
> @@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
>  {
>      QEMUFile *f;
>  
> +    if (incoming_postcopy) {
> +        return -ENOSYS;
> +    }
> +
>      DPRINTF("Attempting to start an incoming migration\n");
>      f = qemu_popen_cmd(command, "r");
>      if(f == NULL) {
> diff --git a/migration-fd.c b/migration-fd.c
> index 6211124..5a62ab9 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
>      s->write = fd_write;
>      s->close = fd_close;
>  
> +    if (s->params.postcopy) {
> +        int flags = fcntl(s->fd, F_GETFL);
> +        if ((flags & O_ACCMODE) != O_RDWR) {
> +            goto err_after_open;
> +        }
> +
> +        s->fd_read = dup(s->fd);
> +        if (s->fd_read == -1) {
> +            goto err_after_open;
> +        }
> +        s->file_read = qemu_fdopen(s->fd_read, "r");
> +        if (s->file_read == NULL) {
> +            close(s->fd_read);
> +            goto err_after_open;
> +        }
> +    }
> +
>      migrate_fd_connect(s);
>      return 0;
>  
> @@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
>  
>      process_incoming_migration(f);
>      qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_ready();
> +    }
> +    return;
>  }
>  
>  int fd_start_incoming_migration(const char *infd)
> @@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
>      DPRINTF("Attempting to start an incoming migration via fd\n");
>  
>      fd = strtol(infd, NULL, 0);
> +    if (incoming_postcopy) {
> +        int flags = fcntl(fd, F_GETFL);
> +        if ((flags & O_ACCMODE) != O_RDWR) {
> +            return -EINVAL;
> +        }
> +    }
>      f = qemu_fdopen(fd, "rb");
>      if(f == NULL) {
>          DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
> diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
> new file mode 100644
> index 0000000..0b78de7
> --- /dev/null
> +++ b/migration-postcopy-stub.c
> @@ -0,0 +1,77 @@
> +/*
> + * migration-postcopy-stub.c: postcopy live migration
> + *                            stub functions for unsupported hosts
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "sysemu.h"
> +#include "migration.h"
> +
> +int postcopy_outgoing_create_read_socket(MigrationState *s)
> +{
> +    return -ENOSYS;
> +}
> +
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque)
> +{
> +    return -ENOSYS;
> +}
> +
> +void *postcopy_outgoing_begin(MigrationState *ms)
> +{
> +    return NULL;
> +}
> +
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy)
> +{
> +    return -ENOSYS;
> +}
> +
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> +{
> +    return -ENOSYS;
> +}
> +
> +void postcopy_incoming_prepare(void)
> +{
> +}
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    return -ENOSYS;
> +}
> +
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> +{
> +}
> +
> +void postcopy_incoming_qemu_ready(void)
> +{
> +}
> +
> +void postcopy_incoming_qemu_cleanup(void)
> +{
> +}
> +
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> +{
> +}
> diff --git a/migration-postcopy.c b/migration-postcopy.c
> new file mode 100644
> index 0000000..ed0d574
> --- /dev/null
> +++ b/migration-postcopy.c
> @@ -0,0 +1,1891 @@
> +/*
> + * migration-postcopy.c: postcopy live migration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "bitmap.h"
> +#include "sysemu.h"
> +#include "hw/hw.h"
> +#include "arch_init.h"
> +#include "migration.h"
> +#include "umem.h"
> +
> +#include "memory.h"
> +#define WANT_EXEC_OBSOLETE
> +#include "exec-obsolete.h"
> +
> +//#define DEBUG_POSTCOPY
> +#ifdef DEBUG_POSTCOPY
> +#include <sys/syscall.h>
> +#define DPRINTF(fmt, ...)                                               \
> +    do {                                                                \
> +        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
> +               __func__, __LINE__, ## __VA_ARGS__);                     \
> +    } while (0)
> +#else
> +#define DPRINTF(fmt, ...)       do { } while (0)
> +#endif
> +
> +#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
> +
> +static void fd_close(int *fd)
> +{
> +    if (*fd >= 0) {
> +        close(*fd);
> +        *fd = -1;
> +    }
> +}
> +
> +/***************************************************************************
> + * QEMUFile for non blocking pipe
> + */
> +
> +/* read only */
> +struct QEMUFilePipe {
> +    int fd;
> +    QEMUFile *file;
> +};

Why not reuse QEMUFileSocket here instead of adding a pipe-specific wrapper?

> +typedef struct QEMUFilePipe QEMUFilePipe;
> +
> +static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFilePipe *s = opaque;
> +    ssize_t len = 0;
> +
> +    while (size > 0) {
> +        ssize_t ret = read(s->fd, buf, size);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +            if (len == 0) {
> +                len = -errno;
> +            }
> +            break;
> +        }
> +
> +        if (ret == 0) {
> +            /* the write end of the pipe is closed */
> +            break;
> +        }
> +        len += ret;
> +        buf += ret;
> +        size -= ret;
> +    }
> +
> +    return len;
> +}
> +
> +static int pipe_close(void *opaque)
> +{
> +    QEMUFilePipe *s = opaque;
> +    g_free(s);
> +    return 0;
> +}
> +
> +static QEMUFile *qemu_fopen_pipe(int fd)
> +{
> +    QEMUFilePipe *s = g_malloc0(sizeof(*s));
> +
> +    s->fd = fd;
> +    fcntl_setfl(fd, O_NONBLOCK);
> +    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
> +                             NULL, NULL, NULL);
> +    return s->file;
> +}
> +
> +/* write only */
> +struct QEMUFileNonblock {
> +    int fd;
> +    QEMUFile *file;
> +
> +    /* for pipe-write nonblocking mode */
> +#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
> +    uint8_t *buffer;
> +    size_t buffer_size;
> +    size_t buffer_capacity;
> +    bool freeze_output;
> +};
> +typedef struct QEMUFileNonblock QEMUFileNonblock;
> +

Couldn't you reuse QEMUFileBuffered instead of introducing QEMUFileNonblock?

> +static void nonblock_flush_buffer(QEMUFileNonblock *s)
> +{
> +    size_t offset = 0;
> +    ssize_t ret;
> +
> +    while (offset < s->buffer_size) {
> +        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EAGAIN) {
> +                s->freeze_output = true;
> +            } else {
> +                qemu_file_set_error(s->file, errno);
> +            }
> +            break;
> +        }
> +
> +        if (ret == 0) {
> +            DPRINTF("ret == 0\n");
> +            break;
> +        }
> +
> +        offset += ret;
> +    }
> +
> +    if (offset > 0) {
> +        assert(s->buffer_size >= offset);
> +        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
> +        s->buffer_size -= offset;
> +    }
> +    if (s->buffer_size > 0) {
> +        s->freeze_output = true;
> +    }
> +}
> +
> +static int nonblock_put_buffer(void *opaque,
> +                               const uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFileNonblock *s = opaque;
> +    int error;
> +    ssize_t len = 0;
> +
> +    error = qemu_file_get_error(s->file);
> +    if (error) {
> +        return error;
> +    }
> +
> +    nonblock_flush_buffer(s);
> +    error = qemu_file_get_error(s->file);
> +    if (error) {
> +        return error;
> +    }
> +
> +    while (!s->freeze_output && size > 0) {
> +        ssize_t ret;
> +        assert(s->buffer_size == 0);
> +
> +        ret = write(s->fd, buf, size);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EAGAIN) {
> +                s->freeze_output = true;
> +            } else {
> +                qemu_file_set_error(s->file, errno);
> +            }
> +            break;
> +        }
> +
> +        len += ret;
> +        buf += ret;
> +        size -= ret;
> +    }
> +
> +    if (size > 0) {
> +        int inc = size - (s->buffer_capacity - s->buffer_size);
> +        if (inc > 0) {
> +            s->buffer_capacity +=
> +                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
> +            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
> +        }
> +        memcpy(s->buffer + s->buffer_size, buf, size);
> +        s->buffer_size += size;
> +
> +        len += size;
> +    }
> +
> +    return len;
> +}
> +
> +static int nonblock_pending_size(QEMUFileNonblock *s)
> +{
> +    return qemu_pending_size(s->file) + s->buffer_size;
> +}
> +
> +static void nonblock_fflush(QEMUFileNonblock *s)
> +{
> +    s->freeze_output = false;
> +    nonblock_flush_buffer(s);
> +    if (!s->freeze_output) {
> +        qemu_fflush(s->file);
> +    }
> +}
> +
> +static void nonblock_wait_for_flush(QEMUFileNonblock *s)
> +{
> +    while (nonblock_pending_size(s) > 0) {
> +        fd_set fds;
> +        FD_ZERO(&fds);
> +        FD_SET(s->fd, &fds);
> +        select(s->fd + 1, NULL, &fds, NULL, NULL);
> +
> +        nonblock_fflush(s);
> +    }
> +}
> +
> +static int nonblock_close(void *opaque)
> +{
> +    QEMUFileNonblock *s = opaque;
> +    nonblock_wait_for_flush(s);
> +    g_free(s->buffer);
> +    g_free(s);
> +    return 0;
> +}
> +
> +static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
> +{
> +    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
> +
> +    s->fd = fd;
> +    fcntl_setfl(fd, O_NONBLOCK);
> +    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
> +                             NULL, NULL, NULL);
> +    return s;
> +}
> +
> +/***************************************************************************
> + * umem daemon on destination <-> qemu on source protocol
> + */
> +
> +#define QEMU_UMEM_REQ_INIT              0x00
> +#define QEMU_UMEM_REQ_ON_DEMAND         0x01
> +#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
> +#define QEMU_UMEM_REQ_BACKGROUND        0x03
> +#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
> +#define QEMU_UMEM_REQ_REMOVE            0x05
> +#define QEMU_UMEM_REQ_EOC               0x06
> +
> +struct qemu_umem_req {
> +    int8_t cmd;
> +    uint8_t len;
> +    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
> +    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
> +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> +
> +    /* in units of target page size, as in the qemu migration protocol */
> +    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
> +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> +};
> +
> +static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
> +{
> +    qemu_put_byte(f, strlen(idstr));
> +    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
> +}
> +
> +static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
> +                                              const uint64_t *pgoffs)
> +{
> +    uint32_t i;
> +
> +    qemu_put_be32(f, nr);
> +    for (i = 0; i < nr; i++) {
> +        qemu_put_be64(f, pgoffs[i]);
> +    }
> +}
> +
> +static void postcopy_incoming_send_req_one(QEMUFile *f,
> +                                           const struct qemu_umem_req *req)
> +{
> +    DPRINTF("cmd %d\n", req->cmd);
> +    qemu_put_byte(f, req->cmd);
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +    case QEMU_UMEM_REQ_REMOVE:
> +        postcopy_incoming_send_req_idstr(f, req->idstr);
> +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +}
> +
> +/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024.
> + * So one message size must be <= IO_BUF_SIZE
> + * cmd: 1
> + * id len: 1
> + * id: 256
> + * nr: 2
> + */
> +#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 2) / sizeof(uint64_t))
> +static void postcopy_incoming_send_req(QEMUFile *f,
> +                                       const struct qemu_umem_req *req)
> +{
> +    uint32_t nr = req->nr;
> +    struct qemu_umem_req tmp = *req;
> +
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        postcopy_incoming_send_req_one(f, &tmp);
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +        tmp.nr = MIN(nr, MAX_PAGE_NR);
> +        postcopy_incoming_send_req_one(f, &tmp);
> +
> +        nr -= tmp.nr;
> +        tmp.pgoffs += tmp.nr;
> +        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
> +            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> +        } else {
> +            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
> +        }
> +        /* fall through */
> +    case QEMU_UMEM_REQ_REMOVE:
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        while (nr > 0) {
> +            tmp.nr = MIN(nr, MAX_PAGE_NR);
> +            postcopy_incoming_send_req_one(f, &tmp);
> +
> +            nr -= tmp.nr;
> +            tmp.pgoffs += tmp.nr;
> +        }
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +}
> +
> +static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
> +                                            struct qemu_umem_req *req,
> +                                            size_t *offset)
> +{
> +    int ret;
> +
> +    req->len = qemu_peek_byte(f, *offset);
> +    *offset += 1;
> +    if (req->len == 0) {
> +        return -EAGAIN;
> +    }
> +    req->idstr = g_malloc((int)req->len + 1);
> +    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
> +    *offset += ret;
> +    if (ret != req->len) {
> +        g_free(req->idstr);
> +        req->idstr = NULL;
> +        return -EAGAIN;
> +    }
> +    req->idstr[req->len] = 0;
> +    return 0;
> +}
> +
> +static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
> +                                             struct qemu_umem_req *req,
> +                                             size_t *offset)
> +{
> +    int ret;
> +    uint32_t be32;
> +    uint32_t i;
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
> +    *offset += sizeof(be32);
> +    if (ret != sizeof(be32)) {
> +        return -EAGAIN;
> +    }
> +
> +    req->nr = be32_to_cpu(be32);
> +    req->pgoffs = g_new(uint64_t, req->nr);
> +    for (i = 0; i < req->nr; i++) {
> +        uint64_t be64;
> +        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
> +        *offset += sizeof(be64);
> +        if (ret != sizeof(be64)) {
> +            g_free(req->pgoffs);
> +            req->pgoffs = NULL;
> +            return -EAGAIN;
> +        }
> +        req->pgoffs[i] = be64_to_cpu(be64);
> +    }
> +    return 0;
> +}
> +
> +static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
> +{
> +    int size;
> +    int ret;
> +    size_t offset = 0;
> +
> +    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
> +    if (size <= 0) {
> +        return -EAGAIN;
> +    }
> +    offset += 1;
> +
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +    case QEMU_UMEM_REQ_REMOVE:
> +        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +    qemu_file_skip(f, offset);
> +    DPRINTF("cmd %d\n", req->cmd);
> +    return 0;
> +}
> +
> +static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
> +{
> +    g_free(req->idstr);
> +    g_free(req->pgoffs);
> +}
> +
> +/***************************************************************************
> + * outgoing part
> + */
> +
> +#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
> +#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
> +#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
> +
> +enum POState {
> +    PO_STATE_ERROR_RECEIVE,
> +    PO_STATE_ACTIVE,
> +    PO_STATE_EOC_RECEIVED,
> +    PO_STATE_ALL_PAGES_SENT,
> +    PO_STATE_COMPLETED,
> +};
> +typedef enum POState POState;
> +
> +struct PostcopyOutgoingState {
> +    POState state;
> +    QEMUFile *mig_read;
> +    int fd_read;
> +    RAMBlock *last_block_read;
> +
> +    QEMUFile *mig_buffered_write;
> +    MigrationState *ms;
> +
> +    /* For nobg mode. Check if all pages are sent */
> +    RAMBlock *block;
> +    ram_addr_t addr;
> +};
> +typedef struct PostcopyOutgoingState PostcopyOutgoingState;
> +
> +int postcopy_outgoing_create_read_socket(MigrationState *s)
> +{
> +    if (!s->params.postcopy) {
> +        return 0;
> +    }
> +
> +    s->fd_read = dup(s->fd);
> +    if (s->fd_read == -1) {
> +        int ret = -errno;
> +        perror("dup");
> +        return ret;
> +    }
> +    s->file_read = qemu_fopen_socket(s->fd_read);
> +    if (s->file_read == NULL) {
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque)
> +{
> +    int ret = 0;
> +    DPRINTF("stage %d\n", stage);
> +    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
> +        sort_ram_list();
> +        ram_save_live_mem_size(f);
> +    }
> +    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
> +        ret = 1;
> +    }
> +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    return ret;
> +}
> +
> +static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
> +{
> +    RAMBlock *block;
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (!strncmp(idstr, block->idstr, strlen(idstr))) {
> +            return block;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * return value
> + *   0: continue postcopy mode
> + * > 0: completed postcopy mode.
> + * < 0: error
> + */
> +static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
> +                                        const struct qemu_umem_req *req,
> +                                        bool *written)
> +{
> +    int i;
> +    RAMBlock *block;
> +
> +    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_EOC:
> +        /* tell to finish migration. */
> +        if (s->state == PO_STATE_ALL_PAGES_SENT) {
> +            s->state = PO_STATE_COMPLETED;
> +            DPRINTF("-> PO_STATE_COMPLETED\n");
> +        } else {
> +            s->state = PO_STATE_EOC_RECEIVED;
> +            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
> +        }
> +        return 1;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +        DPRINTF("idstr: %s\n", req->idstr);
> +        block = postcopy_outgoing_find_block(req->idstr);
> +        if (block == NULL) {
> +            return -EINVAL;
> +        }
> +        s->last_block_read = block;
> +        /* fall through */
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        DPRINTF("nr %d\n", req->nr);
> +        for (i = 0; i < req->nr; i++) {
> +            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
> +            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
> +                                    req->pgoffs[i] << TARGET_PAGE_BITS);
> +            if (ret > 0) {
> +                *written = true;
> +            }
> +        }
> +        break;
> +    case QEMU_UMEM_REQ_REMOVE:
> +        block = postcopy_outgoing_find_block(req->idstr);
> +        if (block == NULL) {
> +            return -EINVAL;
> +        }
> +        for (i = 0; i < req->nr; i++) {
> +            ram_addr_t addr = block->offset +
> +                (req->pgoffs[i] << TARGET_PAGE_BITS);
> +            cpu_physical_memory_reset_dirty(addr,
> +                                            addr + TARGET_PAGE_SIZE,
> +                                            MIGRATION_DIRTY_FLAG);
> +        }
> +        break;
> +    default:
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
> +{
> +    if (s->mig_read != NULL) {
> +        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
> +        qemu_fclose(s->mig_read);
> +        s->mig_read = NULL;
> +        fd_close(&s->fd_read);
> +
> +        s->ms->file_read = NULL;
> +        s->ms->fd_read = -1;
> +    }
> +}
> +
> +static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
> +{
> +    postcopy_outgoing_close_mig_read(s);
> +    s->ms->postcopy = NULL;
> +    g_free(s);
> +}
> +
> +static void postcopy_outgoing_recv_handler(void *opaque)
> +{
> +    PostcopyOutgoingState *s = opaque;
> +    bool written = false;
> +    int ret = 0;
> +
> +    assert(s->state == PO_STATE_ACTIVE ||
> +           s->state == PO_STATE_ALL_PAGES_SENT);
> +
> +    do {
> +        struct qemu_umem_req req = {.idstr = NULL,
> +                                    .pgoffs = NULL};
> +
> +        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
> +        if (ret < 0) {
> +            if (ret == -EAGAIN) {
> +                ret = 0;
> +            }
> +            break;
> +        }
> +        if (s->state == PO_STATE_ACTIVE) {
> +            ret = postcopy_outgoing_handle_req(s, &req, &written);
> +        }
> +        postcopy_outgoing_free_req(&req);
> +    } while (ret == 0);
> +
> +    /*
> +     * Flush the buffered file.
> +     * Although mig_buffered_write is a rate-limited buffered file, the
> +     * pages written here were requested on demand by the destination,
> +     * so forcibly push them out, ignoring rate limiting.
> +     */
> +    if (written) {
> +        qemu_fflush(s->mig_buffered_write);
> +        /* qemu_buffered_file_drain(s->mig_buffered_write); */
> +    }
> +
> +    if (ret < 0) {
> +        switch (s->state) {
> +        case PO_STATE_ACTIVE:
> +            s->state = PO_STATE_ERROR_RECEIVE;
> +            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
> +            break;
> +        case PO_STATE_ALL_PAGES_SENT:
> +            s->state = PO_STATE_COMPLETED;
> +            DPRINTF("-> PO_STATE_COMPLETED\n");
> +            break;
> +        default:
> +            abort();
> +        }
> +    }
> +    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
> +        postcopy_outgoing_close_mig_read(s);
> +    }
> +    if (s->state == PO_STATE_COMPLETED) {
> +        DPRINTF("PO_STATE_COMPLETED\n");
> +        MigrationState *ms = s->ms;
> +        postcopy_outgoing_completed(s);
> +        migrate_fd_completed(ms);
> +    }
> +}
> +
> +void *postcopy_outgoing_begin(MigrationState *ms)
> +{
> +    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
> +    DPRINTF("outgoing begin\n");
> +    qemu_fflush(ms->file);
> +
> +    s->ms = ms;
> +    s->state = PO_STATE_ACTIVE;
> +    s->fd_read = ms->fd_read;
> +    s->mig_read = ms->file_read;
> +    s->mig_buffered_write = ms->file;
> +    s->block = NULL;
> +    s->addr = 0;
> +
> +    /* Make sure all dirty bits are set */
> +    ram_save_memory_set_dirty();
> +
> +    qemu_set_fd_handler(s->fd_read,
> +                        &postcopy_outgoing_recv_handler, NULL, s);
> +    return s;
> +}
> +
> +static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
> +                                           PostcopyOutgoingState *s)
> +{
> +    assert(s->state == PO_STATE_ACTIVE);
> +
> +    s->state = PO_STATE_ALL_PAGES_SENT;
> +    /* tell incoming side that all pages are sent */
> +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    qemu_fflush(f);
> +    qemu_buffered_file_drain(f);
> +    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
> +    migrate_fd_cleanup(s->ms);
> +
> +    /* migrate_fd_completed() will be called later, which calls
> +     * migrate_fd_cleanup() again. So a dummy file is created to keep
> +     * the qemu monitor working.
> +     */
> +    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
> +                                 NULL, NULL);
> +}
> +
> +static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
> +                                                RAMBlock *block,
> +                                                ram_addr_t addr)
> +{
> +    if (block == NULL) {
> +        block = QLIST_FIRST(&ram_list.blocks);
> +        addr = block->offset;
> +    }
> +
> +    for (; block != NULL;
> +         block = QLIST_NEXT(block, next),
> +             addr = block ? block->offset : 0) {
> +        for (; addr < block->offset + block->length;
> +             addr += TARGET_PAGE_SIZE) {
> +            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
> +                s->block = block;
> +                s->addr = addr;
> +                return 0;
> +            }
> +        }
> +    }
> +
> +    return 1;
> +}
> +
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy)
> +{
> +    PostcopyOutgoingState *s = postcopy;
> +
> +    assert(s->state == PO_STATE_ACTIVE ||
> +           s->state == PO_STATE_EOC_RECEIVED ||
> +           s->state == PO_STATE_ERROR_RECEIVE);
> +
> +    switch (s->state) {
> +    case PO_STATE_ACTIVE:
> +        /* nothing. processed below */
> +        break;
> +    case PO_STATE_EOC_RECEIVED:
> +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +        s->state = PO_STATE_COMPLETED;
> +        postcopy_outgoing_completed(s);
> +        DPRINTF("PO_STATE_COMPLETED\n");
> +        return 1;
> +    case PO_STATE_ERROR_RECEIVE:
> +        postcopy_outgoing_completed(s);
> +        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
> +        return -1;
> +    default:
> +        abort();
> +    }
> +
> +    if (s->ms->params.nobg) {
> +        /* See if all pages are sent. */
> +        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
> +            return 0;
> +        }
> +        /* ram_list can be reordered (though it doesn't seem to happen
> +           during migration), so the whole list needs to be checked again */
> +        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
> +            return 0;
> +        }
> +
> +        postcopy_outgoing_ram_all_sent(f, s);
> +        return 0;
> +    }
> +
> +    DPRINTF("outgoing background state: %d\n", s->state);
> +
> +    while (qemu_file_rate_limit(f) == 0) {
> +        if (ram_save_block(f) == 0) { /* no more blocks */
> +            assert(s->state == PO_STATE_ACTIVE);
> +            postcopy_outgoing_ram_all_sent(f, s);
> +            return 0;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/***************************************************************************
> + * incoming part
> + */
> +
> +/* flags for incoming mode that modify the behavior.
> +   These are for benchmark/debug purposes */
> +#define INCOMING_FLAGS_FAULT_REQUEST 0x01
> +
> +
> +static void postcopy_incoming_umemd(void);
> +
> +#define PIS_STATE_QUIT_RECEIVED         0x01
> +#define PIS_STATE_QUIT_QUEUED           0x02
> +#define PIS_STATE_QUIT_SENT             0x04
> +
> +#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
> +                                         PIS_STATE_QUIT_QUEUED | \
> +                                         PIS_STATE_QUIT_SENT)
> +
> +struct PostcopyIncomingState {
> +    /* dest qemu state */
> +    uint32_t    state;
> +
> +    UMemDev *dev;
> +    int host_page_size;
> +    int host_page_shift;
> +
> +    /* qemu side */
> +    int to_umemd_fd;
> +    QEMUFileNonblock *to_umemd;
> +#define MAX_FAULTED_PAGES       256
> +    struct umem_pages *faulted_pages;
> +
> +    int from_umemd_fd;
> +    QEMUFile *from_umemd;
> +    int version_id;     /* save/load format version id */
> +};
> +typedef struct PostcopyIncomingState PostcopyIncomingState;
> +
> +
> +#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
> +#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
> +#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
> +#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
> +#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
> +
> +#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
> +                                         UMEM_STATE_QUIT_SENT | \
> +                                         UMEM_STATE_QUIT_RECEIVED)
> +#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
> +                                         UMEM_STATE_EOC_SENT | \
> +                                         UMEM_STATE_QUIT_MASK)
> +
> +struct PostcopyIncomingUMemDaemon {
> +    /* umem daemon side */
> +    uint32_t state;
> +
> +    int host_page_size;
> +    int host_page_shift;
> +    int nr_host_pages_per_target_page;
> +    int host_to_target_page_shift;
> +    int nr_target_pages_per_host_page;
> +    int target_to_host_page_shift;
> +    int version_id;     /* save/load format version id */
> +
> +    int to_qemu_fd;
> +    QEMUFileNonblock *to_qemu;
> +    int from_qemu_fd;
> +    QEMUFile *from_qemu;
> +
> +    int mig_read_fd;
> +    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
> +
> +    int mig_write_fd;
> +    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
> +
> +    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPU + 1) */
> +#define MAX_REQUESTS    (512 * (64 + 1))
> +
> +    struct umem_page_request page_request;
> +    struct umem_page_cached page_cached;
> +
> +#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
> +    struct umem_pages *present_request;
> +
> +    uint64_t *target_pgoffs;
> +
> +    /* bitmap indexed by target page offset */
> +    unsigned long *phys_requested;
> +
> +    /* bitmap indexed by target page offset */
> +    unsigned long *phys_received;
> +
> +    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
> +    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
> +};
> +typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
> +
> +static PostcopyIncomingState state = {
> +    .state = 0,
> +    .dev = NULL,
> +    .to_umemd_fd = -1,
> +    .to_umemd = NULL,
> +    .from_umemd_fd = -1,
> +    .from_umemd = NULL,
> +};
> +
> +static PostcopyIncomingUMemDaemon umemd = {
> +    .state = 0,
> +    .to_qemu_fd = -1,
> +    .to_qemu = NULL,
> +    .from_qemu_fd = -1,
> +    .from_qemu = NULL,
> +    .mig_read_fd = -1,
> +    .mig_read = NULL,
> +    .mig_write_fd = -1,
> +    .mig_write = NULL,
> +};
> +
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> +{
> +    /* incoming_postcopy only makes sense in incoming migration mode */
> +    if (!incoming && incoming_postcopy) {
> +        return -EINVAL;
> +    }
> +
> +    if (!incoming_postcopy) {
> +        return 0;
> +    }
> +
> +    state.state = 0;
> +    state.dev = umem_dev_new();
> +    state.host_page_size = getpagesize();
> +    state.host_page_shift = ffs(state.host_page_size) - 1;
> +    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
> +                                               ram_save_live() */
> +    return 0;
> +}
> +
> +void postcopy_incoming_ram_alloc(const char *name,
> +                                 size_t size, uint8_t **hostp, UMem **umemp)
> +{
> +    UMem *umem;
> +    size = ALIGN_UP(size, state.host_page_size);
> +    umem = umem_dev_create(state.dev, size, name);
> +
> +    *umemp = umem;
> +    *hostp = umem->umem;
> +}
> +
> +void postcopy_incoming_ram_free(UMem *umem)
> +{
> +    umem_unmap(umem);
> +    umem_close(umem);
> +    umem_destroy(umem);
> +}
> +
> +void postcopy_incoming_prepare(void)
> +{
> +    RAMBlock *block;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (block->umem != NULL) {
> +            umem_mmap(block->umem);
> +        }
> +    }
> +}
> +
> +static int postcopy_incoming_ram_load_get64(QEMUFile *f,
> +                                             ram_addr_t *addr, int *flags)
> +{
> +    *addr = qemu_get_be64(f);
> +    *flags = *addr & ~TARGET_PAGE_MASK;
> +    *addr &= TARGET_PAGE_MASK;
> +    return qemu_file_get_error(f);
> +}
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    ram_addr_t addr;
> +    int flags;
> +    int error;
> +
> +    DPRINTF("incoming ram load\n");
> +    /*
> +     * RAM_SAVE_FLAGS_EOS or
> +     * RAM_SAVE_FLAGS_MEM_SIZE + mem size + RAM_SAVE_FLAGS_EOS
> +     * see postcopy_outgoing_ram_save_live()
> +     */
> +
> +    if (version_id != RAM_SAVE_VERSION_ID) {
> +        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
> +                version_id, RAM_SAVE_VERSION_ID);
> +        return -EINVAL;
> +    }
> +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
> +    if (error) {
> +        DPRINTF("error %d\n", error);
> +        return error;
> +    }
> +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +        DPRINTF("EOS\n");
> +        return 0;
> +    }
> +
> +    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
> +        DPRINTF("-EINVAL flags 0x%x\n", flags);
> +        return -EINVAL;
> +    }
> +    error = ram_load_mem_size(f, addr);
> +    if (error) {
> +        DPRINTF("addr 0x%lx error %d\n", addr, error);
> +        return error;
> +    }
> +
> +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +    if (error) {
> +        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
> +        return error;
> +    }
> +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +        DPRINTF("done\n");
> +        return 0;
> +    }
> +    DPRINTF("-EINVAL\n");
> +    return -EINVAL;
> +}
> +
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> +{
> +    int fds[2];
> +    RAMBlock *block;
> +
> +    DPRINTF("fork\n");
> +
> +    /* socketpair(AF_UNIX)? */
> +
> +    if (qemu_pipe(fds) == -1) {
> +        perror("qemu_pipe");
> +        abort();
> +    }
> +    state.from_umemd_fd = fds[0];
> +    umemd.to_qemu_fd = fds[1];
> +
> +    if (qemu_pipe(fds) == -1) {
> +        perror("qemu_pipe");
> +        abort();
> +    }
> +    umemd.from_qemu_fd = fds[0];
> +    state.to_umemd_fd = fds[1];
> +
> +    pid_t child = fork();
> +    if (child < 0) {
> +        perror("fork");
> +        abort();
> +    }
> +
> +    if (child == 0) {
> +        int mig_write_fd;
> +
> +        fd_close(&state.to_umemd_fd);
> +        fd_close(&state.from_umemd_fd);
> +        umemd.host_page_size = state.host_page_size;
> +        umemd.host_page_shift = state.host_page_shift;
> +
> +        umemd.nr_host_pages_per_target_page =
> +            TARGET_PAGE_SIZE / umemd.host_page_size;
> +        umemd.nr_target_pages_per_host_page =
> +            umemd.host_page_size / TARGET_PAGE_SIZE;
> +
> +        umemd.target_to_host_page_shift =
> +            ffs(umemd.nr_host_pages_per_target_page) - 1;
> +        umemd.host_to_target_page_shift =
> +            ffs(umemd.nr_target_pages_per_host_page) - 1;
> +
> +        umemd.state = 0;
> +        umemd.version_id = state.version_id;
> +        umemd.mig_read_fd = mig_read_fd;
> +        umemd.mig_read = mig_read;
> +
> +        mig_write_fd = dup(mig_read_fd);
> +        if (mig_write_fd < 0) {
> +            perror("could not dup for writable socket");
> +            abort();
> +        }
> +        umemd.mig_write_fd = mig_write_fd;
> +        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
> +
> +        postcopy_incoming_umemd(); /* noreturn */
> +    }
> +
> +    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
> +    fd_close(&umemd.to_qemu_fd);
> +    fd_close(&umemd.from_qemu_fd);
> +    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
> +    state.faulted_pages->nr = 0;
> +
> +    /* close all UMem.shmem_fd */
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        umem_close_shmem(block->umem);
> +    }
> +    umem_qemu_wait_for_daemon(state.from_umemd_fd);
> +}
> +
> +static void postcopy_incoming_qemu_recv_quit(void)
> +{
> +    RAMBlock *block;
> +    if (state.state & PIS_STATE_QUIT_RECEIVED) {
> +        return;
> +    }
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (block->umem != NULL) {
> +            umem_destroy(block->umem);
> +            block->umem = NULL;
> +            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> +        }
> +    }
> +
> +    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
> +    state.state |= PIS_STATE_QUIT_RECEIVED;
> +    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
> +    qemu_fclose(state.from_umemd);
> +    state.from_umemd = NULL;
> +    fd_close(&state.from_umemd_fd);
> +}
> +
> +static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
> +{
> +    assert(state.to_umemd != NULL);
> +
> +    nonblock_fflush(state.to_umemd);
> +    if (nonblock_pending_size(state.to_umemd) > 0) {
> +        return;
> +    }
> +
> +    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
> +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> +        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
> +        state.state |= PIS_STATE_QUIT_SENT;
> +        qemu_fclose(state.to_umemd->file);
> +        state.to_umemd = NULL;
> +        fd_close(&state.to_umemd_fd);
> +        g_free(state.faulted_pages);
> +        state.faulted_pages = NULL;
> +    }
> +}
> +
> +static void postcopy_incoming_qemu_fflush_to_umemd(void)
> +{
> +    qemu_set_fd_handler(state.to_umemd->fd, NULL,
> +                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
> +    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
> +}
> +
> +static void postcopy_incoming_qemu_queue_quit(void)
> +{
> +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +
> +    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
> +    umem_qemu_quit(state.to_umemd->file);
> +    state.state |= PIS_STATE_QUIT_QUEUED;
> +}
> +
> +static void postcopy_incoming_qemu_send_pages_present(void)
> +{
> +    if (state.faulted_pages->nr > 0) {
> +        umem_qemu_send_pages_present(state.to_umemd->file,
> +                                     state.faulted_pages);
> +        state.faulted_pages->nr = 0;
> +    }
> +}
> +
> +static void postcopy_incoming_qemu_faulted_pages(
> +    const struct umem_pages *pages)
> +{
> +    assert(pages->nr <= MAX_FAULTED_PAGES);
> +    assert(state.faulted_pages != NULL);
> +
> +    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
> +        postcopy_incoming_qemu_send_pages_present();
> +    }
> +    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
> +           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
> +    state.faulted_pages->nr += pages->nr;
> +}
> +
> +static void postcopy_incoming_qemu_cleanup_umem(void);
> +
> +static int postcopy_incoming_qemu_handle_req_one(void)
> +{
> +    int offset = 0;
> +    int ret;
> +    uint8_t cmd;
> +
> +    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
> +    offset += sizeof(cmd);
> +    if (ret != sizeof(cmd)) {
> +        return -EAGAIN;
> +    }
> +    DPRINTF("cmd %c\n", cmd);
> +
> +    switch (cmd) {
> +    case UMEM_DAEMON_QUIT:
> +        postcopy_incoming_qemu_recv_quit();
> +        postcopy_incoming_qemu_queue_quit();
> +        postcopy_incoming_qemu_cleanup_umem();
> +        break;
> +    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
> +        struct umem_pages *pages =
> +            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
> +            postcopy_incoming_qemu_faulted_pages(pages);
> +            g_free(pages);
> +        }
> +        break;
> +    }
> +    case UMEM_DAEMON_ERROR:
> +        /* the umem daemon hit trouble and asked us to stop vm execution */
> +        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +
> +    if (state.from_umemd != NULL) {
> +        qemu_file_skip(state.from_umemd, offset);
> +    }
> +    return 0;
> +}
> +
> +static void postcopy_incoming_qemu_handle_req(void *opaque)
> +{
> +    do {
> +        int ret = postcopy_incoming_qemu_handle_req_one();
> +        if (ret == -EAGAIN) {
> +            break;
> +        }
> +    } while (state.from_umemd != NULL &&
> +             qemu_pending_size(state.from_umemd) > 0);
> +
> +    if (state.to_umemd != NULL) {
> +        if (state.faulted_pages->nr > 0) {
> +            postcopy_incoming_qemu_send_pages_present();
> +        }
> +        postcopy_incoming_qemu_fflush_to_umemd();
> +    }
> +}
> +
> +void postcopy_incoming_qemu_ready(void)
> +{
> +    umem_qemu_ready(state.to_umemd_fd);
> +
> +    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
> +    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
> +    qemu_set_fd_handler(state.from_umemd_fd,
> +                        postcopy_incoming_qemu_handle_req, NULL, NULL);
> +}
> +
> +static void postcopy_incoming_qemu_cleanup_umem(void)
> +{
> +    /* If qemu quits before postcopy completes, tell the umem daemon
> +       to tear down the umem device and exit. */
> +    if (state.to_umemd_fd >= 0) {
> +        postcopy_incoming_qemu_queue_quit();
> +        postcopy_incoming_qemu_fflush_to_umemd();
> +    }
> +
> +    if (state.dev) {
> +        umem_dev_destroy(state.dev);
> +        state.dev = NULL;
> +    }
> +}
> +
> +void postcopy_incoming_qemu_cleanup(void)
> +{
> +    postcopy_incoming_qemu_cleanup_umem();
> +    if (state.to_umemd != NULL) {
> +        nonblock_wait_for_flush(state.to_umemd);
> +    }
> +}
> +
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> +{
> +    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
> +    size_t len = umem_pages_size(nr);
> +    ram_addr_t end = addr + size;
> +    struct umem_pages *pages;
> +    int i;
> +
> +    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +    pages = g_malloc(len);
> +    pages->nr = nr;
> +    for (i = 0; addr < end; addr += state.host_page_size, i++) {
> +        pages->pgoffs[i] = addr >> state.host_page_shift;
> +    }
> +    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
> +    g_free(pages);
> +    assert(state.to_umemd != NULL);
> +    postcopy_incoming_qemu_fflush_to_umemd();
> +}
> +
> +/**************************************************************************
> + * incoming umem daemon
> + */
> +
> +static void postcopy_incoming_umem_recv_quit(void)
> +{
> +    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
> +        return;
> +    }
> +    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
> +    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
> +    qemu_fclose(umemd.from_qemu);
> +    umemd.from_qemu = NULL;
> +    fd_close(&umemd.from_qemu_fd);
> +}
> +
> +static void postcopy_incoming_umem_queue_quit(void)
> +{
> +    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
> +    umem_daemon_quit(umemd.to_qemu->file);
> +    umemd.state |= UMEM_STATE_QUIT_QUEUED;
> +}
> +
> +static void postcopy_incoming_umem_send_eoc_req(void)
> +{
> +    struct qemu_umem_req req;
> +
> +    if (umemd.state & UMEM_STATE_EOC_SENT) {
> +        return;
> +    }
> +
> +    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
> +    req.cmd = QEMU_UMEM_REQ_EOC;
> +    postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +    umemd.state |= UMEM_STATE_EOC_SENT;
> +    qemu_fclose(umemd.mig_write->file);
> +    umemd.mig_write = NULL;
> +    fd_close(&umemd.mig_write_fd);
> +}
> +
> +static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
> +{
> +    struct qemu_umem_req req;
> +    int bit;
> +    uint64_t target_pgoff;
> +    int i;
> +
> +    umemd.page_request.nr = MAX_REQUESTS;
> +    umem_get_page_request(block->umem, &umemd.page_request);
> +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> +            block->idstr, umemd.page_request.nr,
> +            (uint64_t)umemd.page_request.pgoffs[0],
> +            (uint64_t)umemd.page_request.pgoffs[1]);
> +
> +    if (umemd.last_block_write != block) {
> +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
> +        req.idstr = block->idstr;
> +    } else {
> +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> +    }
> +
> +    req.nr = 0;
> +    req.pgoffs = umemd.target_pgoffs;
> +    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> +        for (i = 0; i < umemd.page_request.nr; i++) {
> +            target_pgoff =
> +                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
> +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> +
> +            if (!test_and_set_bit(bit, umemd.phys_requested)) {
> +                req.pgoffs[req.nr] = target_pgoff;
> +                req.nr++;
> +            }
> +        }
> +    } else {
> +        for (i = 0; i < umemd.page_request.nr; i++) {
> +            int j;
> +            target_pgoff =
> +                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
> +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> +
> +            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
> +                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
> +                    req.pgoffs[req.nr] = target_pgoff + j;
> +                    req.nr++;
> +                }
> +            }
> +        }
> +    }
> +
> +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> +            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
> +    if (req.nr > 0 && umemd.mig_write != NULL) {
> +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +        umemd.last_block_write = block;
> +    }
> +}
> +
> +static void postcopy_incoming_umem_send_pages_present(void)
> +{
> +    if (umemd.present_request->nr > 0) {
> +        umem_daemon_send_pages_present(umemd.to_qemu->file,
> +                                       umemd.present_request);
> +        umemd.present_request->nr = 0;
> +    }
> +}
> +
> +static void postcopy_incoming_umem_pages_present_one(
> +    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
> +{
> +    uint32_t i;
> +    assert(nr <= MAX_PRESENT_REQUESTS);
> +
> +    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
> +        postcopy_incoming_umem_send_pages_present();
> +    }
> +
> +    for (i = 0; i < nr; i++) {
> +        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
> +            pgoffs[i] + ramblock_pgoffset;
> +    }
> +    umemd.present_request->nr += nr;
> +}
> +
> +static void postcopy_incoming_umem_pages_present(
> +    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
> +{
> +    uint32_t left = page_cached->nr;
> +    uint32_t offset = 0;
> +
> +    while (left > 0) {
> +        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
> +        postcopy_incoming_umem_pages_present_one(
> +            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
> +
> +        left -= nr;
> +        offset += nr;
> +    }
> +}
> +
> +static int postcopy_incoming_umem_ram_load(void)
> +{
> +    ram_addr_t offset;
> +    int flags;
> +    int error;
> +    void *shmem;
> +    int i;
> +    int bit;
> +
> +    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
> +        return -EINVAL;
> +    }
> +
> +    offset = qemu_get_be64(umemd.mig_read);
> +
> +    flags = offset & ~TARGET_PAGE_MASK;
> +    offset &= TARGET_PAGE_MASK;
> +
> +    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
> +
> +    if (flags & RAM_SAVE_FLAG_EOS) {
> +        DPRINTF("RAM_SAVE_FLAG_EOS\n");
> +        postcopy_incoming_umem_send_eoc_req();
> +
> +        qemu_fclose(umemd.mig_read);
> +        umemd.mig_read = NULL;
> +        fd_close(&umemd.mig_read_fd);
> +        umemd.state |= UMEM_STATE_EOS_RECEIVED;
> +
> +        postcopy_incoming_umem_queue_quit();
> +        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
> +        return 0;
> +    }
> +
> +    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
> +                                             &umemd.last_block_read);
> +    if (!shmem) {
> +        DPRINTF("shmem == NULL\n");
> +        return -EINVAL;
> +    }
> +
> +    if (flags & RAM_SAVE_FLAG_COMPRESS) {
> +        uint8_t ch = qemu_get_byte(umemd.mig_read);
> +        memset(shmem, ch, TARGET_PAGE_SIZE);
> +    } else if (flags & RAM_SAVE_FLAG_PAGE) {
> +        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
> +    }
> +
> +    error = qemu_file_get_error(umemd.mig_read);
> +    if (error) {
> +        DPRINTF("error %d\n", error);
> +        return error;
> +    }
> +
> +    umemd.page_cached.nr = 0;
> +    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
> +    if (!test_and_set_bit(bit, umemd.phys_received)) {
> +        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> +            __u64 pgoff = offset >> umemd.host_page_shift;
> +            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
> +                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
> +                umemd.page_cached.nr++;
> +            }
> +        } else {
> +            bool mark_cache = true;
> +            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
> +                if (!test_bit(bit + i, umemd.phys_received)) {
> +                    mark_cache = false;
> +                    break;
> +                }
> +            }
> +            if (mark_cache) {
> +                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
> +                umemd.page_cached.nr = 1;
> +            }
> +        }
> +    }
> +
> +    if (umemd.page_cached.nr > 0) {
> +        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
> +
> +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
> +            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
> +            uint64_t ramblock_pgoffset;
> +
> +            ramblock_pgoffset =
> +                umemd.last_block_read->offset >> umemd.host_page_shift;
> +            postcopy_incoming_umem_pages_present(&umemd.page_cached,
> +                                                 ramblock_pgoffset);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static bool postcopy_incoming_umem_check_umem_done(void)
> +{
> +    bool all_done = true;
> +    RAMBlock *block;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        UMem *umem = block->umem;
> +        if (umem != NULL && umem->nsets == umem->nbits) {
> +            umem_unmap_shmem(umem);
> +            umem_destroy(umem);
> +            block->umem = NULL;
> +        }
> +        if (block->umem != NULL) {
> +            all_done = false;
> +        }
> +    }
> +    return all_done;
> +}
> +
> +static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
> +{
> +    int i;
> +
> +    for (i = 0; i < pages->nr; i++) {
> +        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
> +        RAMBlock *block = qemu_get_ram_block(addr);
> +        addr -= block->offset;
> +        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
> +    }
> +    return postcopy_incoming_umem_check_umem_done();
> +}
> +
> +static bool
> +postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
> +{
> +    RAMBlock *block;
> +    ram_addr_t addr;
> +    int i;
> +
> +    struct qemu_umem_req req = {
> +        .cmd = QEMU_UMEM_REQ_REMOVE,
> +        .nr = 0,
> +        .pgoffs = (uint64_t *)pages->pgoffs,
> +    };
> +
> +    addr = pages->pgoffs[0] << umemd.host_page_shift;
> +    block = qemu_get_ram_block(addr);
> +
> +    for (i = 0; i < pages->nr; i++)  {
> +        int pgoff;
> +
> +        addr = pages->pgoffs[i] << umemd.host_page_shift;
> +        pgoff = addr >> TARGET_PAGE_BITS;
> +        if (!test_bit(pgoff, umemd.phys_received) &&
> +            !test_bit(pgoff, umemd.phys_requested)) {
> +            req.pgoffs[req.nr] = pgoff;
> +            req.nr++;
> +        }
> +        set_bit(pgoff, umemd.phys_received);
> +        set_bit(pgoff, umemd.phys_requested);
> +
> +        umem_remove_shmem(block->umem,
> +                          addr - block->offset, umemd.host_page_size);
> +    }
> +    if (req.nr > 0 && umemd.mig_write != NULL) {
> +        req.idstr = block->idstr;
> +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +    }
> +
> +    return postcopy_incoming_umem_check_umem_done();
> +}
> +
> +static void postcopy_incoming_umem_done(void)
> +{
> +    postcopy_incoming_umem_send_eoc_req();
> +    postcopy_incoming_umem_queue_quit();
> +}
> +
> +static int postcopy_incoming_umem_handle_qemu(void)
> +{
> +    int ret;
> +    int offset = 0;
> +    uint8_t cmd;
> +
> +    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
> +    offset += sizeof(cmd);
> +    if (ret != sizeof(cmd)) {
> +        return -EAGAIN;
> +    }
> +    DPRINTF("cmd %c\n", cmd);
> +    switch (cmd) {
> +    case UMEM_QEMU_QUIT:
> +        postcopy_incoming_umem_recv_quit();
> +        postcopy_incoming_umem_done();
> +        break;
> +    case UMEM_QEMU_PAGE_FAULTED: {
> +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> +                                                   &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (postcopy_incoming_umem_page_faulted(pages)) {
> +            postcopy_incoming_umem_done();
> +        }
> +        g_free(pages);
> +        break;
> +    }
> +    case UMEM_QEMU_PAGE_UNMAPPED: {
> +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> +                                                   &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (postcopy_incoming_umem_page_unmapped(pages)) {
> +            postcopy_incoming_umem_done();
> +        }
> +        g_free(pages);
> +        break;
> +    }
> +    default:
> +        abort();
> +        break;
> +    }
> +    if (umemd.from_qemu != NULL) {
> +        qemu_file_skip(umemd.from_qemu, offset);
> +    }
> +    return 0;
> +}
> +
> +static void set_fd(int fd, fd_set *fds, int *nfds)
> +{
> +    FD_SET(fd, fds);
> +    if (fd > *nfds) {
> +        *nfds = fd;
> +    }
> +}
> +
> +static int postcopy_incoming_umemd_main_loop(void)
> +{
> +    fd_set writefds;
> +    fd_set readfds;
> +    int nfds;
> +    RAMBlock *block;
> +    int ret;
> +
> +    int pending_size;
> +    bool get_page_request;
> +
> +    nfds = -1;
> +    FD_ZERO(&writefds);
> +    FD_ZERO(&readfds);
> +
> +    if (umemd.mig_write != NULL) {
> +        pending_size = nonblock_pending_size(umemd.mig_write);
> +        if (pending_size > 0) {
> +            set_fd(umemd.mig_write_fd, &writefds, &nfds);
> +        }
> +    } else {
> +        pending_size = 0;
> +    }
> +
> +#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
> +    /* If page requests to the migration source have accumulated,
> +       suspend taking new page fault requests. */
> +    get_page_request = (pending_size <= PENDING_SIZE_MAX);
> +
> +    if (get_page_request) {
> +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> +            if (block->umem != NULL) {
> +                set_fd(block->umem->fd, &readfds, &nfds);
> +            }
> +        }
> +    }
> +
> +    if (umemd.mig_read_fd >= 0) {
> +        set_fd(umemd.mig_read_fd, &readfds, &nfds);
> +    }
> +
> +    if (umemd.to_qemu != NULL &&
> +        nonblock_pending_size(umemd.to_qemu) > 0) {
> +        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
> +    }
> +    if (umemd.from_qemu_fd >= 0) {
> +        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
> +    }
> +
> +    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
> +    if (ret == -1) {
> +        if (errno == EINTR) {
> +            return 0;
> +        }
> +        return ret;
> +    }
> +
> +    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
> +        nonblock_fflush(umemd.mig_write);
> +    }
> +    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
> +        nonblock_fflush(umemd.to_qemu);
> +    }
> +    if (get_page_request) {
> +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> +            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
> +                postcopy_incoming_umem_send_page_req(block);
> +            }
> +        }
> +    }
> +    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
> +        do {
> +            ret = postcopy_incoming_umem_ram_load();
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        } while (umemd.mig_read != NULL &&
> +                 qemu_pending_size(umemd.mig_read) > 0);
> +    }
> +    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
> +        do {
> +            ret = postcopy_incoming_umem_handle_qemu();
> +            if (ret == -EAGAIN) {
> +                break;
> +            }
> +        } while (umemd.from_qemu != NULL &&
> +                 qemu_pending_size(umemd.from_qemu) > 0);
> +    }
> +
> +    if (umemd.mig_write != NULL) {
> +        nonblock_fflush(umemd.mig_write);
> +    }
> +    if (umemd.to_qemu != NULL) {
> +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
> +            postcopy_incoming_umem_send_pages_present();
> +        }
> +        nonblock_fflush(umemd.to_qemu);
> +        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
> +            nonblock_pending_size(umemd.to_qemu) == 0) {
> +            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
> +            qemu_fclose(umemd.to_qemu->file);
> +            umemd.to_qemu = NULL;
> +            fd_close(&umemd.to_qemu_fd);
> +            umemd.state |= UMEM_STATE_QUIT_SENT;
> +        }
> +    }
> +
> +    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
> +}
> +
> +static void postcopy_incoming_umemd(void)
> +{
> +    ram_addr_t last_ram_offset;
> +    int nbits;
> +    RAMBlock *block;
> +    int ret;
> +
> +    qemu_daemon(1, 1);
> +    signal(SIGPIPE, SIG_IGN);
> +    DPRINTF("daemon pid: %d\n", getpid());
> +
> +    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
> +    umemd.page_cached.pgoffs =
> +        g_new(__u64, MAX_REQUESTS *
> +              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
> +               1: umemd.nr_host_pages_per_target_page));
> +    umemd.target_pgoffs =
> +        g_new(uint64_t, MAX_REQUESTS *
> +              MAX(umemd.nr_host_pages_per_target_page,
> +                  umemd.nr_target_pages_per_host_page));
> +    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
> +    umemd.present_request->nr = 0;
> +
> +    last_ram_offset = qemu_last_ram_offset();
> +    nbits = last_ram_offset >> TARGET_PAGE_BITS;
> +    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> +    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> +    umemd.last_block_read = NULL;
> +    umemd.last_block_write = NULL;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        UMem *umem = block->umem;
> +        umem->umem = NULL;      /* the umem mapping area has VM_DONTCOPY set,
> +                                   so those mappings were lost across fork */
> +        block->host = umem_map_shmem(umem);
> +        umem_close_shmem(umem);
> +    }
> +    umem_daemon_ready(umemd.to_qemu_fd);
> +    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
> +
> +    /* wait for qemu to disown migration_fd */
> +    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
> +    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
> +
> +    DPRINTF("entering umemd main loop\n");
> +    for (;;) {
> +        ret = postcopy_incoming_umemd_main_loop();
> +        if (ret != 0) {
> +            break;
> +        }
> +    }
> +    DPRINTF("exiting umemd main loop\n");
> +
> +    /* This daemon was forked from qemu and the parent qemu is still running.
> +     * Cleanups by linked libraries such as SDL must not be triggered here,
> +     * otherwise the parent qemu might use resources that were already freed.
> +     */
> +    fflush(stdout);
> +    fflush(stderr);
> +    _exit(ret < 0 ? EXIT_FAILURE : 0);
> +}
> diff --git a/migration-tcp.c b/migration-tcp.c
> index cf6a9b8..aa35050 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
>      } while (ret == -1 && (socket_error()) == EINTR);
>  
>      if (ret < 0) {
> -        migrate_fd_error(s);
> -        return;
> +        goto error_out;
>      }
>  
>      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>  
> -    if (val == 0)
> +    if (val == 0) {
> +        ret = postcopy_outgoing_create_read_socket(s);
> +        if (ret < 0) {
> +            goto error_out;
> +        }
>          migrate_fd_connect(s);
> -    else {
> +    } else {
>          DPRINTF("error connecting %d\n", val);
> -        migrate_fd_error(s);
> +        goto error_out;
>      }
> +    return;
> +
> +error_out:
> +    migrate_fd_error(s);
>  }
>  
>  int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> @@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
>  
>      if (ret < 0) {
>          DPRINTF("connect failed\n");
> -        migrate_fd_error(s);
> -        return ret;
> +        goto error_out;
> +    }
> +
> +    ret = postcopy_outgoing_create_read_socket(s);
> +    if (ret < 0) {
> +        goto error_out;
>      }
>      migrate_fd_connect(s);
>      return 0;
> +
> +error_out:
> +    migrate_fd_error(s);
> +    return ret;
>  }
>  
>  static void tcp_accept_incoming_migration(void *opaque)
> @@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
>      }
>  
>      process_incoming_migration(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(c, f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        /* the socket is disowned now, so tell the
> +           umem daemon that it's safe to use it */
> +        postcopy_incoming_qemu_ready();
> +    }
>  out:
>      close(c);
>  out2:
> diff --git a/migration-unix.c b/migration-unix.c
> index dfcf203..3707505 100644
> --- a/migration-unix.c
> +++ b/migration-unix.c
> @@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
>  
>      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>  
> -    if (val == 0)
> +    if (val == 0) {
> +        ret = postcopy_outgoing_create_read_socket(s);
> +        if (ret < 0) {
> +            goto error_out;
> +        }
>          migrate_fd_connect(s);
> -    else {
> +    } else {
>          DPRINTF("error connecting %d\n", val);
> -        migrate_fd_error(s);
> +        goto error_out;
>      }
> +    return;
> +
> +error_out:
> +    migrate_fd_error(s);
>  }
>  
>  int unix_start_outgoing_migration(MigrationState *s, const char *path)
> @@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
>  
>      if (ret < 0) {
>          DPRINTF("connect failed\n");
> -        migrate_fd_error(s);
> -        return ret;
> +        goto error_out;
> +    }
> +
> +    ret = postcopy_outgoing_create_read_socket(s);
> +    if (ret < 0) {
> +        goto error_out;
>      }
>      migrate_fd_connect(s);
>      return 0;
> +
> +error_out:
> +    migrate_fd_error(s);
> +    return ret;
>  }
>  
>  static void unix_accept_incoming_migration(void *opaque)
> @@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
>      }
>  
>      process_incoming_migration(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(c, f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_ready();
> +    }
>  out:
>      close(c);
>  out2:
> diff --git a/migration.c b/migration.c
> index 0149ab3..51efe44 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -39,6 +39,11 @@ enum {
>      MIG_STATE_COMPLETED,
>  };
>  
> +enum {
> +    MIG_SUBSTATE_PRECOPY,
> +    MIG_SUBSTATE_POSTCOPY,
> +};
> +
>  #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
>  
>  static NotifierList migration_state_notifiers =
> @@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
>          return;
>      }
>  
> +    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
> +        /* PRINTF("postcopy background\n"); */
> +        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
> +                                                    s->postcopy);
> +        if (ret > 0) {
> +            migrate_fd_completed(s);
> +        } else if (ret < 0) {
> +            migrate_fd_error(s);
> +        }
> +        return;
> +    }
> +
>      DPRINTF("iterate\n");
>      ret = qemu_savevm_state_iterate(s->mon, s->file);
>      if (ret < 0) {
> @@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
>          DPRINTF("done iterating\n");
>          vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>  
> +        if (s->params.postcopy) {
> +            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> +                migrate_fd_error(s);
> +                if (old_vm_running) {
> +                    vm_start();
> +                }
> +                return;
> +            }
> +            s->substate = MIG_SUBSTATE_POSTCOPY;
> +            s->postcopy = postcopy_outgoing_begin(s);
> +            return;
> +        }
> +
>          if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
>              migrate_fd_error(s);
>          } else {
> @@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
>      int ret;
>  
>      s->state = MIG_STATE_ACTIVE;
> +    s->substate = MIG_SUBSTATE_PRECOPY;
>      s->file = qemu_fopen_ops_buffered(s,
>                                        s->bandwidth_limit,
>                                        migrate_fd_put_buffer,
> diff --git a/migration.h b/migration.h
> index 90ae362..2809e99 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -40,6 +40,12 @@ struct MigrationState
>      int (*write)(MigrationState *s, const void *buff, size_t size);
>      void *opaque;
>      MigrationParams params;
> +
> +    /* for postcopy */
> +    int substate;              /* precopy or postcopy */
> +    int fd_read;
> +    QEMUFile *file_read;        /* connection from the destination */
> +    void *postcopy;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> @@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
>  uint64_t ram_bytes_transferred(void);
>  uint64_t ram_bytes_total(void);
>  
> +void ram_save_set_params(const MigrationParams *params, void *opaque);
>  void sort_ram_list(void);
>  int ram_save_block(QEMUFile *f);
>  void ram_save_memory_set_dirty(void);
> @@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
>   */
>  void migrate_del_blocker(Error *reason);
>  
> +/* For outgoing postcopy */
> +int postcopy_outgoing_create_read_socket(MigrationState *s);
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque);
> +void *postcopy_outgoing_begin(MigrationState *s);
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy);
> +
> +/* For incoming postcopy */
>  extern bool incoming_postcopy;
>  extern unsigned long incoming_postcopy_flags;
>  
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
> +void postcopy_incoming_ram_alloc(const char *name,
> +                                 size_t size, uint8_t **hostp, UMem **umemp);
> +void postcopy_incoming_ram_free(UMem *umem);
> +void postcopy_incoming_prepare(void);
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
> +void postcopy_incoming_qemu_ready(void);
> +void postcopy_incoming_qemu_cleanup(void);
> +#ifdef NEED_CPU_H
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
> +#endif
> +
>  #endif
> diff --git a/qemu-common.h b/qemu-common.h
> index 725922b..d74a8c9 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
>  
>  struct Monitor;
>  typedef struct Monitor Monitor;
> +typedef struct UMem UMem;
>  
>  /* we put basic includes here to avoid repeating them in device drivers */
>  #include <stdlib.h>
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 5c5b8f3..19e20f9 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
>      "-postcopy-flags unsigned-int(flags)\n"
>      "	                flags for postcopy incoming migration\n"
>      "                   when -incoming and -postcopy are specified.\n"
> -    "                   This is for benchmark/debug purpose (default: 0)\n",
> +    "                   This is for benchmark/debug purpose (default: 0)\n"
> +    "                   Currently supprted flags are\n"
> +    "                   1: enable fault request from umemd to qemu\n"
> +    "                      (default: disabled)\n",
>      QEMU_ARCH_ALL)
>  STEXI
>  @item -postcopy-flags int

Can you move umem.c and umem.h to a separate patch, please?
This patch is hard to review at its current size.
> diff --git a/umem.c b/umem.c
> new file mode 100644
> index 0000000..b7be006
> --- /dev/null
> +++ b/umem.c
> @@ -0,0 +1,379 @@
> +/*
> + * umem.c: user process backed memory module for postcopy livemigration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +
> +#include <linux/umem.h>
> +
> +#include "bitops.h"
> +#include "sysemu.h"
> +#include "hw/hw.h"
> +#include "umem.h"
> +
> +//#define DEBUG_UMEM
> +#ifdef DEBUG_UMEM
> +#include <sys/syscall.h>
> +#define DPRINTF(format, ...)                                            \
> +    do {                                                                \
> +        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
> +               __func__, __LINE__, ## __VA_ARGS__);                     \
> +    } while (0)
> +#else
> +#define DPRINTF(format, ...)    do { } while (0)
> +#endif
> +
> +#define DEV_UMEM        "/dev/umem"
> +
> +struct UMemDev {
> +    int fd;
> +    int page_shift;
> +};
> +
> +UMemDev *umem_dev_new(void)
> +{
> +    UMemDev *umem_dev;
> +    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
> +    if (umem_dev_fd < 0) {
> +        perror("can't open "DEV_UMEM);
> +        abort();
> +    }
> +
> +    umem_dev = g_new(UMemDev, 1);
> +    umem_dev->fd = umem_dev_fd;
> +    umem_dev->page_shift = ffs(getpagesize()) - 1;
> +    return umem_dev;
> +}
> +
> +void umem_dev_destroy(UMemDev *dev)
> +{
> +    close(dev->fd);
> +    g_free(dev);
> +}
> +
> +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
> +{
> +    struct umem_create create = {
> +        .size = size,
> +        .async_req_max = 0,
> +        .sync_req_max = 0,
> +    };
> +    UMem *umem;
> +
> +    snprintf(create.name.id, sizeof(create.name.id),
> +             "pid-%"PRId64, (uint64_t)getpid());
> +    create.name.id[UMEM_ID_MAX - 1] = 0;
> +    strncpy(create.name.name, name, sizeof(create.name.name));
> +    create.name.name[UMEM_NAME_MAX - 1] = 0;
> +
> +    assert((size % getpagesize()) == 0);
> +    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
> +        perror("UMEM_DEV_CREATE_UMEM");
> +        abort();
> +    }
> +    if (ftruncate(create.shmem_fd, create.size) < 0) {
> +        perror("truncate(\"shmem_fd\")");
> +        abort();
> +    }
> +
> +    umem = g_new(UMem, 1);
> +    umem->nbits = 0;
> +    umem->nsets = 0;
> +    umem->faulted = NULL;
> +    umem->page_shift = dev->page_shift;
> +    umem->fd = create.umem_fd;
> +    umem->shmem_fd = create.shmem_fd;
> +    umem->size = create.size;
> +    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
> +                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (umem->umem == MAP_FAILED) {
> +        perror("mmap(UMem) failed");
> +        abort();
> +    }
> +    return umem;
> +}
> +
> +void umem_mmap(UMem *umem)
> +{
> +    void *ret = mmap(umem->umem, umem->size,
> +                     PROT_EXEC | PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
> +    if (ret == MAP_FAILED) {
> +        perror("umem_mmap(UMem) failed");
> +        abort();
> +    }
> +}
> +
> +void umem_destroy(UMem *umem)
> +{
> +    if (umem->fd != -1) {
> +        close(umem->fd);
> +    }
> +    if (umem->shmem_fd != -1) {
> +        close(umem->shmem_fd);
> +    }
> +    g_free(umem->faulted);
> +    g_free(umem);
> +}
> +
> +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
> +{
> +    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
> +        perror("daemon: UMEM_GET_PAGE_REQUEST");
> +        abort();
> +    }
> +}
> +
> +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
> +{
> +    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
> +        perror("daemon: UMEM_MARK_PAGE_CACHED");
> +        abort();
> +    }
> +}
> +
> +void umem_unmap(UMem *umem)
> +{
> +    munmap(umem->umem, umem->size);
> +    umem->umem = NULL;
> +}
> +
> +void umem_close(UMem *umem)
> +{
> +    close(umem->fd);
> +    umem->fd = -1;
> +}
> +
> +void *umem_map_shmem(UMem *umem)
> +{
> +    umem->nbits = umem->size >> umem->page_shift;
> +    umem->nsets = 0;
> +    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
> +
> +    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
> +                       umem->shmem_fd, 0);
> +    if (umem->shmem == MAP_FAILED) {
> +        perror("daemon: mmap(\"shmem\")");
> +        abort();
> +    }
> +    return umem->shmem;
> +}
> +
> +void umem_unmap_shmem(UMem *umem)
> +{
> +    munmap(umem->shmem, umem->size);
> +    umem->shmem = NULL;
> +}
> +
> +void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
> +{
> +    int s = offset >> umem->page_shift;
> +    int e = (offset + size) >> umem->page_shift;
> +    int i;
> +
> +    for (i = s; i < e; i++) {
> +        if (!test_and_set_bit(i, umem->faulted)) {
> +            umem->nsets++;
> +#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
> +            madvise(umem->shmem + offset, size, MADV_REMOVE);
> +#endif
> +        }
> +    }
> +}
> +
> +void umem_close_shmem(UMem *umem)
> +{
> +    close(umem->shmem_fd);
> +    umem->shmem_fd = -1;
> +}
> +
> +/***************************************************************************/
> +/* qemu <-> umem daemon communication */
> +
> +size_t umem_pages_size(uint64_t nr)
> +{
> +    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
> +}
> +
> +static void umem_write_cmd(int fd, uint8_t cmd)
> +{
> +    DPRINTF("write cmd %c\n", cmd);
> +
> +    for (;;) {
> +        ssize_t ret = write(fd, &cmd, 1);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EPIPE) {
> +                perror("pipe");
> +                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
> +                        cmd, ret, errno);
> +                break;
> +            }
> +
> +            perror("pipe");
> +            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
> +            abort();
> +        }
> +
> +        break;
> +    }
> +}
> +
> +static void umem_read_cmd(int fd, uint8_t expect)
> +{
> +    uint8_t cmd;
> +    for (;;) {
> +        ssize_t ret = read(fd, &cmd, 1);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +            perror("pipe");
> +            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
> +            abort();
> +        }
> +
> +        if (ret == 0) {
> +            DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
> +            abort();
> +        }
> +
> +        break;
> +    }
> +
> +    DPRINTF("read cmd %c\n", cmd);
> +    if (cmd != expect) {
> +        DPRINTF("cmd %c expect %d\n", cmd, expect);
> +        abort();
> +    }
> +}
> +
> +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
> +{
> +    int ret;
> +    uint64_t nr;
> +    size_t size;
> +    struct umem_pages *pages;
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
> +    *offset += sizeof(nr);
> +    DPRINTF("ret %d nr %ld\n", ret, nr);
> +    if (ret != sizeof(nr) || nr == 0) {
> +        return NULL;
> +    }
> +
> +    size = umem_pages_size(nr);
> +    pages = g_malloc(size);
> +    pages->nr = nr;
> +    size -= sizeof(pages->nr);
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
> +    *offset += size;
> +    if (ret != size) {
> +        g_free(pages);
> +        return NULL;
> +    }
> +    return pages;
> +}
> +
> +static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
> +{
> +    size_t len = umem_pages_size(pages->nr);
> +    qemu_put_buffer(f, (const uint8_t*)pages, len);
> +}
> +
> +/* umem daemon -> qemu */
> +void umem_daemon_ready(int to_qemu_fd)
> +{
> +    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
> +}
> +
> +void umem_daemon_quit(QEMUFile *to_qemu)
> +{
> +    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
> +}
> +
> +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> +                                    struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
> +    umem_send_pages(to_qemu, pages);
> +}
> +
> +void umem_daemon_wait_for_qemu(int from_qemu_fd)
> +{
> +    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
> +}
> +
> +/* qemu -> umem daemon */
> +void umem_qemu_wait_for_daemon(int from_umemd_fd)
> +{
> +    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
> +}
> +
> +void umem_qemu_ready(int to_umemd_fd)
> +{
> +    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
> +}
> +
> +void umem_qemu_quit(QEMUFile *to_umemd)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
> +}
> +
> +/* qemu side handler */
> +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> +                                                int *offset)
> +{
> +    uint64_t i;
> +    int page_shift = ffs(getpagesize()) - 1;
> +    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
> +    if (pages == NULL) {
> +        return NULL;
> +    }
> +
> +    for (i = 0; i < pages->nr; i++) {
> +        ram_addr_t addr = pages->pgoffs[i] << page_shift;
> +
> +        /* make pages present by forcibly triggering page fault. */
> +        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
> +        uint8_t dummy_read = ram[0];
> +        (void)dummy_read;   /* suppress unused variable warning */
> +    }
> +
> +    return pages;
> +}
> +
> +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> +                                  const struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
> +    umem_send_pages(to_umemd, pages);
> +}
> +
> +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> +                                   const struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
> +    umem_send_pages(to_umemd, pages);
> +}
> diff --git a/umem.h b/umem.h
> new file mode 100644
> index 0000000..5ca19ef
> --- /dev/null
> +++ b/umem.h
> @@ -0,0 +1,105 @@
> +/*
> + * umem.h: user process backed memory module for postcopy livemigration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef QEMU_UMEM_H
> +#define QEMU_UMEM_H
> +
> +#include <linux/umem.h>
> +
> +#include "qemu-common.h"
> +
> +typedef struct UMemDev UMemDev;
> +
> +struct UMem {
> +    void *umem;
> +    int fd;
> +    void *shmem;
> +    int shmem_fd;
> +    uint64_t size;
> +
> +    /* indexed by host page size */
> +    int page_shift;
> +    int nbits;
> +    int nsets;
> +    unsigned long *faulted;
> +};
> +
> +UMemDev *umem_dev_new(void);
> +void umem_dev_destroy(UMemDev *dev);
> +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
> +void umem_mmap(UMem *umem);
> +
> +void umem_destroy(UMem *umem);
> +
> +/* umem device operations */
> +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
> +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
> +void umem_unmap(UMem *umem);
> +void umem_close(UMem *umem);
> +
> +/* umem shmem operations */
> +void *umem_map_shmem(UMem *umem);
> +void umem_unmap_shmem(UMem *umem);
> +void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
> +void umem_close_shmem(UMem *umem);
> +
> +/* qemu on source <-> umem daemon communication */
> +
> +struct umem_pages {
> +    uint64_t nr;        /* nr = 0 means completed */
> +    uint64_t pgoffs[0];
> +};
> +
> +/* daemon -> qemu */
> +#define UMEM_DAEMON_READY               'R'
> +#define UMEM_DAEMON_QUIT                'Q'
> +#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
> +#define UMEM_DAEMON_ERROR               'E'
> +
> +/* qemu -> daemon */
> +#define UMEM_QEMU_READY                 'r'
> +#define UMEM_QEMU_QUIT                  'q'
> +#define UMEM_QEMU_PAGE_FAULTED          't'
> +#define UMEM_QEMU_PAGE_UNMAPPED         'u'
> +
> +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
> +size_t umem_pages_size(uint64_t nr);
> +
> +/* for umem daemon */
> +void umem_daemon_ready(int to_qemu_fd);
> +void umem_daemon_wait_for_qemu(int from_qemu_fd);
> +void umem_daemon_quit(QEMUFile *to_qemu);
> +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> +                                    struct umem_pages *pages);
> +
> +/* for qemu */
> +void umem_qemu_wait_for_daemon(int from_umemd_fd);
> +void umem_qemu_ready(int to_umemd_fd);
> +void umem_qemu_quit(QEMUFile *to_umemd);
> +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> +                                                int *offset);
> +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> +                                  const struct umem_pages *pages);
> +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> +                                   const struct umem_pages *pages);
> +
> +#endif /* QEMU_UMEM_H */
> diff --git a/vl.c b/vl.c
> index 5430b8c..17427a0 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
>      default_drive(default_sdcard, snapshot, machine->use_scsi,
>                    IF_SD, 0, SD_OPTS);
>  
> -    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
> -                         ram_save_live, NULL, ram_load, NULL);
> +    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
> +        exit(1);
> +    }
> +    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
> +                         ram_save_set_params, ram_save_live, NULL,
> +                         ram_load, NULL);
>  
>      if (nb_numa_nodes > 0) {
>          int i;
> @@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
>  
>      if (incoming) {
>          runstate_set(RUN_STATE_INMIGRATE);
> +        if (incoming_postcopy) {
> +            postcopy_incoming_prepare();
>+        }

How about moving postcopy_incoming_prepare() into qemu_start_incoming_migration()?

>          int ret = qemu_start_incoming_migration(incoming);
>          if (ret < 0) {
>              fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
> @@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
>      bdrv_close_all();
>      pause_all_vcpus();
>      net_cleanup();
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_cleanup();
> +    }
>      res_free();
>  
>      return 0;

Orit


* Re: [Qemu-devel] [PATCH 21/21] postcopy: implement postcopy livemigration
@ 2011-12-29 15:51     ` Orit Wasserman
From: Orit Wasserman @ 2011-12-29 15:51 UTC (permalink / raw)
  To: Isaku Yamahata, satoshi.itoh; +Cc: t.hirofuchi, qemu-devel, kvm

Hi,
A general comment: this patch is a bit too long, which makes it hard to review.
Can you split it, please?

On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This patch implements postcopy livemigration.
> 
> Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
> ---
>  Makefile.target           |    4 +
>  arch_init.c               |   26 +-
>  cpu-all.h                 |    7 +
>  exec.c                    |   20 +-
>  migration-exec.c          |    8 +
>  migration-fd.c            |   30 +
>  migration-postcopy-stub.c |   77 ++
>  migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
>  migration-tcp.c           |   37 +-
>  migration-unix.c          |   32 +-
>  migration.c               |   31 +
>  migration.h               |   30 +
>  qemu-common.h             |    1 +
>  qemu-options.hx           |    5 +-
>  umem.c                    |  379 +++++++++
>  umem.h                    |  105 +++
>  vl.c                      |   14 +-
>  17 files changed, 2677 insertions(+), 20 deletions(-)
>  create mode 100644 migration-postcopy-stub.c
>  create mode 100644 migration-postcopy.c
>  create mode 100644 umem.c
>  create mode 100644 umem.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index 3261383..d94c53f 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
>  CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
>  CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
>  CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
> +CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
>  
>  include ../config-host.mak
>  include config-devices.mak
> @@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
>  obj-y += memory.o
>  LIBS+=-lz
>  
> +common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
> +common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
> +
>  QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
>  QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
>  QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
> diff --git a/arch_init.c b/arch_init.c
> index bc53092..8b3130d 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
>      return 1;
>  }
>  
> +static bool outgoing_postcopy = false;
> +
> +void ram_save_set_params(const MigrationParams *params, void *opaque)
> +{
> +    outgoing_postcopy = params->postcopy;
> +}
> +
>  static RAMBlock *last_block_sent = NULL;
>  
>  int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
> @@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      uint64_t expected_time = 0;
>      int ret;
>  
> +    if (stage == 1) {
> +        last_block_sent = NULL;
> +
> +        bytes_transferred = 0;
> +        last_block = NULL;
> +        last_offset = 0;

This changes the order of these lines and adds a new empty line.

> +    }
> +    if (outgoing_postcopy) {
> +        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
> +    }
> +

I would just do:

unregister_savevm_live and then register_savevm_live(..., postcopy_outgoing_ram_save_live, ...)
when starting outgoing postcopy migration.

>      if (stage < 0) {
>          cpu_physical_memory_set_dirty_tracking(0);
>          return 0;
> @@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      }
>  
>      if (stage == 1) {
> -        bytes_transferred = 0;
> -        last_block_sent = NULL;
> -        last_block = NULL;
> -        last_offset = 0;
>          sort_ram_list();
>  
>          /* Make sure all dirty bits are set */
> @@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>      int flags;
>      int error;
>  
> +    if (incoming_postcopy) {
> +        return postcopy_incoming_ram_load(f, opaque, version_id);
> +    }
> +
Why not call register_savevm_live(..., postcopy_incoming_ram_load, ...) when starting the guest with postcopy incoming?

>      if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
>          return -EINVAL;
>      }
> diff --git a/cpu-all.h b/cpu-all.h
> index 0244f7a..2e9d8a7 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
>  /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
>  #define RAM_PREALLOC_MASK   (1 << 0)
>  
> +/* RAM is allocated via umem for postcopy incoming mode */
> +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> +
>  typedef struct RAMBlock {
>      uint8_t *host;
>      ram_addr_t offset;
> @@ -485,6 +488,10 @@ typedef struct RAMBlock {
>  #if defined(__linux__) && !defined(TARGET_S390X)
>      int fd;
>  #endif
> +
> +#ifdef CONFIG_POSTCOPY
> +    UMem *umem;    /* for incoming postcopy mode */
> +#endif
>  } RAMBlock;
>  
>  typedef struct RAMList {
> diff --git a/exec.c b/exec.c
> index c8c6692..90b0491 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -35,6 +35,7 @@
>  #include "qemu-timer.h"
>  #include "memory.h"
>  #include "exec-memory.h"
> +#include "migration.h"
>  #if defined(CONFIG_USER_ONLY)
>  #include <qemu.h>
>  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> @@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
>          new_block->host = host;
>          new_block->flags |= RAM_PREALLOC_MASK;
>      } else {
> +#ifdef CONFIG_POSTCOPY
> +        if (incoming_postcopy) {
> +            postcopy_incoming_ram_alloc(name, size,
> +                                        &new_block->host, &new_block->umem);
> +            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
> +        } else
> +#endif
>          if (mem_path) {
>  #if defined (__linux__) && !defined(TARGET_S390X)
>              new_block->host = file_ram_alloc(new_block, size, mem_path);
> @@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
>              QLIST_REMOVE(block, next);
>              if (block->flags & RAM_PREALLOC_MASK) {
>                  ;
> -            } else if (mem_path) {
> +            }
> +#ifdef CONFIG_POSTCOPY
> +            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> +                postcopy_incoming_ram_free(block->umem);
> +            }
> +#endif
> +            else if (mem_path) {
>  #if defined (__linux__) && !defined(TARGET_S390X)
>                  if (block->fd) {
>                      munmap(block->host, block->length);
> @@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>              } else {
>                  flags = MAP_FIXED;
>                  munmap(vaddr, length);
> +                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> +                    postcopy_incoming_qemu_pages_unmapped(addr, length);
> +                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> +                }
>                  if (mem_path) {
>  #if defined(__linux__) && !defined(TARGET_S390X)
>                      if (block->fd) {
> diff --git a/migration-exec.c b/migration-exec.c
> index e14552e..2bd0c3b 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
>  {
>      FILE *f;
>  
> +    if (s->params.postcopy) {
> +        return -ENOSYS;
> +    }
> +
>      f = popen(command, "w");
>      if (f == NULL) {
>          DPRINTF("Unable to popen exec target\n");
> @@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
>  {
>      QEMUFile *f;
>  
> +    if (incoming_postcopy) {
> +        return -ENOSYS;
> +    }
> +
>      DPRINTF("Attempting to start an incoming migration\n");
>      f = qemu_popen_cmd(command, "r");
>      if(f == NULL) {
> diff --git a/migration-fd.c b/migration-fd.c
> index 6211124..5a62ab9 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
>      s->write = fd_write;
>      s->close = fd_close;
>  
> +    if (s->params.postcopy) {
> +        int flags = fcntl(s->fd, F_GETFL);
> +        if ((flags & O_ACCMODE) != O_RDWR) {
> +            goto err_after_open;
> +        }
> +
> +        s->fd_read = dup(s->fd);
> +        if (s->fd_read == -1) {
> +            goto err_after_open;
> +        }
> +        s->file_read = qemu_fdopen(s->fd_read, "r");
> +        if (s->file_read == NULL) {
> +            close(s->fd_read);
> +            goto err_after_open;
> +        }
> +    }
> +
>      migrate_fd_connect(s);
>      return 0;
>  
> @@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
>  
>      process_incoming_migration(f);
>      qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_ready();
> +    }
> +    return;
>  }
>  
>  int fd_start_incoming_migration(const char *infd)
> @@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
>      DPRINTF("Attempting to start an incoming migration via fd\n");
>  
>      fd = strtol(infd, NULL, 0);
> +    if (incoming_postcopy) {
> +        int flags = fcntl(fd, F_GETFL);
> +        if ((flags & O_ACCMODE) != O_RDWR) {
> +            return -EINVAL;
> +        }
> +    }
>      f = qemu_fdopen(fd, "rb");
>      if(f == NULL) {
>          DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
> diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
> new file mode 100644
> index 0000000..0b78de7
> --- /dev/null
> +++ b/migration-postcopy-stub.c
> @@ -0,0 +1,77 @@
> +/*
> + * migration-postcopy-stub.c: postcopy live migration
> + *                            stub functions for unsupported hosts
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "sysemu.h"
> +#include "migration.h"
> +
> +int postcopy_outgoing_create_read_socket(MigrationState *s)
> +{
> +    return -ENOSYS;
> +}
> +
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque)
> +{
> +    return -ENOSYS;
> +}
> +
> +void *postcopy_outgoing_begin(MigrationState *ms)
> +{
> +    return NULL;
> +}
> +
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy)
> +{
> +    return -ENOSYS;
> +}
> +
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> +{
> +    return -ENOSYS;
> +}
> +
> +void postcopy_incoming_prepare(void)
> +{
> +}
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    return -ENOSYS;
> +}
> +
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> +{
> +}
> +
> +void postcopy_incoming_qemu_ready(void)
> +{
> +}
> +
> +void postcopy_incoming_qemu_cleanup(void)
> +{
> +}
> +
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> +{
> +}
> diff --git a/migration-postcopy.c b/migration-postcopy.c
> new file mode 100644
> index 0000000..ed0d574
> --- /dev/null
> +++ b/migration-postcopy.c
> @@ -0,0 +1,1891 @@
> +/*
> + * migration-postcopy.c: postcopy live migration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "bitmap.h"
> +#include "sysemu.h"
> +#include "hw/hw.h"
> +#include "arch_init.h"
> +#include "migration.h"
> +#include "umem.h"
> +
> +#include "memory.h"
> +#define WANT_EXEC_OBSOLETE
> +#include "exec-obsolete.h"
> +
> +//#define DEBUG_POSTCOPY
> +#ifdef DEBUG_POSTCOPY
> +#include <sys/syscall.h>
> +#define DPRINTF(fmt, ...)                                               \
> +    do {                                                                \
> +        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
> +               __func__, __LINE__, ## __VA_ARGS__);                     \
> +    } while (0)
> +#else
> +#define DPRINTF(fmt, ...)       do { } while (0)
> +#endif
> +
> +#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
> +
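
As an aside, the ALIGN_UP mask trick above only works when align is a power of two; a quick standalone check (just the macro reproduced, nothing QEMU-specific):

```c
#include <assert.h>
#include <stddef.h>

/* Same mask trick as in the patch; align must be a power of two,
 * otherwise the complement mask does not clear the low bits. */
#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
```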
> +static void fd_close(int *fd)
> +{
> +    if (*fd >= 0) {
> +        close(*fd);
> +        *fd = -1;
> +    }
> +}
> +
> +/***************************************************************************
> + * QEMUFile for non blocking pipe
> + */
> +
> +/* read only */
> +struct QEMUFilePipe {
> +    int fd;
> +    QEMUFile *file;
> +};

Why not use QEMUFileSocket?
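
For reference, the retry discipline pipe_get_buffer implements — restart on EINTR, treat read() == 0 as the writer closing the pipe, and report an errno only when nothing has been read yet — can be exercised against a plain pipe. read_full below is an illustrative stand-in, not a QEMU function:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* read_full (illustrative): keep reading until 'size' bytes arrive,
 * the writer closes the fd (short count returned), or a real error
 * occurs with nothing read yet (negative errno returned). */
static ssize_t read_full(int fd, uint8_t *buf, size_t size)
{
    ssize_t len = 0;

    while (size > 0) {
        ssize_t ret = read(fd, buf, size);
        if (ret == -1) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal: retry */
            }
            if (len == 0) {
                len = -errno;       /* nothing read yet: report the error */
            }
            break;
        }
        if (ret == 0) {
            break;                  /* write end closed: return short count */
        }
        len += ret;
        buf += ret;
        size -= ret;
    }
    return len;
}
```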

> +typedef struct QEMUFilePipe QEMUFilePipe;
> +
> +static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFilePipe *s = opaque;
> +    ssize_t len = 0;
> +
> +    while (size > 0) {
> +        ssize_t ret = read(s->fd, buf, size);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +            if (len == 0) {
> +                len = -errno;
> +            }
> +            break;
> +        }
> +
> +        if (ret == 0) {
> +            /* the write end of the pipe is closed */
> +            break;
> +        }
> +        len += ret;
> +        buf += ret;
> +        size -= ret;
> +    }
> +
> +    return len;
> +}
> +
> +static int pipe_close(void *opaque)
> +{
> +    QEMUFilePipe *s = opaque;
> +    g_free(s);
> +    return 0;
> +}
> +
> +static QEMUFile *qemu_fopen_pipe(int fd)
> +{
> +    QEMUFilePipe *s = g_malloc0(sizeof(*s));
> +
> +    s->fd = fd;
> +    fcntl_setfl(fd, O_NONBLOCK);
> +    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
> +                             NULL, NULL, NULL);
> +    return s->file;
> +}
> +
> +/* write only */
> +struct QEMUFileNonblock {
> +    int fd;
> +    QEMUFile *file;
> +
> +    /* for pipe-write nonblocking mode */
> +#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
> +    uint8_t *buffer;
> +    size_t buffer_size;
> +    size_t buffer_capacity;
> +    bool freeze_output;
> +};
> +typedef struct QEMUFileNonblock QEMUFileNonblock;
> +

Couldn't you use QEMUFileBuffered?

> +static void nonblock_flush_buffer(QEMUFileNonblock *s)
> +{
> +    size_t offset = 0;
> +    ssize_t ret;
> +
> +    while (offset < s->buffer_size) {
> +        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EAGAIN) {
> +                s->freeze_output = true;
> +            } else {
> +                qemu_file_set_error(s->file, errno);
> +            }
> +            break;
> +        }
> +
> +        if (ret == 0) {
> +            DPRINTF("ret == 0\n");
> +            break;
> +        }
> +
> +        offset += ret;
> +    }
> +
> +    if (offset > 0) {
> +        assert(s->buffer_size >= offset);
> +        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
> +        s->buffer_size -= offset;
> +    }
> +    if (s->buffer_size > 0) {
> +        s->freeze_output = true;
> +    }
> +}
> +
> +static int nonblock_put_buffer(void *opaque,
> +                               const uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFileNonblock *s = opaque;
> +    int error;
> +    ssize_t len = 0;
> +
> +    error = qemu_file_get_error(s->file);
> +    if (error) {
> +        return error;
> +    }
> +
> +    nonblock_flush_buffer(s);
> +    error = qemu_file_get_error(s->file);
> +    if (error) {
> +        return error;
> +    }
> +
> +    while (!s->freeze_output && size > 0) {
> +        ssize_t ret;
> +        assert(s->buffer_size == 0);
> +
> +        ret = write(s->fd, buf, size);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EAGAIN) {
> +                s->freeze_output = true;
> +            } else {
> +                qemu_file_set_error(s->file, errno);
> +            }
> +            break;
> +        }
> +
> +        len += ret;
> +        buf += ret;
> +        size -= ret;
> +    }
> +
> +    if (size > 0) {
> +        int inc = size - (s->buffer_capacity - s->buffer_size);
> +        if (inc > 0) {
> +            s->buffer_capacity +=
> +                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
> +            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
> +        }
> +        memcpy(s->buffer + s->buffer_size, buf, size);
> +        s->buffer_size += size;
> +
> +        len += size;
> +    }
> +
> +    return len;
> +}
> +
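
When the pipe would block, nonblock_put_buffer above spills the unwritten bytes into a buffer whose capacity grows in whole BUF_SIZE_INC (32 KiB) increments via DIV_ROUND_UP. The capacity arithmetic in isolation (grow_capacity is an illustrative name, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE_INC        (32 * 1024)
#define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

/* grow_capacity (illustrative): new capacity after reserving 'needed'
 * more bytes on top of 'used', rounding the shortfall up to whole
 * BUF_SIZE_INC increments, as nonblock_put_buffer does. */
static size_t grow_capacity(size_t capacity, size_t used, size_t needed)
{
    if (needed > capacity - used) {
        size_t inc = needed - (capacity - used);  /* bytes short */
        capacity += DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
    }
    return capacity;
}
```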
> +static int nonblock_pending_size(QEMUFileNonblock *s)
> +{
> +    return qemu_pending_size(s->file) + s->buffer_size;
> +}
> +
> +static void nonblock_fflush(QEMUFileNonblock *s)
> +{
> +    s->freeze_output = false;
> +    nonblock_flush_buffer(s);
> +    if (!s->freeze_output) {
> +        qemu_fflush(s->file);
> +    }
> +}
> +
> +static void nonblock_wait_for_flush(QEMUFileNonblock *s)
> +{
> +    while (nonblock_pending_size(s) > 0) {
> +        fd_set fds;
> +        FD_ZERO(&fds);
> +        FD_SET(s->fd, &fds);
> +        select(s->fd + 1, NULL, &fds, NULL, NULL);
> +
> +        nonblock_fflush(s);
> +    }
> +}
> +
> +static int nonblock_close(void *opaque)
> +{
> +    QEMUFileNonblock *s = opaque;
> +    nonblock_wait_for_flush(s);
> +    g_free(s->buffer);
> +    g_free(s);
> +    return 0;
> +}
> +
> +static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
> +{
> +    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
> +
> +    s->fd = fd;
> +    fcntl_setfl(fd, O_NONBLOCK);
> +    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
> +                             NULL, NULL, NULL);
> +    return s;
> +}
> +
> +/***************************************************************************
> + * umem daemon on destination <-> qemu on source protocol
> + */
> +
> +#define QEMU_UMEM_REQ_INIT              0x00
> +#define QEMU_UMEM_REQ_ON_DEMAND         0x01
> +#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
> +#define QEMU_UMEM_REQ_BACKGROUND        0x03
> +#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
> +#define QEMU_UMEM_REQ_REMOVE            0x05
> +#define QEMU_UMEM_REQ_EOC               0x06
> +
> +struct qemu_umem_req {
> +    int8_t cmd;
> +    uint8_t len;
> +    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
> +    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
> +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> +
> +    /* in target page size units, as in the qemu migration protocol */
> +    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
> +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> +};
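
The struct above maps onto the wire as one cmd byte, then (for the idstr commands) a length byte plus the idstr bytes, then a big-endian u32 page count and big-endian u64 page offsets. A hand-rolled encoder for an ON_DEMAND request, just to make the framing concrete — encode_be32/encode_be64/encode_req are illustrative helpers (qemu_put_be32/qemu_put_be64 do the equivalent on the real stream), and "pc.ram" is only an example idstr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative big-endian packing helpers. */
static size_t encode_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
    return 4;
}

static size_t encode_be64(uint8_t *p, uint64_t v)
{
    size_t i;
    for (i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
    return 8;
}

/* Encode cmd, len-prefixed idstr, nr, pgoffs, mirroring the framing
 * of postcopy_incoming_send_req_one for the ON_DEMAND case. */
static size_t encode_req(uint8_t *p, uint8_t cmd, const char *idstr,
                         uint32_t nr, const uint64_t *pgoffs)
{
    size_t off = 0;
    uint32_t i;

    p[off++] = cmd;
    p[off++] = (uint8_t)strlen(idstr);
    memcpy(p + off, idstr, strlen(idstr));
    off += strlen(idstr);
    off += encode_be32(p + off, nr);
    for (i = 0; i < nr; i++) {
        off += encode_be64(p + off, pgoffs[i]);
    }
    return off;
}
```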
> +
> +static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
> +{
> +    qemu_put_byte(f, strlen(idstr));
> +    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
> +}
> +
> +static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
> +                                              const uint64_t *pgoffs)
> +{
> +    uint32_t i;
> +
> +    qemu_put_be32(f, nr);
> +    for (i = 0; i < nr; i++) {
> +        qemu_put_be64(f, pgoffs[i]);
> +    }
> +}
> +
> +static void postcopy_incoming_send_req_one(QEMUFile *f,
> +                                           const struct qemu_umem_req *req)
> +{
> +    DPRINTF("cmd %d\n", req->cmd);
> +    qemu_put_byte(f, req->cmd);
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +    case QEMU_UMEM_REQ_REMOVE:
> +        postcopy_incoming_send_req_idstr(f, req->idstr);
> +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +}
> +
> +/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024 bytes.
> + * So one message must be <= IO_BUF_SIZE:
> + * cmd: 1
> + * id len: 1
> + * id: 256
> + * nr: 4 (sent as be32)
> + */
> +#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 4) / sizeof(uint64_t))
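
Either way the header overhead rounds into the same count under integer division — 4063 page offsets per message, comfortably within the 32 KiB budget. A quick check of the arithmetic (assuming 8-byte uint64_t; the macro here is illustrative, parameterized on the header size):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: offsets that fit after 'hdr' bytes of fixed header
 * (cmd + idstr length + idstr + nr) in one 32 KiB QEMUFile buffer. */
#define IO_BUF_SIZE         (32 * 1024)
#define MAX_PAGE_NR(hdr)    ((IO_BUF_SIZE - (hdr)) / sizeof(uint64_t))
```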
> +static void postcopy_incoming_send_req(QEMUFile *f,
> +                                       const struct qemu_umem_req *req)
> +{
> +    uint32_t nr = req->nr;
> +    struct qemu_umem_req tmp = *req;
> +
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        postcopy_incoming_send_req_one(f, &tmp);
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +        tmp.nr = MIN(nr, MAX_PAGE_NR);
> +        postcopy_incoming_send_req_one(f, &tmp);
> +
> +        nr -= tmp.nr;
> +        tmp.pgoffs += tmp.nr;
> +        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
> +            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> +        } else {
> +            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
> +        }
> +        /* fall through */
> +    case QEMU_UMEM_REQ_REMOVE:
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        while (nr > 0) {
> +            tmp.nr = MIN(nr, MAX_PAGE_NR);
> +            postcopy_incoming_send_req_one(f, &tmp);
> +
> +            nr -= tmp.nr;
> +            tmp.pgoffs += tmp.nr;
> +        }
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +}
> +
> +static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
> +                                            struct qemu_umem_req *req,
> +                                            size_t *offset)
> +{
> +    int ret;
> +
> +    req->len = qemu_peek_byte(f, *offset);
> +    *offset += 1;
> +    if (req->len == 0) {
> +        return -EAGAIN;
> +    }
> +    req->idstr = g_malloc((int)req->len + 1);
> +    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
> +    *offset += ret;
> +    if (ret != req->len) {
> +        g_free(req->idstr);
> +        req->idstr = NULL;
> +        return -EAGAIN;
> +    }
> +    req->idstr[req->len] = 0;
> +    return 0;
> +}
> +
> +static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
> +                                             struct qemu_umem_req *req,
> +                                             size_t *offset)
> +{
> +    int ret;
> +    uint32_t be32;
> +    uint32_t i;
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
> +    *offset += sizeof(be32);
> +    if (ret != sizeof(be32)) {
> +        return -EAGAIN;
> +    }
> +
> +    req->nr = be32_to_cpu(be32);
> +    req->pgoffs = g_new(uint64_t, req->nr);
> +    for (i = 0; i < req->nr; i++) {
> +        uint64_t be64;
> +        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
> +        *offset += sizeof(be64);
> +        if (ret != sizeof(be64)) {
> +            g_free(req->pgoffs);
> +            req->pgoffs = NULL;
> +            return -EAGAIN;
> +        }
> +        req->pgoffs[i] = be64_to_cpu(be64);
> +    }
> +    return 0;
> +}
> +
> +static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
> +{
> +    int size;
> +    int ret;
> +    size_t offset = 0;
> +
> +    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
> +    if (size <= 0) {
> +        return -EAGAIN;
> +    }
> +    offset += 1;
> +
> +    switch (req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +    case QEMU_UMEM_REQ_EOC:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +    case QEMU_UMEM_REQ_REMOVE:
> +        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        break;
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +    qemu_file_skip(f, offset);
> +    DPRINTF("cmd %d\n", req->cmd);
> +    return 0;
> +}
> +
> +static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
> +{
> +    g_free(req->idstr);
> +    g_free(req->pgoffs);
> +}
> +
> +/***************************************************************************
> + * outgoing part
> + */
> +
> +#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
> +#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
> +#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
> +
> +enum POState {
> +    PO_STATE_ERROR_RECEIVE,
> +    PO_STATE_ACTIVE,
> +    PO_STATE_EOC_RECEIVED,
> +    PO_STATE_ALL_PAGES_SENT,
> +    PO_STATE_COMPLETED,
> +};
> +typedef enum POState POState;
> +
> +struct PostcopyOutgoingState {
> +    POState state;
> +    QEMUFile *mig_read;
> +    int fd_read;
> +    RAMBlock *last_block_read;
> +
> +    QEMUFile *mig_buffered_write;
> +    MigrationState *ms;
> +
> +    /* For nobg mode. Check if all pages are sent */
> +    RAMBlock *block;
> +    ram_addr_t addr;
> +};
> +typedef struct PostcopyOutgoingState PostcopyOutgoingState;
> +
> +int postcopy_outgoing_create_read_socket(MigrationState *s)
> +{
> +    if (!s->params.postcopy) {
> +        return 0;
> +    }
> +
> +    s->fd_read = dup(s->fd);
> +    if (s->fd_read == -1) {
> +        int ret = -errno;
> +        perror("dup");
> +        return ret;
> +    }
> +    s->file_read = qemu_fopen_socket(s->fd_read);
> +    if (s->file_read == NULL) {
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque)
> +{
> +    int ret = 0;
> +    DPRINTF("stage %d\n", stage);
> +    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
> +        sort_ram_list();
> +        ram_save_live_mem_size(f);
> +    }
> +    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
> +        ret = 1;
> +    }
> +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    return ret;
> +}
> +
> +static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
> +{
> +    RAMBlock *block;
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (!strncmp(idstr, block->idstr, strlen(idstr))) {
> +            return block;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * return value
> + *   0: continue postcopy mode
> + * > 0: completed postcopy mode.
> + * < 0: error
> + */
> +static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
> +                                        const struct qemu_umem_req *req,
> +                                        bool *written)
> +{
> +    int i;
> +    RAMBlock *block;
> +
> +    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
> +    switch(req->cmd) {
> +    case QEMU_UMEM_REQ_INIT:
> +        /* nothing */
> +        break;
> +    case QEMU_UMEM_REQ_EOC:
> +        /* the destination tells us to finish the migration. */
> +        if (s->state == PO_STATE_ALL_PAGES_SENT) {
> +            s->state = PO_STATE_COMPLETED;
> +            DPRINTF("-> PO_STATE_COMPLETED\n");
> +        } else {
> +            s->state = PO_STATE_EOC_RECEIVED;
> +            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
> +        }
> +        return 1;
> +    case QEMU_UMEM_REQ_ON_DEMAND:
> +    case QEMU_UMEM_REQ_BACKGROUND:
> +        DPRINTF("idstr: %s\n", req->idstr);
> +        block = postcopy_outgoing_find_block(req->idstr);
> +        if (block == NULL) {
> +            return -EINVAL;
> +        }
> +        s->last_block_read = block;
> +        /* fall through */
> +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> +        DPRINTF("nr %d\n", req->nr);
> +        for (i = 0; i < req->nr; i++) {
> +            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
> +            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
> +                                    req->pgoffs[i] << TARGET_PAGE_BITS);
> +            if (ret > 0) {
> +                *written = true;
> +            }
> +        }
> +        break;
> +    case QEMU_UMEM_REQ_REMOVE:
> +        block = postcopy_outgoing_find_block(req->idstr);
> +        if (block == NULL) {
> +            return -EINVAL;
> +        }
> +        for (i = 0; i < req->nr; i++) {
> +            ram_addr_t addr = block->offset +
> +                (req->pgoffs[i] << TARGET_PAGE_BITS);
> +            cpu_physical_memory_reset_dirty(addr,
> +                                            addr + TARGET_PAGE_SIZE,
> +                                            MIGRATION_DIRTY_FLAG);
> +        }
> +        break;
> +    default:
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
> +{
> +    if (s->mig_read != NULL) {
> +        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
> +        qemu_fclose(s->mig_read);
> +        s->mig_read = NULL;
> +        fd_close(&s->fd_read);
> +
> +        s->ms->file_read = NULL;
> +        s->ms->fd_read = -1;
> +    }
> +}
> +
> +static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
> +{
> +    postcopy_outgoing_close_mig_read(s);
> +    s->ms->postcopy = NULL;
> +    g_free(s);
> +}
> +
> +static void postcopy_outgoing_recv_handler(void *opaque)
> +{
> +    PostcopyOutgoingState *s = opaque;
> +    bool written = false;
> +    int ret = 0;
> +
> +    assert(s->state == PO_STATE_ACTIVE ||
> +           s->state == PO_STATE_ALL_PAGES_SENT);
> +
> +    do {
> +        struct qemu_umem_req req = {.idstr = NULL,
> +                                    .pgoffs = NULL};
> +
> +        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
> +        if (ret < 0) {
> +            if (ret == -EAGAIN) {
> +                ret = 0;
> +            }
> +            break;
> +        }
> +        if (s->state == PO_STATE_ACTIVE) {
> +            ret = postcopy_outgoing_handle_req(s, &req, &written);
> +        }
> +        postcopy_outgoing_free_req(&req);
> +    } while (ret == 0);
> +
> +    /*
> +     * flush buffered_file.
> +     * Although mig_write is rate-limited buffered file, those written pages
> +     * are requested on demand by the destination. So forcibly push
> +     * those pages ignoring rate limiting
> +     */
> +    if (written) {
> +        qemu_fflush(s->mig_buffered_write);
> +        /* qemu_buffered_file_drain(s->mig_buffered_write); */
> +    }
> +
> +    if (ret < 0) {
> +        switch (s->state) {
> +        case PO_STATE_ACTIVE:
> +            s->state = PO_STATE_ERROR_RECEIVE;
> +            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
> +            break;
> +        case PO_STATE_ALL_PAGES_SENT:
> +            s->state = PO_STATE_COMPLETED;
> +            DPRINTF("-> PO_STATE_COMPLETED\n");
> +            break;
> +        default:
> +            abort();
> +        }
> +    }
> +    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
> +        postcopy_outgoing_close_mig_read(s);
> +    }
> +    if (s->state == PO_STATE_COMPLETED) {
> +        DPRINTF("PO_STATE_COMPLETED\n");
> +        MigrationState *ms = s->ms;
> +        postcopy_outgoing_completed(s);
> +        migrate_fd_completed(ms);
> +    }
> +}
> +
> +void *postcopy_outgoing_begin(MigrationState *ms)
> +{
> +    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
> +    DPRINTF("outgoing begin\n");
> +    qemu_fflush(ms->file);
> +
> +    s->ms = ms;
> +    s->state = PO_STATE_ACTIVE;
> +    s->fd_read = ms->fd_read;
> +    s->mig_read = ms->file_read;
> +    s->mig_buffered_write = ms->file;
> +    s->block = NULL;
> +    s->addr = 0;
> +
> +    /* Make sure all dirty bits are set */
> +    ram_save_memory_set_dirty();
> +
> +    qemu_set_fd_handler(s->fd_read,
> +                        &postcopy_outgoing_recv_handler, NULL, s);
> +    return s;
> +}
> +
> +static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
> +                                           PostcopyOutgoingState *s)
> +{
> +    assert(s->state == PO_STATE_ACTIVE);
> +
> +    s->state = PO_STATE_ALL_PAGES_SENT;
> +    /* tell incoming side that all pages are sent */
> +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +    qemu_fflush(f);
> +    qemu_buffered_file_drain(f);
> +    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
> +    migrate_fd_cleanup(s->ms);
> +
> +    /* Later migrate_fd_completed() will be called, which calls
> +     * migrate_fd_cleanup() again. So a dummy file is created
> +     * to keep the qemu monitor working.
> +     */
> +    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
> +                                 NULL, NULL);
> +}
> +
> +static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
> +                                                RAMBlock *block,
> +                                                ram_addr_t addr)
> +{
> +    if (block == NULL) {
> +        block = QLIST_FIRST(&ram_list.blocks);
> +        addr = block->offset;
> +    }
> +
> +    for (; block != NULL;
> +         block = QLIST_NEXT(block, next), addr = block ? block->offset : 0) {
> +        for (; addr < block->offset + block->length;
> +             addr += TARGET_PAGE_SIZE) {
> +            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
> +                s->block = block;
> +                s->addr = addr;
> +                return 0;
> +            }
> +        }
> +    }
> +
> +    return 1;
> +}
> +
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy)
> +{
> +    PostcopyOutgoingState *s = postcopy;
> +
> +    assert(s->state == PO_STATE_ACTIVE ||
> +           s->state == PO_STATE_EOC_RECEIVED ||
> +           s->state == PO_STATE_ERROR_RECEIVE);
> +
> +    switch (s->state) {
> +    case PO_STATE_ACTIVE:
> +        /* nothing. processed below */
> +        break;
> +    case PO_STATE_EOC_RECEIVED:
> +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> +        s->state = PO_STATE_COMPLETED;
> +        postcopy_outgoing_completed(s);
> +        DPRINTF("PO_STATE_COMPLETED\n");
> +        return 1;
> +    case PO_STATE_ERROR_RECEIVE:
> +        postcopy_outgoing_completed(s);
> +        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
> +        return -1;
> +    default:
> +        abort();
> +    }
> +
> +    if (s->ms->params.nobg) {
> +        /* See if all pages are sent. */
> +        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
> +            return 0;
> +        }
> +        /* ram_list can be reordered (although that doesn't seem to happen
> +           during migration), so the whole list needs to be checked again */
> +        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
> +            return 0;
> +        }
> +
> +        postcopy_outgoing_ram_all_sent(f, s);
> +        return 0;
> +    }
> +
> +    DPRINTF("outgoing background state: %d\n", s->state);
> +
> +    while (qemu_file_rate_limit(f) == 0) {
> +        if (ram_save_block(f) == 0) { /* no more blocks */
> +            assert(s->state == PO_STATE_ACTIVE);
> +            postcopy_outgoing_ram_all_sent(f, s);
> +            return 0;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/***************************************************************************
> + * incoming part
> + */
> +
> +/* flags for incoming mode to modify the behavior.
> +   This is for benchmark/debug purposes */
> +#define INCOMING_FLAGS_FAULT_REQUEST 0x01
> +
> +
> +static void postcopy_incoming_umemd(void);
> +
> +#define PIS_STATE_QUIT_RECEIVED         0x01
> +#define PIS_STATE_QUIT_QUEUED           0x02
> +#define PIS_STATE_QUIT_SENT             0x04
> +
> +#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
> +                                         PIS_STATE_QUIT_QUEUED | \
> +                                         PIS_STATE_QUIT_SENT)
> +
> +struct PostcopyIncomingState {
> +    /* dest qemu state */
> +    uint32_t    state;
> +
> +    UMemDev *dev;
> +    int host_page_size;
> +    int host_page_shift;
> +
> +    /* qemu side */
> +    int to_umemd_fd;
> +    QEMUFileNonblock *to_umemd;
> +#define MAX_FAULTED_PAGES       256
> +    struct umem_pages *faulted_pages;
> +
> +    int from_umemd_fd;
> +    QEMUFile *from_umemd;
> +    int version_id;     /* save/load format version id */
> +};
> +typedef struct PostcopyIncomingState PostcopyIncomingState;
> +
> +
> +#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
> +#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
> +#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
> +#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
> +#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
> +
> +#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
> +                                         UMEM_STATE_QUIT_SENT | \
> +                                         UMEM_STATE_QUIT_RECEIVED)
> +#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
> +                                         UMEM_STATE_EOC_SENT | \
> +                                         UMEM_STATE_QUIT_MASK)
> +
> +struct PostcopyIncomingUMemDaemon {
> +    /* umem daemon side */
> +    uint32_t state;
> +
> +    int host_page_size;
> +    int host_page_shift;
> +    int nr_host_pages_per_target_page;
> +    int host_to_target_page_shift;
> +    int nr_target_pages_per_host_page;
> +    int target_to_host_page_shift;
> +    int version_id;     /* save/load format version id */
> +
> +    int to_qemu_fd;
> +    QEMUFileNonblock *to_qemu;
> +    int from_qemu_fd;
> +    QEMUFile *from_qemu;
> +
> +    int mig_read_fd;
> +    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
> +
> +    int mig_write_fd;
> +    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
> +
> +    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPUS + 1) */
> +#define MAX_REQUESTS    (512 * (64 + 1))
> +
> +    struct umem_page_request page_request;
> +    struct umem_page_cached page_cached;
> +
> +#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
> +    struct umem_pages *present_request;
> +
> +    uint64_t *target_pgoffs;
> +
> +    /* bitmap indexed by target page offset */
> +    unsigned long *phys_requested;
> +
> +    /* bitmap indexed by target page offset */
> +    unsigned long *phys_received;
> +
> +    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
> +    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
> +};
> +typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
> +
> +static PostcopyIncomingState state = {
> +    .state = 0,
> +    .dev = NULL,
> +    .to_umemd_fd = -1,
> +    .to_umemd = NULL,
> +    .from_umemd_fd = -1,
> +    .from_umemd = NULL,
> +};
> +
> +static PostcopyIncomingUMemDaemon umemd = {
> +    .state = 0,
> +    .to_qemu_fd = -1,
> +    .to_qemu = NULL,
> +    .from_qemu_fd = -1,
> +    .from_qemu = NULL,
> +    .mig_read_fd = -1,
> +    .mig_read = NULL,
> +    .mig_write_fd = -1,
> +    .mig_write = NULL,
> +};
> +
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> +{
> +    /* incoming_postcopy makes sense only in incoming migration mode */
> +    if (!incoming && incoming_postcopy) {
> +        return -EINVAL;
> +    }
> +
> +    if (!incoming_postcopy) {
> +        return 0;
> +    }
> +
> +    state.state = 0;
> +    state.dev = umem_dev_new();
> +    state.host_page_size = getpagesize();
> +    state.host_page_shift = ffs(state.host_page_size) - 1;
> +    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
> +                                               ram_save_live() */
> +    return 0;
> +}
> +
> +void postcopy_incoming_ram_alloc(const char *name,
> +                                 size_t size, uint8_t **hostp, UMem **umemp)
> +{
> +    UMem *umem;
> +    size = ALIGN_UP(size, state.host_page_size);
> +    umem = umem_dev_create(state.dev, size, name);
> +
> +    *umemp = umem;
> +    *hostp = umem->umem;
> +}
> +
> +void postcopy_incoming_ram_free(UMem *umem)
> +{
> +    umem_unmap(umem);
> +    umem_close(umem);
> +    umem_destroy(umem);
> +}
> +
> +void postcopy_incoming_prepare(void)
> +{
> +    RAMBlock *block;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (block->umem != NULL) {
> +            umem_mmap(block->umem);
> +        }
> +    }
> +}
> +
> +static int postcopy_incoming_ram_load_get64(QEMUFile *f,
> +                                             ram_addr_t *addr, int *flags)
> +{
> +    *addr = qemu_get_be64(f);
> +    *flags = *addr & ~TARGET_PAGE_MASK;
> +    *addr &= TARGET_PAGE_MASK;
> +    return qemu_file_get_error(f);
> +}
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    ram_addr_t addr;
> +    int flags;
> +    int error;
> +
> +    DPRINTF("incoming ram load\n");
> +    /*
> +     * RAM_SAVE_FLAGS_EOS or
> +     * RAM_SAVE_FLAGS_MEM_SIZE + mem size + RAM_SAVE_FLAGS_EOS
> +     * see postcopy_outgoing_ram_save_live()
> +     */
> +
> +    if (version_id != RAM_SAVE_VERSION_ID) {
> +        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
> +                version_id, RAM_SAVE_VERSION_ID);
> +        return -EINVAL;
> +    }
> +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
> +    if (error) {
> +        DPRINTF("error %d\n", error);
> +        return error;
> +    }
> +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +        DPRINTF("EOS\n");
> +        return 0;
> +    }
> +
> +    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
> +        DPRINTF("-EINVAL flags 0x%x\n", flags);
> +        return -EINVAL;
> +    }
> +    error = ram_load_mem_size(f, addr);
> +    if (error) {
> +        DPRINTF("addr 0x%lx error %d\n", addr, error);
> +        return error;
> +    }
> +
> +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> +    if (error) {
> +        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
> +        return error;
> +    }
> +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> +        DPRINTF("done\n");
> +        return 0;
> +    }
> +    DPRINTF("-EINVAL\n");
> +    return -EINVAL;
> +}
> +
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> +{
> +    int fds[2];
> +    RAMBlock *block;
> +
> +    DPRINTF("fork\n");
> +
> +    /* socketpair(AF_UNIX)? */
> +
> +    if (qemu_pipe(fds) == -1) {
> +        perror("qemu_pipe");
> +        abort();
> +    }
> +    state.from_umemd_fd = fds[0];
> +    umemd.to_qemu_fd = fds[1];
> +
> +    if (qemu_pipe(fds) == -1) {
> +        perror("qemu_pipe");
> +        abort();
> +    }
> +    umemd.from_qemu_fd = fds[0];
> +    state.to_umemd_fd = fds[1];
> +
> +    pid_t child = fork();
> +    if (child < 0) {
> +        perror("fork");
> +        abort();
> +    }
> +
> +    if (child == 0) {
> +        int mig_write_fd;
> +
> +        fd_close(&state.to_umemd_fd);
> +        fd_close(&state.from_umemd_fd);
> +        umemd.host_page_size = state.host_page_size;
> +        umemd.host_page_shift = state.host_page_shift;
> +
> +        umemd.nr_host_pages_per_target_page =
> +            TARGET_PAGE_SIZE / umemd.host_page_size;
> +        umemd.nr_target_pages_per_host_page =
> +            umemd.host_page_size / TARGET_PAGE_SIZE;
> +
> +        umemd.target_to_host_page_shift =
> +            ffs(umemd.nr_host_pages_per_target_page) - 1;
> +        umemd.host_to_target_page_shift =
> +            ffs(umemd.nr_target_pages_per_host_page) - 1;
> +
> +        umemd.state = 0;
> +        umemd.version_id = state.version_id;
> +        umemd.mig_read_fd = mig_read_fd;
> +        umemd.mig_read = mig_read;
> +
> +        mig_write_fd = dup(mig_read_fd);
> +        if (mig_write_fd < 0) {
> +            perror("could not dup for writable socket");
> +            abort();
> +        }
> +        umemd.mig_write_fd = mig_write_fd;
> +        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
> +
> +        postcopy_incoming_umemd(); /* noreturn */
> +    }
> +
> +    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
> +    fd_close(&umemd.to_qemu_fd);
> +    fd_close(&umemd.from_qemu_fd);
> +    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
> +    state.faulted_pages->nr = 0;
> +
> +    /* close all UMem.shmem_fd */
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        umem_close_shmem(block->umem);
> +    }
> +    umem_qemu_wait_for_daemon(state.from_umemd_fd);
> +}
> +
> +static void postcopy_incoming_qemu_recv_quit(void)
> +{
> +    RAMBlock *block;
> +    if (state.state & PIS_STATE_QUIT_RECEIVED) {
> +        return;
> +    }
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        if (block->umem != NULL) {
> +            umem_destroy(block->umem);
> +            block->umem = NULL;
> +            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> +        }
> +    }
> +
> +    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
> +    state.state |= PIS_STATE_QUIT_RECEIVED;
> +    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
> +    qemu_fclose(state.from_umemd);
> +    state.from_umemd = NULL;
> +    fd_close(&state.from_umemd_fd);
> +}
> +
> +static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
> +{
> +    assert(state.to_umemd != NULL);
> +
> +    nonblock_fflush(state.to_umemd);
> +    if (nonblock_pending_size(state.to_umemd) > 0) {
> +        return;
> +    }
> +
> +    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
> +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> +        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
> +        state.state |= PIS_STATE_QUIT_SENT;
> +        qemu_fclose(state.to_umemd->file);
> +        state.to_umemd = NULL;
> +        fd_close(&state.to_umemd_fd);
> +        g_free(state.faulted_pages);
> +        state.faulted_pages = NULL;
> +    }
> +}
> +
> +static void postcopy_incoming_qemu_fflush_to_umemd(void)
> +{
> +    qemu_set_fd_handler(state.to_umemd->fd, NULL,
> +                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
> +    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
> +}
> +
> +static void postcopy_incoming_qemu_queue_quit(void)
> +{
> +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +
> +    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
> +    umem_qemu_quit(state.to_umemd->file);
> +    state.state |= PIS_STATE_QUIT_QUEUED;
> +}
> +
> +static void postcopy_incoming_qemu_send_pages_present(void)
> +{
> +    if (state.faulted_pages->nr > 0) {
> +        umem_qemu_send_pages_present(state.to_umemd->file,
> +                                     state.faulted_pages);
> +        state.faulted_pages->nr = 0;
> +    }
> +}
> +
> +static void postcopy_incoming_qemu_faulted_pages(
> +    const struct umem_pages *pages)
> +{
> +    assert(pages->nr <= MAX_FAULTED_PAGES);
> +    assert(state.faulted_pages != NULL);
> +
> +    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
> +        postcopy_incoming_qemu_send_pages_present();
> +    }
> +    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
> +           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
> +    state.faulted_pages->nr += pages->nr;
> +}
> +
> +static void postcopy_incoming_qemu_cleanup_umem(void);
> +
> +static int postcopy_incoming_qemu_handle_req_one(void)
> +{
> +    int offset = 0;
> +    int ret;
> +    uint8_t cmd;
> +
> +    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
> +    offset += sizeof(cmd);
> +    if (ret != sizeof(cmd)) {
> +        return -EAGAIN;
> +    }
> +    DPRINTF("cmd %c\n", cmd);
> +
> +    switch (cmd) {
> +    case UMEM_DAEMON_QUIT:
> +        postcopy_incoming_qemu_recv_quit();
> +        postcopy_incoming_qemu_queue_quit();
> +        postcopy_incoming_qemu_cleanup_umem();
> +        break;
> +    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
> +        struct umem_pages *pages =
> +            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
> +            postcopy_incoming_qemu_faulted_pages(pages);
> +            g_free(pages);
> +        }
> +        break;
> +    }
> +    case UMEM_DAEMON_ERROR:
> +        /* the umem daemon hit trouble, so it asked us to stop VM execution */
> +        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
> +        break;
> +    default:
> +        abort();
> +        break;
> +    }
> +
> +    if (state.from_umemd != NULL) {
> +        qemu_file_skip(state.from_umemd, offset);
> +    }
> +    return 0;
> +}
> +
> +static void postcopy_incoming_qemu_handle_req(void *opaque)
> +{
> +    do {
> +        int ret = postcopy_incoming_qemu_handle_req_one();
> +        if (ret == -EAGAIN) {
> +            break;
> +        }
> +    } while (state.from_umemd != NULL &&
> +             qemu_pending_size(state.from_umemd) > 0);
> +
> +    if (state.to_umemd != NULL) {
> +        if (state.faulted_pages->nr > 0) {
> +            postcopy_incoming_qemu_send_pages_present();
> +        }
> +        postcopy_incoming_qemu_fflush_to_umemd();
> +    }
> +}
> +
> +void postcopy_incoming_qemu_ready(void)
> +{
> +    umem_qemu_ready(state.to_umemd_fd);
> +
> +    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
> +    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
> +    qemu_set_fd_handler(state.from_umemd_fd,
> +                        postcopy_incoming_qemu_handle_req, NULL, NULL);
> +}
> +
> +static void postcopy_incoming_qemu_cleanup_umem(void)
> +{
> +    /* When qemu quits before postcopy completes, tell the umem daemon
> +       to tear down the umem device and exit. */
> +    if (state.to_umemd_fd >= 0) {
> +        postcopy_incoming_qemu_queue_quit();
> +        postcopy_incoming_qemu_fflush_to_umemd();
> +    }
> +
> +    if (state.dev) {
> +        umem_dev_destroy(state.dev);
> +        state.dev = NULL;
> +    }
> +}
> +
> +void postcopy_incoming_qemu_cleanup(void)
> +{
> +    postcopy_incoming_qemu_cleanup_umem();
> +    if (state.to_umemd != NULL) {
> +        nonblock_wait_for_flush(state.to_umemd);
> +    }
> +}
> +
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> +{
> +    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
> +    size_t len = umem_pages_size(nr);
> +    ram_addr_t end = addr + size;
> +    struct umem_pages *pages;
> +    int i;
> +
> +    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +    pages = g_malloc(len);
> +    pages->nr = nr;
> +    for (i = 0; addr < end; addr += state.host_page_size, i++) {
> +        pages->pgoffs[i] = addr >> state.host_page_shift;
> +    }
> +    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
> +    g_free(pages);
> +    assert(state.to_umemd != NULL);
> +    postcopy_incoming_qemu_fflush_to_umemd();
> +}
> +
> +/**************************************************************************
> + * incoming umem daemon
> + */
> +
> +static void postcopy_incoming_umem_recv_quit(void)
> +{
> +    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
> +        return;
> +    }
> +    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
> +    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
> +    qemu_fclose(umemd.from_qemu);
> +    umemd.from_qemu = NULL;
> +    fd_close(&umemd.from_qemu_fd);
> +}
> +
> +static void postcopy_incoming_umem_queue_quit(void)
> +{
> +    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
> +        return;
> +    }
> +    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
> +    umem_daemon_quit(umemd.to_qemu->file);
> +    umemd.state |= UMEM_STATE_QUIT_QUEUED;
> +}
> +
> +static void postcopy_incoming_umem_send_eoc_req(void)
> +{
> +    struct qemu_umem_req req;
> +
> +    if (umemd.state & UMEM_STATE_EOC_SENT) {
> +        return;
> +    }
> +
> +    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
> +    req.cmd = QEMU_UMEM_REQ_EOC;
> +    postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +    umemd.state |= UMEM_STATE_EOC_SENT;
> +    qemu_fclose(umemd.mig_write->file);
> +    umemd.mig_write = NULL;
> +    fd_close(&umemd.mig_write_fd);
> +}
> +
> +static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
> +{
> +    struct qemu_umem_req req;
> +    int bit;
> +    uint64_t target_pgoff;
> +    int i;
> +
> +    umemd.page_request.nr = MAX_REQUESTS;
> +    umem_get_page_request(block->umem, &umemd.page_request);
> +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> +            block->idstr, umemd.page_request.nr,
> +            (uint64_t)umemd.page_request.pgoffs[0],
> +            (uint64_t)umemd.page_request.pgoffs[1]);
> +
> +    if (umemd.last_block_write != block) {
> +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
> +        req.idstr = block->idstr;
> +    } else {
> +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> +    }
> +
> +    req.nr = 0;
> +    req.pgoffs = umemd.target_pgoffs;
> +    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> +        for (i = 0; i < umemd.page_request.nr; i++) {
> +            target_pgoff =
> +                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
> +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> +
> +            if (!test_and_set_bit(bit, umemd.phys_requested)) {
> +                req.pgoffs[req.nr] = target_pgoff;
> +                req.nr++;
> +            }
> +        }
> +    } else {
> +        for (i = 0; i < umemd.page_request.nr; i++) {
> +            int j;
> +            target_pgoff =
> +                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
> +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> +
> +            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
> +                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
> +                    req.pgoffs[req.nr] = target_pgoff + j;
> +                    req.nr++;
> +                }
> +            }
> +        }
> +    }
> +
> +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> +            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
> +    if (req.nr > 0 && umemd.mig_write != NULL) {
> +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +        umemd.last_block_write = block;
> +    }
> +}
> +
> +static void postcopy_incoming_umem_send_pages_present(void)
> +{
> +    if (umemd.present_request->nr > 0) {
> +        umem_daemon_send_pages_present(umemd.to_qemu->file,
> +                                       umemd.present_request);
> +        umemd.present_request->nr = 0;
> +    }
> +}
> +
> +static void postcopy_incoming_umem_pages_present_one(
> +    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
> +{
> +    uint32_t i;
> +    assert(nr <= MAX_PRESENT_REQUESTS);
> +
> +    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
> +        postcopy_incoming_umem_send_pages_present();
> +    }
> +
> +    for (i = 0; i < nr; i++) {
> +        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
> +            pgoffs[i] + ramblock_pgoffset;
> +    }
> +    umemd.present_request->nr += nr;
> +}
> +
> +static void postcopy_incoming_umem_pages_present(
> +    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
> +{
> +    uint32_t left = page_cached->nr;
> +    uint32_t offset = 0;
> +
> +    while (left > 0) {
> +        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
> +        postcopy_incoming_umem_pages_present_one(
> +            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
> +
> +        left -= nr;
> +        offset += nr;
> +    }
> +}
> +
> +static int postcopy_incoming_umem_ram_load(void)
> +{
> +    ram_addr_t offset;
> +    int flags;
> +    int error;
> +    void *shmem;
> +    int i;
> +    int bit;
> +
> +    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
> +        return -EINVAL;
> +    }
> +
> +    offset = qemu_get_be64(umemd.mig_read);
> +
> +    flags = offset & ~TARGET_PAGE_MASK;
> +    offset &= TARGET_PAGE_MASK;
> +
> +    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
> +
> +    if (flags & RAM_SAVE_FLAG_EOS) {
> +        DPRINTF("RAM_SAVE_FLAG_EOS\n");
> +        postcopy_incoming_umem_send_eoc_req();
> +
> +        qemu_fclose(umemd.mig_read);
> +        umemd.mig_read = NULL;
> +        fd_close(&umemd.mig_read_fd);
> +        umemd.state |= UMEM_STATE_EOS_RECEIVED;
> +
> +        postcopy_incoming_umem_queue_quit();
> +        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
> +        return 0;
> +    }
> +
> +    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
> +                                             &umemd.last_block_read);
> +    if (!shmem) {
> +        DPRINTF("shmem == NULL\n");
> +        return -EINVAL;
> +    }
> +
> +    if (flags & RAM_SAVE_FLAG_COMPRESS) {
> +        uint8_t ch = qemu_get_byte(umemd.mig_read);
> +        memset(shmem, ch, TARGET_PAGE_SIZE);
> +    } else if (flags & RAM_SAVE_FLAG_PAGE) {
> +        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
> +    }
> +
> +    error = qemu_file_get_error(umemd.mig_read);
> +    if (error) {
> +        DPRINTF("error %d\n", error);
> +        return error;
> +    }
> +
> +    umemd.page_cached.nr = 0;
> +    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
> +    if (!test_and_set_bit(bit, umemd.phys_received)) {
> +        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> +            __u64 pgoff = offset >> umemd.host_page_shift;
> +            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
> +                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
> +                umemd.page_cached.nr++;
> +            }
> +        } else {
> +            bool mark_cache = true;
> +            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
> +                if (!test_bit(bit + i, umemd.phys_received)) {
> +                    mark_cache = false;
> +                    break;
> +                }
> +            }
> +            if (mark_cache) {
> +                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
> +                umemd.page_cached.nr = 1;
> +            }
> +        }
> +    }
> +
> +    if (umemd.page_cached.nr > 0) {
> +        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
> +
> +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
> +            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
> +            uint64_t ramblock_pgoffset;
> +
> +            ramblock_pgoffset =
> +                umemd.last_block_read->offset >> umemd.host_page_shift;
> +            postcopy_incoming_umem_pages_present(&umemd.page_cached,
> +                                                 ramblock_pgoffset);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static bool postcopy_incoming_umem_check_umem_done(void)
> +{
> +    bool all_done = true;
> +    RAMBlock *block;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        UMem *umem = block->umem;
> +        if (umem != NULL && umem->nsets == umem->nbits) {
> +            umem_unmap_shmem(umem);
> +            umem_destroy(umem);
> +            block->umem = NULL;
> +        }
> +        if (block->umem != NULL) {
> +            all_done = false;
> +        }
> +    }
> +    return all_done;
> +}
> +
> +static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
> +{
> +    int i;
> +
> +    for (i = 0; i < pages->nr; i++) {
> +        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
> +        RAMBlock *block = qemu_get_ram_block(addr);
> +        addr -= block->offset;
> +        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
> +    }
> +    return postcopy_incoming_umem_check_umem_done();
> +}
> +
> +static bool
> +postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
> +{
> +    RAMBlock *block;
> +    ram_addr_t addr;
> +    int i;
> +
> +    struct qemu_umem_req req = {
> +        .cmd = QEMU_UMEM_REQ_REMOVE,
> +        .nr = 0,
> +        .pgoffs = (uint64_t *)pages->pgoffs,
> +    };
> +
> +    addr = pages->pgoffs[0] << umemd.host_page_shift;
> +    block = qemu_get_ram_block(addr);
> +
> +    for (i = 0; i < pages->nr; i++) {
> +        int pgoff;
> +
> +        addr = pages->pgoffs[i] << umemd.host_page_shift;
> +        pgoff = addr >> TARGET_PAGE_BITS;
> +        if (!test_bit(pgoff, umemd.phys_received) &&
> +            !test_bit(pgoff, umemd.phys_requested)) {
> +            req.pgoffs[req.nr] = pgoff;
> +            req.nr++;
> +        }
> +        set_bit(pgoff, umemd.phys_received);
> +        set_bit(pgoff, umemd.phys_requested);
> +
> +        umem_remove_shmem(block->umem,
> +                          addr - block->offset, umemd.host_page_size);
> +    }
> +    if (req.nr > 0 && umemd.mig_write != NULL) {
> +        req.idstr = block->idstr;
> +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> +    }
> +
> +    return postcopy_incoming_umem_check_umem_done();
> +}
> +
> +static void postcopy_incoming_umem_done(void)
> +{
> +    postcopy_incoming_umem_send_eoc_req();
> +    postcopy_incoming_umem_queue_quit();
> +}
> +
> +static int postcopy_incoming_umem_handle_qemu(void)
> +{
> +    int ret;
> +    int offset = 0;
> +    uint8_t cmd;
> +
> +    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
> +    offset += sizeof(cmd);
> +    if (ret != sizeof(cmd)) {
> +        return -EAGAIN;
> +    }
> +    DPRINTF("cmd %c\n", cmd);
> +    switch (cmd) {
> +    case UMEM_QEMU_QUIT:
> +        postcopy_incoming_umem_recv_quit();
> +        postcopy_incoming_umem_done();
> +        break;
> +    case UMEM_QEMU_PAGE_FAULTED: {
> +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> +                                                   &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (postcopy_incoming_umem_page_faulted(pages)) {
> +            postcopy_incoming_umem_done();
> +        }
> +        g_free(pages);
> +        break;
> +    }
> +    case UMEM_QEMU_PAGE_UNMAPPED: {
> +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> +                                                   &offset);
> +        if (pages == NULL) {
> +            return -EAGAIN;
> +        }
> +        if (postcopy_incoming_umem_page_unmapped(pages)) {
> +            postcopy_incoming_umem_done();
> +        }
> +        g_free(pages);
> +        break;
> +    }
> +    default:
> +        abort();
> +        break;
> +    }
> +    if (umemd.from_qemu != NULL) {
> +        qemu_file_skip(umemd.from_qemu, offset);
> +    }
> +    return 0;
> +}
> +
> +static void set_fd(int fd, fd_set *fds, int *nfds)
> +{
> +    FD_SET(fd, fds);
> +    if (fd > *nfds) {
> +        *nfds = fd;
> +    }
> +}
> +
> +static int postcopy_incoming_umemd_main_loop(void)
> +{
> +    fd_set writefds;
> +    fd_set readfds;
> +    int nfds;
> +    RAMBlock *block;
> +    int ret;
> +
> +    int pending_size;
> +    bool get_page_request;
> +
> +    nfds = -1;
> +    FD_ZERO(&writefds);
> +    FD_ZERO(&readfds);
> +
> +    if (umemd.mig_write != NULL) {
> +        pending_size = nonblock_pending_size(umemd.mig_write);
> +        if (pending_size > 0) {
> +            set_fd(umemd.mig_write_fd, &writefds, &nfds);
> +        }
> +    } else {
> +        pending_size = 0;
> +    }
> +
> +#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
> +    /* If page requests to the migration source have accumulated,
> +       suspend taking new page fault requests. */
> +    get_page_request = (pending_size <= PENDING_SIZE_MAX);
> +
> +    if (get_page_request) {
> +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> +            if (block->umem != NULL) {
> +                set_fd(block->umem->fd, &readfds, &nfds);
> +            }
> +        }
> +    }
> +
> +    if (umemd.mig_read_fd >= 0) {
> +        set_fd(umemd.mig_read_fd, &readfds, &nfds);
> +    }
> +
> +    if (umemd.to_qemu != NULL &&
> +        nonblock_pending_size(umemd.to_qemu) > 0) {
> +        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
> +    }
> +    if (umemd.from_qemu_fd >= 0) {
> +        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
> +    }
> +
> +    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
> +    if (ret == -1) {
> +        if (errno == EINTR) {
> +            return 0;
> +        }
> +        return ret;
> +    }
> +
> +    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
> +        nonblock_fflush(umemd.mig_write);
> +    }
> +    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
> +        nonblock_fflush(umemd.to_qemu);
> +    }
> +    if (get_page_request) {
> +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> +            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
> +                postcopy_incoming_umem_send_page_req(block);
> +            }
> +        }
> +    }
> +    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
> +        do {
> +            ret = postcopy_incoming_umem_ram_load();
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        } while (umemd.mig_read != NULL &&
> +                 qemu_pending_size(umemd.mig_read) > 0);
> +    }
> +    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
> +        do {
> +            ret = postcopy_incoming_umem_handle_qemu();
> +            if (ret == -EAGAIN) {
> +                break;
> +            }
> +        } while (umemd.from_qemu != NULL &&
> +                 qemu_pending_size(umemd.from_qemu) > 0);
> +    }
> +
> +    if (umemd.mig_write != NULL) {
> +        nonblock_fflush(umemd.mig_write);
> +    }
> +    if (umemd.to_qemu != NULL) {
> +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
> +            postcopy_incoming_umem_send_pages_present();
> +        }
> +        nonblock_fflush(umemd.to_qemu);
> +        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
> +            nonblock_pending_size(umemd.to_qemu) == 0) {
> +            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
> +            qemu_fclose(umemd.to_qemu->file);
> +            umemd.to_qemu = NULL;
> +            fd_close(&umemd.to_qemu_fd);
> +            umemd.state |= UMEM_STATE_QUIT_SENT;
> +        }
> +    }
> +
> +    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
> +}
> +
> +static void postcopy_incoming_umemd(void)
> +{
> +    ram_addr_t last_ram_offset;
> +    int nbits;
> +    RAMBlock *block;
> +    int ret;
> +
> +    qemu_daemon(1, 1);
> +    signal(SIGPIPE, SIG_IGN);
> +    DPRINTF("daemon pid: %d\n", getpid());
> +
> +    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
> +    umemd.page_cached.pgoffs =
> +        g_new(__u64, MAX_REQUESTS *
> +              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
> +               1 : umemd.nr_host_pages_per_target_page));
> +    umemd.target_pgoffs =
> +        g_new(uint64_t, MAX_REQUESTS *
> +              MAX(umemd.nr_host_pages_per_target_page,
> +                  umemd.nr_target_pages_per_host_page));
> +    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
> +    umemd.present_request->nr = 0;
> +
> +    last_ram_offset = qemu_last_ram_offset();
> +    nbits = last_ram_offset >> TARGET_PAGE_BITS;
> +    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> +    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> +    umemd.last_block_read = NULL;
> +    umemd.last_block_write = NULL;
> +
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> +        UMem *umem = block->umem;
> +        umem->umem = NULL;      /* the umem mapping area has the VM_DONTCOPY
> +                                   flag, so those mappings were lost at fork */
> +        block->host = umem_map_shmem(umem);
> +        umem_close_shmem(umem);
> +    }
> +    umem_daemon_ready(umemd.to_qemu_fd);
> +    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
> +
> +    /* wait for qemu to disown migration_fd */
> +    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
> +    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
> +
> +    DPRINTF("entering umemd main loop\n");
> +    for (;;) {
> +        ret = postcopy_incoming_umemd_main_loop();
> +        if (ret != 0) {
> +            break;
> +        }
> +    }
> +    DPRINTF("exiting umemd main loop\n");
> +
> +    /* This daemon was forked from qemu, and the parent qemu is still
> +     * running. Cleanup handlers of linked libraries like SDL must not be
> +     * triggered, otherwise the parent qemu may use resources that were
> +     * already freed.
> +     */
> +    fflush(stdout);
> +    fflush(stderr);
> +    _exit(ret < 0 ? EXIT_FAILURE : 0);
> +}
> diff --git a/migration-tcp.c b/migration-tcp.c
> index cf6a9b8..aa35050 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
>      } while (ret == -1 && (socket_error()) == EINTR);
>  
>      if (ret < 0) {
> -        migrate_fd_error(s);
> -        return;
> +        goto error_out;
>      }
>  
>      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>  
> -    if (val == 0)
> +    if (val == 0) {
> +        ret = postcopy_outgoing_create_read_socket(s);
> +        if (ret < 0) {
> +            goto error_out;
> +        }
>          migrate_fd_connect(s);
> -    else {
> +    } else {
>          DPRINTF("error connecting %d\n", val);
> -        migrate_fd_error(s);
> +        goto error_out;
>      }
> +    return;
> +
> +error_out:
> +    migrate_fd_error(s);
>  }
>  
>  int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> @@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
>  
>      if (ret < 0) {
>          DPRINTF("connect failed\n");
> -        migrate_fd_error(s);
> -        return ret;
> +        goto error_out;
> +    }
> +
> +    ret = postcopy_outgoing_create_read_socket(s);
> +    if (ret < 0) {
> +        goto error_out;
>      }
>      migrate_fd_connect(s);
>      return 0;
> +
> +error_out:
> +    migrate_fd_error(s);
> +    return ret;
>  }
>  
>  static void tcp_accept_incoming_migration(void *opaque)
> @@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
>      }
>  
>      process_incoming_migration(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(c, f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        /* The socket is now disowned, so tell the umem daemon
> +           that it's safe to use it */
> +        postcopy_incoming_qemu_ready();
> +    }
>  out:
>      close(c);
>  out2:
> diff --git a/migration-unix.c b/migration-unix.c
> index dfcf203..3707505 100644
> --- a/migration-unix.c
> +++ b/migration-unix.c
> @@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
>  
>      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>  
> -    if (val == 0)
> +    if (val == 0) {
> +        ret = postcopy_outgoing_create_read_socket(s);
> +        if (ret < 0) {
> +            goto error_out;
> +        }
>          migrate_fd_connect(s);
> -    else {
> +    } else {
>          DPRINTF("error connecting %d\n", val);
> -        migrate_fd_error(s);
> +        goto error_out;
>      }
> +    return;
> +
> +error_out:
> +    migrate_fd_error(s);
>  }
>  
>  int unix_start_outgoing_migration(MigrationState *s, const char *path)
> @@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
>  
>      if (ret < 0) {
>          DPRINTF("connect failed\n");
> -        migrate_fd_error(s);
> -        return ret;
> +        goto error_out;
> +    }
> +
> +    ret = postcopy_outgoing_create_read_socket(s);
> +    if (ret < 0) {
> +        goto error_out;
>      }
>      migrate_fd_connect(s);
>      return 0;
> +
> +error_out:
> +    migrate_fd_error(s);
> +    return ret;
>  }
>  
>  static void unix_accept_incoming_migration(void *opaque)
> @@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
>      }
>  
>      process_incoming_migration(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_fork_umemd(c, f);
> +    }
>      qemu_fclose(f);
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_ready();
> +    }
>  out:
>      close(c);
>  out2:
> diff --git a/migration.c b/migration.c
> index 0149ab3..51efe44 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -39,6 +39,11 @@ enum {
>      MIG_STATE_COMPLETED,
>  };
>  
> +enum {
> +    MIG_SUBSTATE_PRECOPY,
> +    MIG_SUBSTATE_POSTCOPY,
> +};
> +
>  #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
>  
>  static NotifierList migration_state_notifiers =
> @@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
>          return;
>      }
>  
> +    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
> +        /* PRINTF("postcopy background\n"); */
> +        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
> +                                                    s->postcopy);
> +        if (ret > 0) {
> +            migrate_fd_completed(s);
> +        } else if (ret < 0) {
> +            migrate_fd_error(s);
> +        }
> +        return;
> +    }
> +
>      DPRINTF("iterate\n");
>      ret = qemu_savevm_state_iterate(s->mon, s->file);
>      if (ret < 0) {
> @@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
>          DPRINTF("done iterating\n");
>          vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>  
> +        if (s->params.postcopy) {
> +            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> +                migrate_fd_error(s);
> +                if (old_vm_running) {
> +                    vm_start();
> +                }
> +                return;
> +            }
> +            s->substate = MIG_SUBSTATE_POSTCOPY;
> +            s->postcopy = postcopy_outgoing_begin(s);
> +            return;
> +        }
> +
>          if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
>              migrate_fd_error(s);
>          } else {
> @@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
>      int ret;
>  
>      s->state = MIG_STATE_ACTIVE;
> +    s->substate = MIG_SUBSTATE_PRECOPY;
>      s->file = qemu_fopen_ops_buffered(s,
>                                        s->bandwidth_limit,
>                                        migrate_fd_put_buffer,
> diff --git a/migration.h b/migration.h
> index 90ae362..2809e99 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -40,6 +40,12 @@ struct MigrationState
>      int (*write)(MigrationState *s, const void *buff, size_t size);
>      void *opaque;
>      MigrationParams params;
> +
> +    /* for postcopy */
> +    int substate;              /* precopy or postcopy */
> +    int fd_read;
> +    QEMUFile *file_read;        /* connection from the destination */
> +    void *postcopy;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> @@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
>  uint64_t ram_bytes_transferred(void);
>  uint64_t ram_bytes_total(void);
>  
> +void ram_save_set_params(const MigrationParams *params, void *opaque);
>  void sort_ram_list(void);
>  int ram_save_block(QEMUFile *f);
>  void ram_save_memory_set_dirty(void);
> @@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
>   */
>  void migrate_del_blocker(Error *reason);
>  
> +/* For outgoing postcopy */
> +int postcopy_outgoing_create_read_socket(MigrationState *s);
> +int postcopy_outgoing_ram_save_live(Monitor *mon,
> +                                    QEMUFile *f, int stage, void *opaque);
> +void *postcopy_outgoing_begin(MigrationState *s);
> +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> +                                          void *postcopy);
> +
> +/* For incoming postcopy */
>  extern bool incoming_postcopy;
>  extern unsigned long incoming_postcopy_flags;
>  
> +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
> +void postcopy_incoming_ram_alloc(const char *name,
> +                                 size_t size, uint8_t **hostp, UMem **umemp);
> +void postcopy_incoming_ram_free(UMem *umem);
> +void postcopy_incoming_prepare(void);
> +
> +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
> +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
> +void postcopy_incoming_qemu_ready(void);
> +void postcopy_incoming_qemu_cleanup(void);
> +#ifdef NEED_CPU_H
> +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
> +#endif
> +
>  #endif
> diff --git a/qemu-common.h b/qemu-common.h
> index 725922b..d74a8c9 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
>  
>  struct Monitor;
>  typedef struct Monitor Monitor;
> +typedef struct UMem UMem;
>  
>  /* we put basic includes here to avoid repeating them in device drivers */
>  #include <stdlib.h>
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 5c5b8f3..19e20f9 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
>      "-postcopy-flags unsigned-int(flags)\n"
>      "	                flags for postcopy incoming migration\n"
>      "                   when -incoming and -postcopy are specified.\n"
> -    "                   This is for benchmark/debug purpose (default: 0)\n",
> +    "                   This is for benchmark/debug purpose (default: 0)\n"
> +    "                   Currently supported flags are\n"
> +    "                   1: enable fault request from umemd to qemu\n"
> +    "                      (default: disabled)\n",
>      QEMU_ARCH_ALL)
>  STEXI
>  @item -postcopy-flags int

Can you move umem.c and umem.h to a separate patch, please? This patch
is quite large as it is.
> diff --git a/umem.c b/umem.c
> new file mode 100644
> index 0000000..b7be006
> --- /dev/null
> +++ b/umem.c
> @@ -0,0 +1,379 @@
> +/*
> + * umem.c: user process backed memory module for postcopy livemigration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +
> +#include <linux/umem.h>
> +
> +#include "bitops.h"
> +#include "sysemu.h"
> +#include "hw/hw.h"
> +#include "umem.h"
> +
> +//#define DEBUG_UMEM
> +#ifdef DEBUG_UMEM
> +#include <sys/syscall.h>
> +#define DPRINTF(format, ...)                                            \
> +    do {                                                                \
> +        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
> +               __func__, __LINE__, ## __VA_ARGS__);                     \
> +    } while (0)
> +#else
> +#define DPRINTF(format, ...)    do { } while (0)
> +#endif
> +
> +#define DEV_UMEM        "/dev/umem"
> +
> +struct UMemDev {
> +    int fd;
> +    int page_shift;
> +};
> +
> +UMemDev *umem_dev_new(void)
> +{
> +    UMemDev *umem_dev;
> +    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
> +    if (umem_dev_fd < 0) {
> +        perror("can't open "DEV_UMEM);
> +        abort();
> +    }
> +
> +    umem_dev = g_new(UMemDev, 1);
> +    umem_dev->fd = umem_dev_fd;
> +    umem_dev->page_shift = ffs(getpagesize()) - 1;
> +    return umem_dev;
> +}
> +
> +void umem_dev_destroy(UMemDev *dev)
> +{
> +    close(dev->fd);
> +    g_free(dev);
> +}
> +
> +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
> +{
> +    struct umem_create create = {
> +        .size = size,
> +        .async_req_max = 0,
> +        .sync_req_max = 0,
> +    };
> +    UMem *umem;
> +
> +    snprintf(create.name.id, sizeof(create.name.id),
> +             "pid-%"PRId64, (uint64_t)getpid());
> +    create.name.id[UMEM_ID_MAX - 1] = 0;
> +    strncpy(create.name.name, name, sizeof(create.name.name));
> +    create.name.name[UMEM_NAME_MAX - 1] = 0;
> +
> +    assert((size % getpagesize()) == 0);
> +    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
> +        perror("UMEM_DEV_CREATE_UMEM");
> +        abort();
> +    }
> +    if (ftruncate(create.shmem_fd, create.size) < 0) {
> +        perror("truncate(\"shmem_fd\")");
> +        abort();
> +    }
> +
> +    umem = g_new(UMem, 1);
> +    umem->nbits = 0;
> +    umem->nsets = 0;
> +    umem->faulted = NULL;
> +    umem->page_shift = dev->page_shift;
> +    umem->fd = create.umem_fd;
> +    umem->shmem_fd = create.shmem_fd;
> +    umem->size = create.size;
> +    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
> +                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (umem->umem == MAP_FAILED) {
> +        perror("mmap(UMem) failed");
> +        abort();
> +    }
> +    return umem;
> +}
> +
> +void umem_mmap(UMem *umem)
> +{
> +    void *ret = mmap(umem->umem, umem->size,
> +                     PROT_EXEC | PROT_READ | PROT_WRITE,
> +                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
> +    if (ret == MAP_FAILED) {
> +        perror("umem_mmap(UMem) failed");
> +        abort();
> +    }
> +}
> +
> +void umem_destroy(UMem *umem)
> +{
> +    if (umem->fd != -1) {
> +        close(umem->fd);
> +    }
> +    if (umem->shmem_fd != -1) {
> +        close(umem->shmem_fd);
> +    }
> +    g_free(umem->faulted);
> +    g_free(umem);
> +}
> +
> +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
> +{
> +    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
> +        perror("daemon: UMEM_GET_PAGE_REQUEST");
> +        abort();
> +    }
> +}
> +
> +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
> +{
> +    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
> +        perror("daemon: UMEM_MARK_PAGE_CACHED");
> +        abort();
> +    }
> +}
> +
> +void umem_unmap(UMem *umem)
> +{
> +    munmap(umem->umem, umem->size);
> +    umem->umem = NULL;
> +}
> +
> +void umem_close(UMem *umem)
> +{
> +    close(umem->fd);
> +    umem->fd = -1;
> +}
> +
> +void *umem_map_shmem(UMem *umem)
> +{
> +    umem->nbits = umem->size >> umem->page_shift;
> +    umem->nsets = 0;
> +    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
> +
> +    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
> +                       umem->shmem_fd, 0);
> +    if (umem->shmem == MAP_FAILED) {
> +        perror("daemon: mmap(\"shmem\")");
> +        abort();
> +    }
> +    return umem->shmem;
> +}
> +
> +void umem_unmap_shmem(UMem *umem)
> +{
> +    munmap(umem->shmem, umem->size);
> +    umem->shmem = NULL;
> +}
> +
> +void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
> +{
> +    int s = offset >> umem->page_shift;
> +    int e = (offset + size) >> umem->page_shift;
> +    int i;
> +
> +    for (i = s; i < e; i++) {
> +        if (!test_and_set_bit(i, umem->faulted)) {
> +            umem->nsets++;
> +#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
> +            madvise(umem->shmem + offset, size, MADV_REMOVE);
> +#endif
> +        }
> +    }
> +}
> +
> +void umem_close_shmem(UMem *umem)
> +{
> +    close(umem->shmem_fd);
> +    umem->shmem_fd = -1;
> +}
> +
> +/***************************************************************************/
> +/* qemu <-> umem daemon communication */
> +
> +size_t umem_pages_size(uint64_t nr)
> +{
> +    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
> +}
> +
> +static void umem_write_cmd(int fd, uint8_t cmd)
> +{
> +    DPRINTF("write cmd %c\n", cmd);
> +
> +    for (;;) {
> +        ssize_t ret = write(fd, &cmd, 1);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == EPIPE) {
> +                perror("pipe");
> +                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
> +                        cmd, ret, errno);
> +                break;
> +            }
> +
> +            perror("pipe");
> +            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
> +            abort();
> +        }
> +
> +        break;
> +    }
> +}
> +
> +static void umem_read_cmd(int fd, uint8_t expect)
> +{
> +    uint8_t cmd;
> +    for (;;) {
> +        ssize_t ret = read(fd, &cmd, 1);
> +        if (ret == -1) {
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +            perror("pipe");
> +            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
> +            abort();
> +        }
> +
> +        if (ret == 0) {
> +            DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
> +            abort();
> +        }
> +
> +        break;
> +    }
> +
> +    DPRINTF("read cmd %c\n", cmd);
> +    if (cmd != expect) {
> +        DPRINTF("cmd %c expect %d\n", cmd, expect);
> +        abort();
> +    }
> +}
> +
> +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
> +{
> +    int ret;
> +    uint64_t nr;
> +    size_t size;
> +    struct umem_pages *pages;
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
> +    *offset += sizeof(nr);
> +    DPRINTF("ret %d nr %ld\n", ret, nr);
> +    if (ret != sizeof(nr) || nr == 0) {
> +        return NULL;
> +    }
> +
> +    size = umem_pages_size(nr);
> +    pages = g_malloc(size);
> +    pages->nr = nr;
> +    size -= sizeof(pages->nr);
> +
> +    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
> +    *offset += size;
> +    if (ret != size) {
> +        g_free(pages);
> +        return NULL;
> +    }
> +    return pages;
> +}
> +
> +static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
> +{
> +    size_t len = umem_pages_size(pages->nr);
> +    qemu_put_buffer(f, (const uint8_t*)pages, len);
> +}
> +
> +/* umem daemon -> qemu */
> +void umem_daemon_ready(int to_qemu_fd)
> +{
> +    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
> +}
> +
> +void umem_daemon_quit(QEMUFile *to_qemu)
> +{
> +    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
> +}
> +
> +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> +                                    struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
> +    umem_send_pages(to_qemu, pages);
> +}
> +
> +void umem_daemon_wait_for_qemu(int from_qemu_fd)
> +{
> +    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
> +}
> +
> +/* qemu -> umem daemon */
> +void umem_qemu_wait_for_daemon(int from_umemd_fd)
> +{
> +    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
> +}
> +
> +void umem_qemu_ready(int to_umemd_fd)
> +{
> +    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
> +}
> +
> +void umem_qemu_quit(QEMUFile *to_umemd)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
> +}
> +
> +/* qemu side handler */
> +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> +                                                int *offset)
> +{
> +    uint64_t i;
> +    int page_shift = ffs(getpagesize()) - 1;
> +    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
> +    if (pages == NULL) {
> +        return NULL;
> +    }
> +
> +    for (i = 0; i < pages->nr; i++) {
> +        ram_addr_t addr = pages->pgoffs[i] << page_shift;
> +
> +        /* make pages present by forcibly triggering page fault. */
> +        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
> +        uint8_t dummy_read = ram[0];
> +        (void)dummy_read;   /* suppress unused variable warning */
> +    }
> +
> +    return pages;
> +}
> +
> +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> +                                  const struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
> +    umem_send_pages(to_umemd, pages);
> +}
> +
> +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> +                                   const struct umem_pages *pages)
> +{
> +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
> +    umem_send_pages(to_umemd, pages);
> +}
> diff --git a/umem.h b/umem.h
> new file mode 100644
> index 0000000..5ca19ef
> --- /dev/null
> +++ b/umem.h
> @@ -0,0 +1,105 @@
> +/*
> + * umem.h: user process backed memory module for postcopy livemigration
> + *
> + * Copyright (c) 2011
> + * National Institute of Advanced Industrial Science and Technology
> + *
> + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> + * Author: Isaku Yamahata <yamahata at valinux co jp>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef QEMU_UMEM_H
> +#define QEMU_UMEM_H
> +
> +#include <linux/umem.h>
> +
> +#include "qemu-common.h"
> +
> +typedef struct UMemDev UMemDev;
> +
> +struct UMem {
> +    void *umem;
> +    int fd;
> +    void *shmem;
> +    int shmem_fd;
> +    uint64_t size;
> +
> +    /* indexed by host page size */
> +    int page_shift;
> +    int nbits;
> +    int nsets;
> +    unsigned long *faulted;
> +};
> +
> +UMemDev *umem_dev_new(void);
> +void umem_dev_destroy(UMemDev *dev);
> +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
> +void umem_mmap(UMem *umem);
> +
> +void umem_destroy(UMem *umem);
> +
> +/* umem device operations */
> +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
> +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
> +void umem_unmap(UMem *umem);
> +void umem_close(UMem *umem);
> +
> +/* umem shmem operations */
> +void *umem_map_shmem(UMem *umem);
> +void umem_unmap_shmem(UMem *umem);
> +void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
> +void umem_close_shmem(UMem *umem);
> +
> +/* qemu on source <-> umem daemon communication */
> +
> +struct umem_pages {
> +    uint64_t nr;        /* nr = 0 means completed */
> +    uint64_t pgoffs[0];
> +};
> +
> +/* daemon -> qemu */
> +#define UMEM_DAEMON_READY               'R'
> +#define UMEM_DAEMON_QUIT                'Q'
> +#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
> +#define UMEM_DAEMON_ERROR               'E'
> +
> +/* qemu -> daemon */
> +#define UMEM_QEMU_READY                 'r'
> +#define UMEM_QEMU_QUIT                  'q'
> +#define UMEM_QEMU_PAGE_FAULTED          't'
> +#define UMEM_QEMU_PAGE_UNMAPPED         'u'
> +
> +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
> +size_t umem_pages_size(uint64_t nr);
> +
> +/* for umem daemon */
> +void umem_daemon_ready(int to_qemu_fd);
> +void umem_daemon_wait_for_qemu(int from_qemu_fd);
> +void umem_daemon_quit(QEMUFile *to_qemu);
> +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> +                                    struct umem_pages *pages);
> +
> +/* for qemu */
> +void umem_qemu_wait_for_daemon(int from_umemd_fd);
> +void umem_qemu_ready(int to_umemd_fd);
> +void umem_qemu_quit(QEMUFile *to_umemd);
> +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> +                                                int *offset);
> +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> +                                  const struct umem_pages *pages);
> +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> +                                   const struct umem_pages *pages);
> +
> +#endif /* QEMU_UMEM_H */
> diff --git a/vl.c b/vl.c
> index 5430b8c..17427a0 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
>      default_drive(default_sdcard, snapshot, machine->use_scsi,
>                    IF_SD, 0, SD_OPTS);
>  
> -    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
> -                         ram_save_live, NULL, ram_load, NULL);
> +    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
> +        exit(1);
> +    }
> +    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
> +                         ram_save_set_params, ram_save_live, NULL,
> +                         ram_load, NULL);
>  
>      if (nb_numa_nodes > 0) {
>          int i;
> @@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
>  
>      if (incoming) {
>          runstate_set(RUN_STATE_INMIGRATE);
> +        if (incoming_postcopy) {
> +            postcopy_incoming_prepare();
> +        }

How about moving postcopy_incoming_prepare() into qemu_start_incoming_migration()?

>          int ret = qemu_start_incoming_migration(incoming);
>          if (ret < 0) {
>              fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
> @@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
>      bdrv_close_all();
>      pause_all_vcpus();
>      net_cleanup();
> +    if (incoming_postcopy) {
> +        postcopy_incoming_qemu_cleanup();
> +    }
>      res_free();
>  
>      return 0;

Orit

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 21/21] postcopy: implement postcopy livemigration
  2011-12-29  1:26   ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29 16:06     ` Avi Kivity
  -1 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2011-12-29 16:06 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This patch implements postcopy livemigration.
>
>  
> +/* RAM is allocated via umem for postcopy incoming mode */
> +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> +
>  typedef struct RAMBlock {
>      uint8_t *host;
>      ram_addr_t offset;
> @@ -485,6 +488,10 @@ typedef struct RAMBlock {
>  #if defined(__linux__) && !defined(TARGET_S390X)
>      int fd;
>  #endif
> +
> +#ifdef CONFIG_POSTCOPY
> +    UMem *umem;    /* for incoming postcopy mode */
> +#endif
>  } RAMBlock;

Is it possible to implement this via the MemoryListener API (which
replaces CPUPhysMemoryClient)?  This is how kvm, vhost, and xen manage
their memory tables.

>  

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 00/21][RFC] postcopy live migration
  2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
@ 2011-12-29 22:39   ` Anthony Liguori
  -1 siblings, 0 replies; 88+ messages in thread
From: Anthony Liguori @ 2011-12-29 22:39 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, Juan Quintela, t.hirofuchi, satoshi.itoh, Michael Roth, qemu-devel

On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
> Intro
> =====
> This patch series implements postcopy live migration.[1]
> As discussed at KVM forum 2011, dedicated character device is used for
> distributed shared memory between migration source and destination.
> Now we can discuss/benchmark/compare with precopy. I believe there is
> much room for improvement.
>
> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>
>
> Usage
> =====
> You need to load the umem character device on the host before starting
> migration. Postcopy can be used with the tcg and kvm accelerators. The
> implementation depends only on the Linux umem character device, but the
> driver-dependent code is split into its own file.
> I tested only the host page size == guest page size case, but the
> implementation allows host page size != guest page size as well.
>
> The following options are added with this patch series.
> - incoming part
>    command line options
>    -postcopy [-postcopy-flags<flags>]
>    where flags is for changing behavior for benchmark/debugging
>    Currently the following flags are available
>    0: default
>    1: enable touching page request
>
>    example:
>    qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>
> - outgoing part
>    options for migrate command
>    migrate [-p [-n]] URI
>    -p: indicate postcopy migration
>    -n: disable background transferring pages: This is for benchmark/debugging
>
>    example:
>    migrate -p -n tcp:<dest ip address>:4444
>
>
> TODO
> ====
> - benchmark/evaluation. Especially how async page fault affects the result.

I'll review this series next week (Mike/Juan, please also review when you can).

But we really need to think hard about whether this is the right thing to take 
into the tree.  I worry a lot about the fact that we don't test pre-copy 
migration nearly enough and adding a second form just introduces more things to 
test.

It's also not clear to me why post-copy is better.  If you were going to sit 
down and explain to someone building a management tool when they should use 
pre-copy and when they should use post-copy, what would you tell them?

Regards,

Anthony Liguori



* Re: [PATCH 00/21][RFC] postcopy live migration
  2011-12-29 22:39   ` [Qemu-devel] " Anthony Liguori
@ 2012-01-01  9:43     ` Orit Wasserman
  -1 siblings, 0 replies; 88+ messages in thread
From: Orit Wasserman @ 2012-01-01  9:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: kvm, satoshi.itoh, t.hirofuchi, Juan Quintela, Michael Roth,
	qemu-devel, Isaku Yamahata

On 12/30/2011 12:39 AM, Anthony Liguori wrote:
> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>> Intro
>> =====
>> This patch series implements postcopy live migration.[1]
>> As discussed at KVM forum 2011, dedicated character device is used for
>> distributed shared memory between migration source and destination.
>> Now we can discuss/benchmark/compare with precopy. I believe there are
>> much rooms for improvement.
>>
>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>
>>
>> Usage
>> =====
>> You need load umem character device on the host before starting migration.
>> Postcopy can be used for tcg and kvm accelarator. The implementation depend
>> on only linux umem character device. But the driver dependent code is split
>> into a file.
>> I tested only host page size == guest page size case, but the implementation
>> allows host page size != guest page size case.
>>
>> The following options are added with this patch series.
>> - incoming part
>>    command line options
>>    -postcopy [-postcopy-flags<flags>]
>>    where flags is for changing behavior for benchmark/debugging
>>    Currently the following flags are available
>>    0: default
>>    1: enable touching page request
>>
>>    example:
>>    qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>
>> - outging part
>>    options for migrate command
>>    migrate [-p [-n]] URI
>>    -p: indicate postcopy migration
>>    -n: disable background transferring pages: This is for benchmark/debugging
>>
>>    example:
>>    migrate -p -n tcp:<dest ip address>:4444
>>
>>
>> TODO
>> ====
>> - benchmark/evaluation. Especially how async page fault affects the result.
> 
> I'll review this series next week (Mike/Juan, please also review when you can).
> 
> But we really need to think hard about whether this is the right thing to take into the tree.  I worry a lot about the fact that we don't test pre-copy migration nearly enough and adding a second form just introduces more things to test.
>
> It's also not clear to me why post-copy is better.  If you were going to sit down and explain to someone building a management tool when they should use pre-copy and when they should use post-copy, what would you tell them?

Start with pre-copy; if it doesn't converge, switch to post-copy.

Orit
> 
> Regards,
> 
> Anthony Liguori
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/21][RFC] postcopy live migration
  2011-12-29 22:39   ` [Qemu-devel] " Anthony Liguori
@ 2012-01-01  9:52     ` Dor Laor
  -1 siblings, 0 replies; 88+ messages in thread
From: Dor Laor @ 2012-01-01  9:52 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: kvm, satoshi.itoh, t.hirofuchi, Juan Quintela, Michael Roth,
	qemu-devel, Isaku Yamahata, Umesh Deshpande

On 12/30/2011 12:39 AM, Anthony Liguori wrote:
> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>> Intro
>> =====
>> This patch series implements postcopy live migration.[1]
>> As discussed at KVM forum 2011, dedicated character device is used for
>> distributed shared memory between migration source and destination.
>> Now we can discuss/benchmark/compare with precopy. I believe there are
>> much rooms for improvement.
>>
>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>
>>
>> Usage
>> =====
>> You need load umem character device on the host before starting
>> migration.
>> Postcopy can be used for tcg and kvm accelarator. The implementation
>> depend
>> on only linux umem character device. But the driver dependent code is
>> split
>> into a file.
>> I tested only host page size == guest page size case, but the
>> implementation
>> allows host page size != guest page size case.
>>
>> The following options are added with this patch series.
>> - incoming part
>> command line options
>> -postcopy [-postcopy-flags<flags>]
>> where flags is for changing behavior for benchmark/debugging
>> Currently the following flags are available
>> 0: default
>> 1: enable touching page request
>>
>> example:
>> qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>
>> - outging part
>> options for migrate command
>> migrate [-p [-n]] URI
>> -p: indicate postcopy migration
>> -n: disable background transferring pages: This is for
>> benchmark/debugging
>>
>> example:
>> migrate -p -n tcp:<dest ip address>:4444
>>
>>
>> TODO
>> ====
>> - benchmark/evaluation. Especially how async page fault affects the
>> result.
>
> I'll review this series next week (Mike/Juan, please also review when
> you can).
>
> But we really need to think hard about whether this is the right thing
> to take into the tree. I worry a lot about the fact that we don't test
> pre-copy migration nearly enough and adding a second form just
> introduces more things to test.

It is an issue, but it can't be a merge criterion; Isaku is not to blame 
for the lack of pre-copy live migration testing.

I would say that 90% of live migration problems are not related to the 
pre/post stage but are issues with device model save state, so post-copy 
shouldn't add a significant regression here.

It would probably be good to ask every migration patch author to also 
write a unit test for migration.

> It's also not clear to me why post-copy is better. If you were going to
> sit down and explain to someone building a management tool when they
> should use pre-copy and when they should use post-copy, what would you
> tell them?

Today we have a default max-downtime of 100ms.
If either the guest working set size or the host networking throughput 
can't match that downtime, migration won't end.
The mgmt user's options are:
  - increase the downtime further and further, up to an actual stop
  - fail the migration

W/ post-copy there is another option.
Performance measurements will teach us (probably prior to commit) when 
this stage is valuable. Most likely we had better try pre-copy first, 
and if we can't meet the downtime we can optionally use post-copy.
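
That fallback rule (try pre-copy, switch to post-copy when the dirty
rate outruns the link) can be sketched as a toy convergence model.
Everything below is illustrative only: the function name, the units
(pages and pages/second), and the single-pass dirty-rate estimate are
assumptions for the sketch, not QEMU code.

```python
def choose_strategy(ram_pages, dirty_rate, bandwidth, max_downtime,
                    max_iters=30):
    """Toy model of a mgmt tool's pre-copy convergence check.

    ram_pages: guest RAM size, dirty_rate/bandwidth: pages per second,
    max_downtime: seconds.  Returns the phase that finishes migration.
    """
    remaining = ram_pages
    for _ in range(max_iters):
        # Expected stop-and-copy downtime if we froze the guest now.
        if remaining / bandwidth <= max_downtime:
            return "precopy"
        # One pre-copy pass: send the remaining pages while the guest
        # keeps dirtying memory at dirty_rate.
        pass_time = remaining / bandwidth
        remaining = min(ram_pages, dirty_rate * pass_time)
    # The dirty rate outpaces the link: fall back to post-copy.
    return "postcopy"
```

With a working set that dirties faster than the link can send, the loop
never shrinks `remaining` and the model falls back to post-copy.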

Here's a paper by Umesh (the migration thread writer):
http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf

Regards,
Dor

>
> Regards,
>
> Anthony Liguori
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
  2012-01-01  9:43     ` [Qemu-devel] " Orit Wasserman
@ 2012-01-01 16:27       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 88+ messages in thread
From: Stefan Hajnoczi @ 2012-01-01 16:27 UTC (permalink / raw)
  To: Orit Wasserman
  Cc: Anthony Liguori, kvm, satoshi.itoh, t.hirofuchi, Juan Quintela,
	Michael Roth, qemu-devel, Isaku Yamahata

On Sun, Jan 1, 2012 at 9:43 AM, Orit Wasserman <owasserm@redhat.com> wrote:
> On 12/30/2011 12:39 AM, Anthony Liguori wrote:
>> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>>> Intro
>>> =====
>>> This patch series implements postcopy live migration.[1]
>>> As discussed at KVM forum 2011, dedicated character device is used for
>>> distributed shared memory between migration source and destination.
>>> Now we can discuss/benchmark/compare with precopy. I believe there are
>>> much rooms for improvement.
>>>
>>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>>
>>>
>>> Usage
>>> =====
>>> You need load umem character device on the host before starting migration.
>>> Postcopy can be used for tcg and kvm accelarator. The implementation depend
>>> on only linux umem character device. But the driver dependent code is split
>>> into a file.
>>> I tested only host page size == guest page size case, but the implementation
>>> allows host page size != guest page size case.
>>>
>>> The following options are added with this patch series.
>>> - incoming part
>>>    command line options
>>>    -postcopy [-postcopy-flags<flags>]
>>>    where flags is for changing behavior for benchmark/debugging
>>>    Currently the following flags are available
>>>    0: default
>>>    1: enable touching page request
>>>
>>>    example:
>>>    qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>>
>>> - outging part
>>>    options for migrate command
>>>    migrate [-p [-n]] URI
>>>    -p: indicate postcopy migration
>>>    -n: disable background transferring pages: This is for benchmark/debugging
>>>
>>>    example:
>>>    migrate -p -n tcp:<dest ip address>:4444
>>>
>>>
>>> TODO
>>> ====
>>> - benchmark/evaluation. Especially how async page fault affects the result.
>>
>> I'll review this series next week (Mike/Juan, please also review when you can).
>>
>> But we really need to think hard about whether this is the right thing to take into the tree.  I worry a lot about the fact that we don't test pre-copy migration nearly enough and adding a second form just introduces more things to test.
>>
>> It's also not clear to me why post-copy is better.  If you were going to sit down and explain to someone building a management tool when they should use pre-copy and when they should use post-copy, what would you tell them?
>
> Start with pre-copy , if it doesn't converge switch to post-copy

Post-copy throttles the guest when page faults are encountered because
the destination machine waits for memory pages from the source
machine.  Is there a reason this page fault-based throttling cannot be
done on the source machine with pre-copy migration?  I'm not sure
post-copy provides new behavior in terms of convergence, we could do
the same with pre-copy migration.

Post-copy has other advantages though: it immediately frees logical
CPUs on the source machine (though RAM and network bandwidth are still
required until migration completes).
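
As a toy illustration of how demand-fetched pages and the background
stream interleave on the destination (the background stream is what the
-n flag disables), here is a hedged sketch; the function and its
page-ordering model are hypothetical and do not reflect the umem
driver's actual interface.

```python
import collections

def serve_postcopy(faulting_order, background_order, total_pages):
    """Toy model of destination-side post-copy page delivery.

    Pages the guest faults on are fetched with priority; a background
    stream fills in the rest.  Purely illustrative.
    """
    received = set()
    log = []
    bg = collections.deque(background_order)
    for fault_page in faulting_order:
        if fault_page not in received:      # demand-fetch on fault
            received.add(fault_page)
            log.append(("demand", fault_page))
        while bg and bg[0] in received:     # skip already-present pages
            bg.popleft()
        if bg:                              # one background page per step
            p = bg.popleft()
            received.add(p)
            log.append(("background", p))
    # Drain the background stream once the guest stops faulting.
    while bg:
        p = bg.popleft()
        if p not in received:
            received.add(p)
            log.append(("background", p))
    assert received == set(range(total_pages))
    return log
```

Faulted pages jump the queue, so a guest touching a small hot set is
unblocked quickly even while cold pages trickle in behind it.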

Stefan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
  2012-01-01 16:27       ` Stefan Hajnoczi
@ 2012-01-02  9:28         ` Dor Laor
  -1 siblings, 0 replies; 88+ messages in thread
From: Dor Laor @ 2012-01-02  9:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Orit Wasserman, kvm, Juan Quintela, t.hirofuchi, satoshi.itoh,
	Michael Roth, qemu-devel, Isaku Yamahata

On 01/01/2012 06:27 PM, Stefan Hajnoczi wrote:
> On Sun, Jan 1, 2012 at 9:43 AM, Orit Wasserman<owasserm@redhat.com>  wrote:
>> On 12/30/2011 12:39 AM, Anthony Liguori wrote:
>>> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>>>> Intro
>>>> =====
>>>> This patch series implements postcopy live migration.[1]
>>>> As discussed at KVM forum 2011, dedicated character device is used for
>>>> distributed shared memory between migration source and destination.
>>>> Now we can discuss/benchmark/compare with precopy. I believe there are
>>>> much rooms for improvement.
>>>>
>>>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>>>
>>>>
>>>> Usage
>>>> =====
>>>> You need load umem character device on the host before starting migration.
>>>> Postcopy can be used for tcg and kvm accelarator. The implementation depend
>>>> on only linux umem character device. But the driver dependent code is split
>>>> into a file.
>>>> I tested only host page size == guest page size case, but the implementation
>>>> allows host page size != guest page size case.
>>>>
>>>> The following options are added with this patch series.
>>>> - incoming part
>>>>     command line options
>>>>     -postcopy [-postcopy-flags<flags>]
>>>>     where flags is for changing behavior for benchmark/debugging
>>>>     Currently the following flags are available
>>>>     0: default
>>>>     1: enable touching page request
>>>>
>>>>     example:
>>>>     qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>>>
>>>> - outging part
>>>>     options for migrate command
>>>>     migrate [-p [-n]] URI
>>>>     -p: indicate postcopy migration
>>>>     -n: disable background transferring pages: This is for benchmark/debugging
>>>>
>>>>     example:
>>>>     migrate -p -n tcp:<dest ip address>:4444
>>>>
>>>>
>>>> TODO
>>>> ====
>>>> - benchmark/evaluation. Especially how async page fault affects the result.
>>>
>>> I'll review this series next week (Mike/Juan, please also review when you can).
>>>
>>> But we really need to think hard about whether this is the right thing to take into the tree.  I worry a lot about the fact that we don't test pre-copy migration nearly enough and adding a second form just introduces more things to test.
>>>
>>> It's also not clear to me why post-copy is better.  If you were going to sit down and explain to someone building a management tool when they should use pre-copy and when they should use post-copy, what would you tell them?
>>
>> Start with pre-copy , if it doesn't converge switch to post-copy
>
> Post-copy throttles the guest when page faults are encountered because
> the destination machine waits for memory pages from the source
> machine.  Is there a reason this page fault-based throttling cannot be
> done on the source machine with pre-copy migration?  I'm not sure
> post-copy provides new behavior in terms of convergence, we could do
> the same with pre-copy migration.

There are differences between these two approaches:
1. post-copy allows progress to vcpus that are not faulting at the
    moment.

    Assuming a subset of the guest vcpus can execute freely w/ their
    memory already at the destination, they can get 100% cpu time.
    The slowing-down approach on the source host slows down all vcpus.

2. Different page access pattern
    post-copy uses on-demand paging, so the pages that are really
    required get transferred. The slow-down approach can only guess
    which page to send first.

>
> Post-copy has other advantages though, it immediately frees logical
> CPUs on the source machine (though RAM and network bandwidth is still
> required until migration completes).

W/ post-copy you can immediately free any page that has been transferred 
to the destination.

At the end of the day, it's performance testing using various scenarios 
that can educate us on whether post-copy is worth the extra complexity 
over slowing down the guest on the source.

Cheers,
Dor

>
> Stefan
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-02  9:28         ` Dor Laor
@ 2012-01-02 17:22           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 88+ messages in thread
From: Stefan Hajnoczi @ 2012-01-02 17:22 UTC (permalink / raw)
  To: dlaor
  Cc: kvm, satoshi.itoh, t.hirofuchi, Juan Quintela, Michael Roth,
	qemu-devel, Orit Wasserman, Isaku Yamahata

On Mon, Jan 2, 2012 at 9:28 AM, Dor Laor <dlaor@redhat.com> wrote:
> At the end of the day, it's performance testing using various scenarios that
> can educate us whether post-copy worth the extra complexity over slowing
> down the guest on the source.

True.  It's certainly an interesting patch series to benchmark and evaluate.

Stefan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-01  9:52     ` [Qemu-devel] " Dor Laor
@ 2012-01-04  1:30       ` Takuya Yoshikawa
  -1 siblings, 0 replies; 88+ messages in thread
From: Takuya Yoshikawa @ 2012-01-04  1:30 UTC (permalink / raw)
  To: dlaor
  Cc: Anthony Liguori, Isaku Yamahata, kvm, Juan Quintela, t.hirofuchi,
	satoshi.itoh, Michael Roth, qemu-devel, Umesh Deshpande

(2012/01/01 18:52), Dor Laor wrote:
>> But we really need to think hard about whether this is the right thing
>> to take into the tree. I worry a lot about the fact that we don't test
>> pre-copy migration nearly enough and adding a second form just
>> introduces more things to test.
>
> It is an issue but it can't be a merge criteria, Isaku is not blame of pre copy live migration lack of testing.
>
> I would say that 90% of issues of live migration problems are not related to the pre|post stage but more of issues of device model save state. So post-copy shouldn't add a significant regression here.

Though they may be only 10%, the remaining issues tend to be hard to find.

>
> Probably it will be good to ask every migration patch writer to write an additional unit test for migration.
>
>> It's also not clear to me why post-copy is better. If you were going to
>> sit down and explain to someone building a management tool when they
>> should use pre-copy and when they should use post-copy, what would you
>> tell them?
>
> Today, we have a default of max-downtime of 100ms.
> If either the guest working set size or the host networking throughput can't match the downtime, migration won't end.
> The mgmt user options are:
> - increase the downtime more and more, up to an actual stop
> - fail migrate
>
> W/ post-copy there is another option.
> Performance measurements will teach us (probably prior to commit) when this stage is valuable. Most likely, we'd better try pre-copy first, and if we can't meet the downtime we can optionally use post-copy.

It is difficult to recommend to users a mix of two methods which have different
requirements:

	post-copy cannot be canceled and probably needs a dedicated/reliable
	link to make sure that guests will not be broken during the copy stage.

What we want, from the user's point of view, is clear/simple criteria:

	what is needed for post-copy
	for which services we should select post-copy


	Takuya

>
> Here's a paper by Umesh (the migration thread writer):
> http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
>
> Regards,
> Dor
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 21/21] postcopy: implement postcopy livemigration
  2011-12-29 16:06     ` [Qemu-devel] " Avi Kivity
@ 2012-01-04  3:29       ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-04  3:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 06:06:10PM +0200, Avi Kivity wrote:
> On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > This patch implements postcopy livemigration.
> >
> >  
> > +/* RAM is allocated via umem for postcopy incoming mode */
> > +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> > +
> >  typedef struct RAMBlock {
> >      uint8_t *host;
> >      ram_addr_t offset;
> > @@ -485,6 +488,10 @@ typedef struct RAMBlock {
> >  #if defined(__linux__) && !defined(TARGET_S390X)
> >      int fd;
> >  #endif
> > +
> > +#ifdef CONFIG_POSTCOPY
> > +    UMem *umem;    /* for incoming postcopy mode */
> > +#endif
> >  } RAMBlock;
> 
> Is it possible to implement this via the MemoryListener API (which
> replaces CPUPhysMemoryClient)?  This is how kvm, vhost, and xen manage
> their memory tables.

I'm afraid not. The three you listed above are for the outgoing part,
but this case is for the incoming part, whose requirements are quite
different. What is needed is:
- getting the corresponding RAMBlock and UMem from (id, idlen)
- hooking ram_alloc/ram_free (or the corresponding RAM API)

thanks,
-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 21/21] postcopy: implement postcopy livemigration
  2011-12-29 15:51     ` Orit Wasserman
@ 2012-01-04  3:34       ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-04  3:34 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 05:51:36PM +0200, Orit Wasserman wrote:
> Hi,

Thank you for review.

> A general comment: this patch is a bit too long, which makes it hard to review.
> Can you split it please?

Will do. Maybe split it into the umem.[hc] part, the incoming part, and the outgoing part.


> On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > This patch implements postcopy livemigration.
> > 
> > Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
> > ---
> >  Makefile.target           |    4 +
> >  arch_init.c               |   26 +-
> >  cpu-all.h                 |    7 +
> >  exec.c                    |   20 +-
> >  migration-exec.c          |    8 +
> >  migration-fd.c            |   30 +
> >  migration-postcopy-stub.c |   77 ++
> >  migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
> >  migration-tcp.c           |   37 +-
> >  migration-unix.c          |   32 +-
> >  migration.c               |   31 +
> >  migration.h               |   30 +
> >  qemu-common.h             |    1 +
> >  qemu-options.hx           |    5 +-
> >  umem.c                    |  379 +++++++++
> >  umem.h                    |  105 +++
> >  vl.c                      |   14 +-
> >  17 files changed, 2677 insertions(+), 20 deletions(-)
> >  create mode 100644 migration-postcopy-stub.c
> >  create mode 100644 migration-postcopy.c
> >  create mode 100644 umem.c
> >  create mode 100644 umem.h
> > 
> > diff --git a/Makefile.target b/Makefile.target
> > index 3261383..d94c53f 100644
> > --- a/Makefile.target
> > +++ b/Makefile.target
> > @@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
> >  CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
> >  CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
> >  CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
> > +CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
> >  
> >  include ../config-host.mak
> >  include config-devices.mak
> > @@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
> >  obj-y += memory.o
> >  LIBS+=-lz
> >  
> > +common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
> > +common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
> > +
> >  QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
> >  QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
> >  QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
> > diff --git a/arch_init.c b/arch_init.c
> > index bc53092..8b3130d 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
> >      return 1;
> >  }
> >  
> > +static bool outgoing_postcopy = false;
> > +
> > +void ram_save_set_params(const MigrationParams *params, void *opaque)
> > +{
> > +    outgoing_postcopy = params->postcopy;
> > +}
> > +
> >  static RAMBlock *last_block_sent = NULL;
> >  
> >  int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
> > @@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >      uint64_t expected_time = 0;
> >      int ret;
> >  
> > +    if (stage == 1) {
> > +        last_block_sent = NULL;
> > +
> > +        bytes_transferred = 0;
> > +        last_block = NULL;
> > +        last_offset = 0;
> 
> Changing of line order + new empty line
> 
> > +    }
> > +    if (outgoing_postcopy) {
> > +        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
> > +    }
> > +
> 
> I would just do:
> 
> unregister_savevm_live and then register_savevm_live(..., postcopy_outgoing_ram_save_live, ...)
> when starting outgoing postcopy migration.
> 
> >      if (stage < 0) {
> >          cpu_physical_memory_set_dirty_tracking(0);
> >          return 0;
> > @@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >      }
> >  
> >      if (stage == 1) {
> > -        bytes_transferred = 0;
> > -        last_block_sent = NULL;
> > -        last_block = NULL;
> > -        last_offset = 0;
> >          sort_ram_list();
> >  
> >          /* Make sure all dirty bits are set */
> > @@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      int flags;
> >      int error;
> >  
> > +    if (incoming_postcopy) {
> > +        return postcopy_incoming_ram_load(f, opaque, version_id);
> > +    }
> > +
> Why not call register_savevm_live(..., postcopy_incoming_ram_load, ...) when starting the guest with postcopy_incoming?
> 
> >      if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
> >          return -EINVAL;
> >      }
> > diff --git a/cpu-all.h b/cpu-all.h
> > index 0244f7a..2e9d8a7 100644
> > --- a/cpu-all.h
> > +++ b/cpu-all.h
> > @@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
> >  /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
> >  #define RAM_PREALLOC_MASK   (1 << 0)
> >  
> > +/* RAM is allocated via umem for postcopy incoming mode */
> > +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> > +
> >  typedef struct RAMBlock {
> >      uint8_t *host;
> >      ram_addr_t offset;
> > @@ -485,6 +488,10 @@ typedef struct RAMBlock {
> >  #if defined(__linux__) && !defined(TARGET_S390X)
> >      int fd;
> >  #endif
> > +
> > +#ifdef CONFIG_POSTCOPY
> > +    UMem *umem;    /* for incoming postcopy mode */
> > +#endif
> >  } RAMBlock;
> >  
> >  typedef struct RAMList {
> > diff --git a/exec.c b/exec.c
> > index c8c6692..90b0491 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -35,6 +35,7 @@
> >  #include "qemu-timer.h"
> >  #include "memory.h"
> >  #include "exec-memory.h"
> > +#include "migration.h"
> >  #if defined(CONFIG_USER_ONLY)
> >  #include <qemu.h>
> >  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> > @@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
> >          new_block->host = host;
> >          new_block->flags |= RAM_PREALLOC_MASK;
> >      } else {
> > +#ifdef CONFIG_POSTCOPY
> > +        if (incoming_postcopy) {
> > +            postcopy_incoming_ram_alloc(name, size,
> > +                                        &new_block->host, &new_block->umem);
> > +            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
> > +        } else
> > +#endif
> >          if (mem_path) {
> >  #if defined (__linux__) && !defined(TARGET_S390X)
> >              new_block->host = file_ram_alloc(new_block, size, mem_path);
> > @@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
> >              QLIST_REMOVE(block, next);
> >              if (block->flags & RAM_PREALLOC_MASK) {
> >                  ;
> > -            } else if (mem_path) {
> > +            }
> > +#ifdef CONFIG_POSTCOPY
> > +            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> > +                postcopy_incoming_ram_free(block->umem);
> > +            }
> > +#endif
> > +            else if (mem_path) {
> >  #if defined (__linux__) && !defined(TARGET_S390X)
> >                  if (block->fd) {
> >                      munmap(block->host, block->length);
> > @@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
> >              } else {
> >                  flags = MAP_FIXED;
> >                  munmap(vaddr, length);
> > +                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> > +                    postcopy_incoming_qemu_pages_unmapped(addr, length);
> > +                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> > +                }
> >                  if (mem_path) {
> >  #if defined(__linux__) && !defined(TARGET_S390X)
> >                      if (block->fd) {
> > diff --git a/migration-exec.c b/migration-exec.c
> > index e14552e..2bd0c3b 100644
> > --- a/migration-exec.c
> > +++ b/migration-exec.c
> > @@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
> >  {
> >      FILE *f;
> >  
> > +    if (s->params.postcopy) {
> > +        return -ENOSYS;
> > +    }
> > +
> >      f = popen(command, "w");
> >      if (f == NULL) {
> >          DPRINTF("Unable to popen exec target\n");
> > @@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
> >  {
> >      QEMUFile *f;
> >  
> > +    if (incoming_postcopy) {
> > +        return -ENOSYS;
> > +    }
> > +
> >      DPRINTF("Attempting to start an incoming migration\n");
> >      f = qemu_popen_cmd(command, "r");
> >      if(f == NULL) {
> > diff --git a/migration-fd.c b/migration-fd.c
> > index 6211124..5a62ab9 100644
> > --- a/migration-fd.c
> > +++ b/migration-fd.c
> > @@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
> >      s->write = fd_write;
> >      s->close = fd_close;
> >  
> > +    if (s->params.postcopy) {
> > +        int flags = fcntl(s->fd, F_GETFL);
> > +        if ((flags & O_ACCMODE) != O_RDWR) {
> > +            goto err_after_open;
> > +        }
> > +
> > +        s->fd_read = dup(s->fd);
> > +        if (s->fd_read == -1) {
> > +            goto err_after_open;
> > +        }
> > +        s->file_read = qemu_fdopen(s->fd_read, "r");
> > +        if (s->file_read == NULL) {
> > +            close(s->fd_read);
> > +            goto err_after_open;
> > +        }
> > +    }
> > +
> >      migrate_fd_connect(s);
> >      return 0;
> >  
> > @@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
> >  
> >      process_incoming_migration(f);
> >      qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_ready();
> > +    }
> > +    return;
> >  }
> >  
> >  int fd_start_incoming_migration(const char *infd)
> > @@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
> >      DPRINTF("Attempting to start an incoming migration via fd\n");
> >  
> >      fd = strtol(infd, NULL, 0);
> > +    if (incoming_postcopy) {
> > +        int flags = fcntl(fd, F_GETFL);
> > +        if ((flags & O_ACCMODE) != O_RDWR) {
> > +            return -EINVAL;
> > +        }
> > +    }
> >      f = qemu_fdopen(fd, "rb");
> >      if(f == NULL) {
> >          DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
> > diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
> > new file mode 100644
> > index 0000000..0b78de7
> > --- /dev/null
> > +++ b/migration-postcopy-stub.c
> > @@ -0,0 +1,77 @@
> > +/*
> > + * migration-postcopy-stub.c: postcopy livemigration
> > + *                            stub functions for non-supported hosts
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "sysemu.h"
> > +#include "migration.h"
> > +
> > +int postcopy_outgoing_create_read_socket(MigrationState *s)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void *postcopy_outgoing_begin(MigrationState *ms)
> > +{
> > +    return NULL;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void postcopy_incoming_prepare(void)
> > +{
> > +}
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_ready(void)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_cleanup(void)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> > +{
> > +}
> > diff --git a/migration-postcopy.c b/migration-postcopy.c
> > new file mode 100644
> > index 0000000..ed0d574
> > --- /dev/null
> > +++ b/migration-postcopy.c
> > @@ -0,0 +1,1891 @@
> > +/*
> > + * migration-postcopy.c: postcopy livemigration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "bitmap.h"
> > +#include "sysemu.h"
> > +#include "hw/hw.h"
> > +#include "arch_init.h"
> > +#include "migration.h"
> > +#include "umem.h"
> > +
> > +#include "memory.h"
> > +#define WANT_EXEC_OBSOLETE
> > +#include "exec-obsolete.h"
> > +
> > +//#define DEBUG_POSTCOPY
> > +#ifdef DEBUG_POSTCOPY
> > +#include <sys/syscall.h>
> > +#define DPRINTF(fmt, ...)                                               \
> > +    do {                                                                \
> > +        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
> > +               __func__, __LINE__, ## __VA_ARGS__);                     \
> > +    } while (0)
> > +#else
> > +#define DPRINTF(fmt, ...)       do { } while (0)
> > +#endif
> > +
> > +#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
> > +
> > +static void fd_close(int *fd)
> > +{
> > +    if (*fd >= 0) {
> > +        close(*fd);
> > +        *fd = -1;
> > +    }
> > +}
> > +
> > +/***************************************************************************
> > + * QEMUFile for non blocking pipe
> > + */
> > +
> > +/* read only */
> > +struct QEMUFilePipe {
> > +    int fd;
> > +    QEMUFile *file;
> > +};
> 
> Why not use QEMUFileSocket ?

Okay, will rename it to QEMUFile_FD (or whatever) and share the struct.


> > +typedef struct QEMUFilePipe QEMUFilePipe;
> > +
> > +static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> > +{
> > +    QEMUFilePipe *s = opaque;
> > +    ssize_t len = 0;
> > +
> > +    while (size > 0) {
> > +        ssize_t ret = read(s->fd, buf, size);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            }
> > +            if (len == 0) {
> > +                len = -errno;
> > +            }
> > +            break;
> > +        }
> > +
> > +        if (ret == 0) {
> > +            /* the write end of the pipe is closed */
> > +            break;
> > +        }
> > +        len += ret;
> > +        buf += ret;
> > +        size -= ret;
> > +    }
> > +
> > +    return len;
> > +}
> > +
> > +static int pipe_close(void *opaque)
> > +{
> > +    QEMUFilePipe *s = opaque;
> > +    g_free(s);
> > +    return 0;
> > +}
> > +
> > +static QEMUFile *qemu_fopen_pipe(int fd)
> > +{
> > +    QEMUFilePipe *s = g_malloc0(sizeof(*s));
> > +
> > +    s->fd = fd;
> > +    fcntl_setfl(fd, O_NONBLOCK);
> > +    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
> > +                             NULL, NULL, NULL);
> > +    return s->file;
> > +}
> > +
> > +/* write only */
> > +struct QEMUFileNonblock {
> > +    int fd;
> > +    QEMUFile *file;
> > +
> > +    /* for pipe-write nonblocking mode */
> > +#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
> > +    uint8_t *buffer;
> > +    size_t buffer_size;
> > +    size_t buffer_capacity;
> > +    bool freeze_output;
> > +};
> > +typedef struct QEMUFileNonblock QEMUFileNonblock;
> > +
> 
> Couldn't you use QEMUFileBuffered ?

QEMUFileBuffered can be built on top of QEMUFileNonblock.
I'll refactor buffered_file.c


> > +static void nonblock_flush_buffer(QEMUFileNonblock *s)
> > +{
> > +    size_t offset = 0;
> > +    ssize_t ret;
> > +
> > +    while (offset < s->buffer_size) {
> > +        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EAGAIN) {
> > +                s->freeze_output = true;
> > +            } else {
> > +                qemu_file_set_error(s->file, errno);
> > +            }
> > +            break;
> > +        }
> > +
> > +        if (ret == 0) {
> > +            DPRINTF("ret == 0\n");
> > +            break;
> > +        }
> > +
> > +        offset += ret;
> > +    }
> > +
> > +    if (offset > 0) {
> > +        assert(s->buffer_size >= offset);
> > +        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
> > +        s->buffer_size -= offset;
> > +    }
> > +    if (s->buffer_size > 0) {
> > +        s->freeze_output = true;
> > +    }
> > +}
> > +
> > +static int nonblock_put_buffer(void *opaque,
> > +                               const uint8_t *buf, int64_t pos, int size)
> > +{
> > +    QEMUFileNonblock *s = opaque;
> > +    int error;
> > +    ssize_t len = 0;
> > +
> > +    error = qemu_file_get_error(s->file);
> > +    if (error) {
> > +        return error;
> > +    }
> > +
> > +    nonblock_flush_buffer(s);
> > +    error = qemu_file_get_error(s->file);
> > +    if (error) {
> > +        return error;
> > +    }
> > +
> > +    while (!s->freeze_output && size > 0) {
> > +        ssize_t ret;
> > +        assert(s->buffer_size == 0);
> > +
> > +        ret = write(s->fd, buf, size);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EAGAIN) {
> > +                s->freeze_output = true;
> > +            } else {
> > +                qemu_file_set_error(s->file, errno);
> > +            }
> > +            break;
> > +        }
> > +
> > +        len += ret;
> > +        buf += ret;
> > +        size -= ret;
> > +    }
> > +
> > +    if (size > 0) {
> > +        int inc = size - (s->buffer_capacity - s->buffer_size);
> > +        if (inc > 0) {
> > +            s->buffer_capacity +=
> > +                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
> > +            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
> > +        }
> > +        memcpy(s->buffer + s->buffer_size, buf, size);
> > +        s->buffer_size += size;
> > +
> > +        len += size;
> > +    }
> > +
> > +    return len;
> > +}
> > +
> > +static int nonblock_pending_size(QEMUFileNonblock *s)
> > +{
> > +    return qemu_pending_size(s->file) + s->buffer_size;
> > +}
> > +
> > +static void nonblock_fflush(QEMUFileNonblock *s)
> > +{
> > +    s->freeze_output = false;
> > +    nonblock_flush_buffer(s);
> > +    if (!s->freeze_output) {
> > +        qemu_fflush(s->file);
> > +    }
> > +}
> > +
> > +static void nonblock_wait_for_flush(QEMUFileNonblock *s)
> > +{
> > +    while (nonblock_pending_size(s) > 0) {
> > +        fd_set fds;
> > +        FD_ZERO(&fds);
> > +        FD_SET(s->fd, &fds);
> > +        select(s->fd + 1, NULL, &fds, NULL, NULL);
> > +
> > +        nonblock_fflush(s);
> > +    }
> > +}
> > +
> > +static int nonblock_close(void *opaque)
> > +{
> > +    QEMUFileNonblock *s = opaque;
> > +    nonblock_wait_for_flush(s);
> > +    g_free(s->buffer);
> > +    g_free(s);
> > +    return 0;
> > +}
> > +
> > +static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
> > +{
> > +    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
> > +
> > +    s->fd = fd;
> > +    fcntl_setfl(fd, O_NONBLOCK);
> > +    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
> > +                             NULL, NULL, NULL);
> > +    return s;
> > +}
> > +
> > +/***************************************************************************
> > + * umem daemon on destination <-> qemu on source protocol
> > + */
> > +
> > +#define QEMU_UMEM_REQ_INIT              0x00
> > +#define QEMU_UMEM_REQ_ON_DEMAND         0x01
> > +#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
> > +#define QEMU_UMEM_REQ_BACKGROUND        0x03
> > +#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
> > +#define QEMU_UMEM_REQ_REMOVE            0x05
> > +#define QEMU_UMEM_REQ_EOC               0x06
> > +
> > +struct qemu_umem_req {
> > +    int8_t cmd;
> > +    uint8_t len;
> > +    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
> > +    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
> > +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> > +
> > +    /* in target page size as qemu migration protocol */
> > +    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
> > +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> > +};
> > +
> > +static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
> > +{
> > +    qemu_put_byte(f, strlen(idstr));
> > +    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
> > +}
> > +
> > +static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
> > +                                              const uint64_t *pgoffs)
> > +{
> > +    uint32_t i;
> > +
> > +    qemu_put_be32(f, nr);
> > +    for (i = 0; i < nr; i++) {
> > +        qemu_put_be64(f, pgoffs[i]);
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_send_req_one(QEMUFile *f,
> > +                                           const struct qemu_umem_req *req)
> > +{
> > +    DPRINTF("cmd %d\n", req->cmd);
> > +    qemu_put_byte(f, req->cmd);
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        postcopy_incoming_send_req_idstr(f, req->idstr);
> > +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +}
> > +
> > +/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024.
> > + * So one message size must be <= IO_BUF_SIZE
> > + * cmd: 1
> > + * id len: 1
> > + * id: 256
> > + * nr: 2
> > + */
> > +#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 2) / sizeof(uint64_t))
> > +static void postcopy_incoming_send_req(QEMUFile *f,
> > +                                       const struct qemu_umem_req *req)
> > +{
> > +    uint32_t nr = req->nr;
> > +    struct qemu_umem_req tmp = *req;
> > +
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        postcopy_incoming_send_req_one(f, &tmp);
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +        tmp.nr = MIN(nr, MAX_PAGE_NR);
> > +        postcopy_incoming_send_req_one(f, &tmp);
> > +
> > +        nr -= tmp.nr;
> > +        tmp.pgoffs += tmp.nr;
> > +        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
> > +            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> > +        }else {
> > +            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
> > +        }
> > +        /* fall through */
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        while (nr > 0) {
> > +            tmp.nr = MIN(nr, MAX_PAGE_NR);
> > +            postcopy_incoming_send_req_one(f, &tmp);
> > +
> > +            nr -= tmp.nr;
> > +            tmp.pgoffs += tmp.nr;
> > +        }
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +}
> > +
> > +static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
> > +                                            struct qemu_umem_req *req,
> > +                                            size_t *offset)
> > +{
> > +    int ret;
> > +
> > +    req->len = qemu_peek_byte(f, *offset);
> > +    *offset += 1;
> > +    if (req->len == 0) {
> > +        return -EAGAIN;
> > +    }
> > +    req->idstr = g_malloc((int)req->len + 1);
> > +    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
> > +    *offset += ret;
> > +    if (ret != req->len) {
> > +        g_free(req->idstr);
> > +        req->idstr = NULL;
> > +        return -EAGAIN;
> > +    }
> > +    req->idstr[req->len] = 0;
> > +    return 0;
> > +}
> > +
> > +static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
> > +                                             struct qemu_umem_req *req,
> > +                                             size_t *offset)
> > +{
> > +    int ret;
> > +    uint32_t be32;
> > +    uint32_t i;
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
> > +    *offset += sizeof(be32);
> > +    if (ret != sizeof(be32)) {
> > +        return -EAGAIN;
> > +    }
> > +
> > +    req->nr = be32_to_cpu(be32);
> > +    req->pgoffs = g_new(uint64_t, req->nr);
> > +    for (i = 0; i < req->nr; i++) {
> > +        uint64_t be64;
> > +        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
> > +        *offset += sizeof(be64);
> > +        if (ret != sizeof(be64)) {
> > +            g_free(req->pgoffs);
> > +            req->pgoffs = NULL;
> > +            return -EAGAIN;
> > +        }
> > +        req->pgoffs[i] = be64_to_cpu(be64);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
> > +{
> > +    int size;
> > +    int ret;
> > +    size_t offset = 0;
> > +
> > +    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
> > +    if (size <= 0) {
> > +        return -EAGAIN;
> > +    }
> > +    offset += 1;
> > +
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +    qemu_file_skip(f, offset);
> > +    DPRINTF("cmd %d\n", req->cmd);
> > +    return 0;
> > +}
> > +
> > +static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
> > +{
> > +    g_free(req->idstr);
> > +    g_free(req->pgoffs);
> > +}
> > +
> > +/***************************************************************************
> > + * outgoing part
> > + */
> > +
> > +#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
> > +#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
> > +#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
> > +
> > +enum POState {
> > +    PO_STATE_ERROR_RECEIVE,
> > +    PO_STATE_ACTIVE,
> > +    PO_STATE_EOC_RECEIVED,
> > +    PO_STATE_ALL_PAGES_SENT,
> > +    PO_STATE_COMPLETED,
> > +};
> > +typedef enum POState POState;
> > +
> > +struct PostcopyOutgoingState {
> > +    POState state;
> > +    QEMUFile *mig_read;
> > +    int fd_read;
> > +    RAMBlock *last_block_read;
> > +
> > +    QEMUFile *mig_buffered_write;
> > +    MigrationState *ms;
> > +
> > +    /* For nobg mode. Check if all pages are sent */
> > +    RAMBlock *block;
> > +    ram_addr_t addr;
> > +};
> > +typedef struct PostcopyOutgoingState PostcopyOutgoingState;
> > +
> > +int postcopy_outgoing_create_read_socket(MigrationState *s)
> > +{
> > +    if (!s->params.postcopy) {
> > +        return 0;
> > +    }
> > +
> > +    s->fd_read = dup(s->fd);
> > +    if (s->fd_read == -1) {
> > +        int ret = -errno;
> > +        perror("dup");
> > +        return ret;
> > +    }
> > +    s->file_read = qemu_fopen_socket(s->fd_read);
> > +    if (s->file_read == NULL) {
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque)
> > +{
> > +    int ret = 0;
> > +    DPRINTF("stage %d\n", stage);
> > +    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
> > +        sort_ram_list();
> > +        ram_save_live_mem_size(f);
> > +    }
> > +    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
> > +        ret = 1;
> > +    }
> > +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +    return ret;
> > +}
> > +
> > +static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
> > +{
> > +    RAMBlock *block;
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (!strcmp(idstr, block->idstr)) {
> > +            return block;
> > +        }
> > +    }
> > +    return NULL;
> > +}
> > +
> > +/*
> > + * return value
> > + *   0: continue postcopy mode
> > + * > 0: completed postcopy mode.
> > + * < 0: error
> > + */
> > +static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
> > +                                        const struct qemu_umem_req *req,
> > +                                        bool *written)
> > +{
> > +    int i;
> > +    RAMBlock *block;
> > +
> > +    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* we were told to finish migration */
> > +        if (s->state == PO_STATE_ALL_PAGES_SENT) {
> > +            s->state = PO_STATE_COMPLETED;
> > +            DPRINTF("-> PO_STATE_COMPLETED\n");
> > +        } else {
> > +            s->state = PO_STATE_EOC_RECEIVED;
> > +            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
> > +        }
> > +        return 1;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +        DPRINTF("idstr: %s\n", req->idstr);
> > +        block = postcopy_outgoing_find_block(req->idstr);
> > +        if (block == NULL) {
> > +            return -EINVAL;
> > +        }
> > +        s->last_block_read = block;
> > +        /* fall through */
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        DPRINTF("nr %d\n", req->nr);
> > +        for (i = 0; i < req->nr; i++) {
> > +            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
> > +            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
> > +                                    req->pgoffs[i] << TARGET_PAGE_BITS);
> > +            if (ret > 0) {
> > +                *written = true;
> > +            }
> > +        }
> > +        break;
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        block = postcopy_outgoing_find_block(req->idstr);
> > +        if (block == NULL) {
> > +            return -EINVAL;
> > +        }
> > +        for (i = 0; i < req->nr; i++) {
> > +            ram_addr_t addr = block->offset +
> > +                (req->pgoffs[i] << TARGET_PAGE_BITS);
> > +            cpu_physical_memory_reset_dirty(addr,
> > +                                            addr + TARGET_PAGE_SIZE,
> > +                                            MIGRATION_DIRTY_FLAG);
> > +        }
> > +        break;
> > +    default:
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
> > +{
> > +    if (s->mig_read != NULL) {
> > +        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
> > +        qemu_fclose(s->mig_read);
> > +        s->mig_read = NULL;
> > +        fd_close(&s->fd_read);
> > +
> > +        s->ms->file_read = NULL;
> > +        s->ms->fd_read = -1;
> > +    }
> > +}
> > +
> > +static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
> > +{
> > +    postcopy_outgoing_close_mig_read(s);
> > +    s->ms->postcopy = NULL;
> > +    g_free(s);
> > +}
> > +
> > +static void postcopy_outgoing_recv_handler(void *opaque)
> > +{
> > +    PostcopyOutgoingState *s = opaque;
> > +    bool written = false;
> > +    int ret = 0;
> > +
> > +    assert(s->state == PO_STATE_ACTIVE ||
> > +           s->state == PO_STATE_ALL_PAGES_SENT);
> > +
> > +    do {
> > +        struct qemu_umem_req req = {.idstr = NULL,
> > +                                    .pgoffs = NULL};
> > +
> > +        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
> > +        if (ret < 0) {
> > +            if (ret == -EAGAIN) {
> > +                ret = 0;
> > +            }
> > +            break;
> > +        }
> > +        if (s->state == PO_STATE_ACTIVE) {
> > +            ret = postcopy_outgoing_handle_req(s, &req, &written);
> > +        }
> > +        postcopy_outgoing_free_req(&req);
> > +    } while (ret == 0);
> > +
> > +    /*
> > +     * Flush the buffered file.
> > +     * Although mig_buffered_write is a rate-limited buffered file, the
> > +     * pages just written were requested on demand by the destination,
> > +     * so push them out forcibly, ignoring the rate limit.
> > +     */
> > +    if (written) {
> > +        qemu_fflush(s->mig_buffered_write);
> > +        /* qemu_buffered_file_drain(s->mig_buffered_write); */
> > +    }
> > +
> > +    if (ret < 0) {
> > +        switch (s->state) {
> > +        case PO_STATE_ACTIVE:
> > +            s->state = PO_STATE_ERROR_RECEIVE;
> > +            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
> > +            break;
> > +        case PO_STATE_ALL_PAGES_SENT:
> > +            s->state = PO_STATE_COMPLETED;
> > +            DPRINTF("-> PO_STATE_COMPLETED\n");
> > +            break;
> > +        default:
> > +            abort();
> > +        }
> > +    }
> > +    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
> > +        postcopy_outgoing_close_mig_read(s);
> > +    }
> > +    if (s->state == PO_STATE_COMPLETED) {
> > +        DPRINTF("PO_STATE_COMPLETED\n");
> > +        MigrationState *ms = s->ms;
> > +        postcopy_outgoing_completed(s);
> > +        migrate_fd_completed(ms);
> > +    }
> > +}
> > +
> > +void *postcopy_outgoing_begin(MigrationState *ms)
> > +{
> > +    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
> > +    DPRINTF("outgoing begin\n");
> > +    qemu_fflush(ms->file);
> > +
> > +    s->ms = ms;
> > +    s->state = PO_STATE_ACTIVE;
> > +    s->fd_read = ms->fd_read;
> > +    s->mig_read = ms->file_read;
> > +    s->mig_buffered_write = ms->file;
> > +    s->block = NULL;
> > +    s->addr = 0;
> > +
> > +    /* Make sure all dirty bits are set */
> > +    ram_save_memory_set_dirty();
> > +
> > +    qemu_set_fd_handler(s->fd_read,
> > +                        &postcopy_outgoing_recv_handler, NULL, s);
> > +    return s;
> > +}
> > +
> > +static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
> > +                                           PostcopyOutgoingState *s)
> > +{
> > +    assert(s->state == PO_STATE_ACTIVE);
> > +
> > +    s->state = PO_STATE_ALL_PAGES_SENT;
> > +    /* tell incoming side that all pages are sent */
> > +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +    qemu_fflush(f);
> > +    qemu_buffered_file_drain(f);
> > +    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
> > +    migrate_fd_cleanup(s->ms);
> > +
> > +    /* migrate_fd_completed() will be called later, which calls
> > +     * migrate_fd_cleanup() again. So a dummy file is created
> > +     * to keep the qemu monitor working.
> > +     */
> > +    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
> > +                                 NULL, NULL);
> > +}
> > +
> > +static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
> > +                                                RAMBlock *block,
> > +                                                ram_addr_t addr)
> > +{
> > +    if (block == NULL) {
> > +        block = QLIST_FIRST(&ram_list.blocks);
> > +        addr = block->offset;
> > +    }
> > +
> > +    for (; block != NULL;
> > +         block = QLIST_NEXT(block, next),
> > +             addr = (block != NULL) ? block->offset : 0) {
> > +        for (; addr < block->offset + block->length;
> > +             addr += TARGET_PAGE_SIZE) {
> > +            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
> > +                s->block = block;
> > +                s->addr = addr;
> > +                return 0;
> > +            }
> > +        }
> > +    }
> > +
> > +    return 1;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy)
> > +{
> > +    PostcopyOutgoingState *s = postcopy;
> > +
> > +    assert(s->state == PO_STATE_ACTIVE ||
> > +           s->state == PO_STATE_EOC_RECEIVED ||
> > +           s->state == PO_STATE_ERROR_RECEIVE);
> > +
> > +    switch (s->state) {
> > +    case PO_STATE_ACTIVE:
> > +        /* nothing. processed below */
> > +        break;
> > +    case PO_STATE_EOC_RECEIVED:
> > +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +        s->state = PO_STATE_COMPLETED;
> > +        postcopy_outgoing_completed(s);
> > +        DPRINTF("PO_STATE_COMPLETED\n");
> > +        return 1;
> > +    case PO_STATE_ERROR_RECEIVE:
> > +        postcopy_outgoing_completed(s);
> > +        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
> > +        return -1;
> > +    default:
> > +        abort();
> > +    }
> > +
> > +    if (s->ms->params.nobg) {
> > +        /* See if all pages are sent. */
> > +        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
> > +            return 0;
> > +        }
> > +        /* ram_list can be reordered (though this does not seem to happen
> > +           during migration), so the whole list needs to be checked again */
> > +        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
> > +            return 0;
> > +        }
> > +
> > +        postcopy_outgoing_ram_all_sent(f, s);
> > +        return 0;
> > +    }
> > +
> > +    DPRINTF("outgoing background state: %d\n", s->state);
> > +
> > +    while (qemu_file_rate_limit(f) == 0) {
> > +        if (ram_save_block(f) == 0) { /* no more blocks */
> > +            assert(s->state == PO_STATE_ACTIVE);
> > +            postcopy_outgoing_ram_all_sent(f, s);
> > +            return 0;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/***************************************************************************
> > + * incoming part
> > + */
> > +
> > +/* flags for incoming mode to modify its behavior.
> > +   These are for benchmark/debug purposes */
> > +#define INCOMING_FLAGS_FAULT_REQUEST 0x01
> > +
> > +
> > +static void postcopy_incoming_umemd(void);
> > +
> > +#define PIS_STATE_QUIT_RECEIVED         0x01
> > +#define PIS_STATE_QUIT_QUEUED           0x02
> > +#define PIS_STATE_QUIT_SENT             0x04
> > +
> > +#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
> > +                                         PIS_STATE_QUIT_QUEUED | \
> > +                                         PIS_STATE_QUIT_SENT)
> > +
> > +struct PostcopyIncomingState {
> > +    /* dest qemu state */
> > +    uint32_t    state;
> > +
> > +    UMemDev *dev;
> > +    int host_page_size;
> > +    int host_page_shift;
> > +
> > +    /* qemu side */
> > +    int to_umemd_fd;
> > +    QEMUFileNonblock *to_umemd;
> > +#define MAX_FAULTED_PAGES       256
> > +    struct umem_pages *faulted_pages;
> > +
> > +    int from_umemd_fd;
> > +    QEMUFile *from_umemd;
> > +    int version_id;     /* save/load format version id */
> > +};
> > +typedef struct PostcopyIncomingState PostcopyIncomingState;
> > +
> > +
> > +#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
> > +#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
> > +#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
> > +#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
> > +#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
> > +
> > +#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
> > +                                         UMEM_STATE_QUIT_SENT | \
> > +                                         UMEM_STATE_QUIT_RECEIVED)
> > +#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
> > +                                         UMEM_STATE_EOC_SENT | \
> > +                                         UMEM_STATE_QUIT_MASK)
> > +
> > +struct PostcopyIncomingUMemDaemon {
> > +    /* umem daemon side */
> > +    uint32_t state;
> > +
> > +    int host_page_size;
> > +    int host_page_shift;
> > +    int nr_host_pages_per_target_page;
> > +    int host_to_target_page_shift;
> > +    int nr_target_pages_per_host_page;
> > +    int target_to_host_page_shift;
> > +    int version_id;     /* save/load format version id */
> > +
> > +    int to_qemu_fd;
> > +    QEMUFileNonblock *to_qemu;
> > +    int from_qemu_fd;
> > +    QEMUFile *from_qemu;
> > +
> > +    int mig_read_fd;
> > +    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
> > +
> > +    int mig_write_fd;
> > +    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
> > +
> > +    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPUS + 1) */
> > +#define MAX_REQUESTS    (512 * (64 + 1))
> > +
> > +    struct umem_page_request page_request;
> > +    struct umem_page_cached page_cached;
> > +
> > +#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
> > +    struct umem_pages *present_request;
> > +
> > +    uint64_t *target_pgoffs;
> > +
> > +    /* bitmap indexed by target page offset */
> > +    unsigned long *phys_requested;
> > +
> > +    /* bitmap indexed by target page offset */
> > +    unsigned long *phys_received;
> > +
> > +    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
> > +    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
> > +};
> > +typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
> > +
> > +static PostcopyIncomingState state = {
> > +    .state = 0,
> > +    .dev = NULL,
> > +    .to_umemd_fd = -1,
> > +    .to_umemd = NULL,
> > +    .from_umemd_fd = -1,
> > +    .from_umemd = NULL,
> > +};
> > +
> > +static PostcopyIncomingUMemDaemon umemd = {
> > +    .state = 0,
> > +    .to_qemu_fd = -1,
> > +    .to_qemu = NULL,
> > +    .from_qemu_fd = -1,
> > +    .from_qemu = NULL,
> > +    .mig_read_fd = -1,
> > +    .mig_read = NULL,
> > +    .mig_write_fd = -1,
> > +    .mig_write = NULL,
> > +};
> > +
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> > +{
> > +    /* incoming_postcopy makes sense only in incoming migration mode */
> > +    if (!incoming && incoming_postcopy) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!incoming_postcopy) {
> > +        return 0;
> > +    }
> > +
> > +    state.state = 0;
> > +    state.dev = umem_dev_new();
> > +    state.host_page_size = getpagesize();
> > +    state.host_page_shift = ffs(state.host_page_size) - 1;
> > +    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
> > +                                               ram_save_live() */
> > +    return 0;
> > +}
> > +
> > +void postcopy_incoming_ram_alloc(const char *name,
> > +                                 size_t size, uint8_t **hostp, UMem **umemp)
> > +{
> > +    UMem *umem;
> > +    size = ALIGN_UP(size, state.host_page_size);
> > +    umem = umem_dev_create(state.dev, size, name);
> > +
> > +    *umemp = umem;
> > +    *hostp = umem->umem;
> > +}
> > +
> > +void postcopy_incoming_ram_free(UMem *umem)
> > +{
> > +    umem_unmap(umem);
> > +    umem_close(umem);
> > +    umem_destroy(umem);
> > +}
> > +
> > +void postcopy_incoming_prepare(void)
> > +{
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (block->umem != NULL) {
> > +            umem_mmap(block->umem);
> > +        }
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_ram_load_get64(QEMUFile *f,
> > +                                             ram_addr_t *addr, int *flags)
> > +{
> > +    *addr = qemu_get_be64(f);
> > +    *flags = *addr & ~TARGET_PAGE_MASK;
> > +    *addr &= TARGET_PAGE_MASK;
> > +    return qemu_file_get_error(f);
> > +}
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    ram_addr_t addr;
> > +    int flags;
> > +    int error;
> > +
> > +    DPRINTF("incoming ram load\n");
> > +    /*
> > +     * RAM_SAVE_FLAGS_EOS or
> > +     * RAM_SAVE_FLAGS_MEM_SIZE + mem size + RAM_SAVE_FLAGS_EOS
> > +     * see postcopy_outgoing_ram_save_live()
> > +     */
> > +
> > +    if (version_id != RAM_SAVE_VERSION_ID) {
> > +        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
> > +                version_id, RAM_SAVE_VERSION_ID);
> > +        return -EINVAL;
> > +    }
> > +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> > +    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
> > +    if (error) {
> > +        DPRINTF("error %d\n", error);
> > +        return error;
> > +    }
> > +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> > +        DPRINTF("EOS\n");
> > +        return 0;
> > +    }
> > +
> > +    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
> > +        DPRINTF("-EINVAL flags 0x%x\n", flags);
> > +        return -EINVAL;
> > +    }
> > +    error = ram_load_mem_size(f, addr);
> > +    if (error) {
> > +        DPRINTF("addr 0x%lx error %d\n", addr, error);
> > +        return error;
> > +    }
> > +
> > +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> > +    if (error) {
> > +        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
> > +        return error;
> > +    }
> > +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> > +        DPRINTF("done\n");
> > +        return 0;
> > +    }
> > +    DPRINTF("-EINVAL\n");
> > +    return -EINVAL;
> > +}
> > +
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> > +{
> > +    int fds[2];
> > +    RAMBlock *block;
> > +
> > +    DPRINTF("fork\n");
> > +
> > +    /* socketpair(AF_UNIX)? */
> > +
> > +    if (qemu_pipe(fds) == -1) {
> > +        perror("qemu_pipe");
> > +        abort();
> > +    }
> > +    state.from_umemd_fd = fds[0];
> > +    umemd.to_qemu_fd = fds[1];
> > +
> > +    if (qemu_pipe(fds) == -1) {
> > +        perror("qemu_pipe");
> > +        abort();
> > +    }
> > +    umemd.from_qemu_fd = fds[0];
> > +    state.to_umemd_fd = fds[1];
> > +
> > +    pid_t child = fork();
> > +    if (child < 0) {
> > +        perror("fork");
> > +        abort();
> > +    }
> > +
> > +    if (child == 0) {
> > +        int mig_write_fd;
> > +
> > +        fd_close(&state.to_umemd_fd);
> > +        fd_close(&state.from_umemd_fd);
> > +        umemd.host_page_size = state.host_page_size;
> > +        umemd.host_page_shift = state.host_page_shift;
> > +
> > +        umemd.nr_host_pages_per_target_page =
> > +            TARGET_PAGE_SIZE / umemd.host_page_size;
> > +        umemd.nr_target_pages_per_host_page =
> > +            umemd.host_page_size / TARGET_PAGE_SIZE;
> > +
> > +        umemd.target_to_host_page_shift =
> > +            ffs(umemd.nr_host_pages_per_target_page) - 1;
> > +        umemd.host_to_target_page_shift =
> > +            ffs(umemd.nr_target_pages_per_host_page) - 1;
> > +
> > +        umemd.state = 0;
> > +        umemd.version_id = state.version_id;
> > +        umemd.mig_read_fd = mig_read_fd;
> > +        umemd.mig_read = mig_read;
> > +
> > +        mig_write_fd = dup(mig_read_fd);
> > +        if (mig_write_fd < 0) {
> > +            perror("could not dup for writable socket");
> > +            abort();
> > +        }
> > +        umemd.mig_write_fd = mig_write_fd;
> > +        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
> > +
> > +        postcopy_incoming_umemd(); /* noreturn */
> > +    }
> > +
> > +    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
> > +    fd_close(&umemd.to_qemu_fd);
> > +    fd_close(&umemd.from_qemu_fd);
> > +    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
> > +    state.faulted_pages->nr = 0;
> > +
> > +    /* close all UMem.shmem_fd */
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        umem_close_shmem(block->umem);
> > +    }
> > +    umem_qemu_wait_for_daemon(state.from_umemd_fd);
> > +}
> > +
> > +static void postcopy_incoming_qemu_recv_quit(void)
> > +{
> > +    RAMBlock *block;
> > +    if (state.state & PIS_STATE_QUIT_RECEIVED) {
> > +        return;
> > +    }
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (block->umem != NULL) {
> > +            umem_destroy(block->umem);
> > +            block->umem = NULL;
> > +            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> > +        }
> > +    }
> > +
> > +    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
> > +    state.state |= PIS_STATE_QUIT_RECEIVED;
> > +    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
> > +    qemu_fclose(state.from_umemd);
> > +    state.from_umemd = NULL;
> > +    fd_close(&state.from_umemd_fd);
> > +}
> > +
> > +static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
> > +{
> > +    assert(state.to_umemd != NULL);
> > +
> > +    nonblock_fflush(state.to_umemd);
> > +    if (nonblock_pending_size(state.to_umemd) > 0) {
> > +        return;
> > +    }
> > +
> > +    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
> > +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> > +        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
> > +        state.state |= PIS_STATE_QUIT_SENT;
> > +        qemu_fclose(state.to_umemd->file);
> > +        state.to_umemd = NULL;
> > +        fd_close(&state.to_umemd_fd);
> > +        g_free(state.faulted_pages);
> > +        state.faulted_pages = NULL;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_qemu_fflush_to_umemd(void)
> > +{
> > +    qemu_set_fd_handler(state.to_umemd->fd, NULL,
> > +                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
> > +    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
> > +}
> > +
> > +static void postcopy_incoming_qemu_queue_quit(void)
> > +{
> > +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +
> > +    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
> > +    umem_qemu_quit(state.to_umemd->file);
> > +    state.state |= PIS_STATE_QUIT_QUEUED;
> > +}
> > +
> > +static void postcopy_incoming_qemu_send_pages_present(void)
> > +{
> > +    if (state.faulted_pages->nr > 0) {
> > +        umem_qemu_send_pages_present(state.to_umemd->file,
> > +                                     state.faulted_pages);
> > +        state.faulted_pages->nr = 0;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_qemu_faulted_pages(
> > +    const struct umem_pages *pages)
> > +{
> > +    assert(pages->nr <= MAX_FAULTED_PAGES);
> > +    assert(state.faulted_pages != NULL);
> > +
> > +    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
> > +        postcopy_incoming_qemu_send_pages_present();
> > +    }
> > +    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
> > +           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
> > +    state.faulted_pages->nr += pages->nr;
> > +}
> > +
> > +static void postcopy_incoming_qemu_cleanup_umem(void);
> > +
> > +static int postcopy_incoming_qemu_handle_req_one(void)
> > +{
> > +    int offset = 0;
> > +    int ret;
> > +    uint8_t cmd;
> > +
> > +    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
> > +    offset += sizeof(cmd);
> > +    if (ret != sizeof(cmd)) {
> > +        return -EAGAIN;
> > +    }
> > +    DPRINTF("cmd %c\n", cmd);
> > +
> > +    switch (cmd) {
> > +    case UMEM_DAEMON_QUIT:
> > +        postcopy_incoming_qemu_recv_quit();
> > +        postcopy_incoming_qemu_queue_quit();
> > +        postcopy_incoming_qemu_cleanup_umem();
> > +        break;
> > +    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
> > +        struct umem_pages *pages =
> > +            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
> > +            postcopy_incoming_qemu_faulted_pages(pages);
> > +        }
> > +        g_free(pages);
> > +        break;
> > +    }
> > +    case UMEM_DAEMON_ERROR:
> > +        /* the umem daemon hit trouble and warned us to stop VM execution */
> > +        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +
> > +    if (state.from_umemd != NULL) {
> > +        qemu_file_skip(state.from_umemd, offset);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void postcopy_incoming_qemu_handle_req(void *opaque)
> > +{
> > +    do {
> > +        int ret = postcopy_incoming_qemu_handle_req_one();
> > +        if (ret == -EAGAIN) {
> > +            break;
> > +        }
> > +    } while (state.from_umemd != NULL &&
> > +             qemu_pending_size(state.from_umemd) > 0);
> > +
> > +    if (state.to_umemd != NULL) {
> > +        if (state.faulted_pages->nr > 0) {
> > +            postcopy_incoming_qemu_send_pages_present();
> > +        }
> > +        postcopy_incoming_qemu_fflush_to_umemd();
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_ready(void)
> > +{
> > +    umem_qemu_ready(state.to_umemd_fd);
> > +
> > +    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
> > +    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
> > +    qemu_set_fd_handler(state.from_umemd_fd,
> > +                        postcopy_incoming_qemu_handle_req, NULL, NULL);
> > +}
> > +
> > +static void postcopy_incoming_qemu_cleanup_umem(void)
> > +{
> > +    /* If qemu quits before postcopy completes, tell the umem daemon
> > +       to tear down the umem device and exit. */
> > +    if (state.to_umemd_fd >= 0) {
> > +        postcopy_incoming_qemu_queue_quit();
> > +        postcopy_incoming_qemu_fflush_to_umemd();
> > +    }
> > +
> > +    if (state.dev) {
> > +        umem_dev_destroy(state.dev);
> > +        state.dev = NULL;
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_cleanup(void)
> > +{
> > +    postcopy_incoming_qemu_cleanup_umem();
> > +    if (state.to_umemd != NULL) {
> > +        nonblock_wait_for_flush(state.to_umemd);
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> > +{
> > +    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
> > +    size_t len = umem_pages_size(nr);
> > +    ram_addr_t end = addr + size;
> > +    struct umem_pages *pages;
> > +    int i;
> > +
> > +    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +    pages = g_malloc(len);
> > +    pages->nr = nr;
> > +    for (i = 0; addr < end; addr += state.host_page_size, i++) {
> > +        pages->pgoffs[i] = addr >> state.host_page_shift;
> > +    }
> > +    assert(state.to_umemd != NULL);
> > +    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
> > +    g_free(pages);
> > +    postcopy_incoming_qemu_fflush_to_umemd();
> > +}
> > +
> > +/**************************************************************************
> > + * incoming umem daemon
> > + */
> > +
> > +static void postcopy_incoming_umem_recv_quit(void)
> > +{
> > +    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
> > +        return;
> > +    }
> > +    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
> > +    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
> > +    qemu_fclose(umemd.from_qemu);
> > +    umemd.from_qemu = NULL;
> > +    fd_close(&umemd.from_qemu_fd);
> > +}
> > +
> > +static void postcopy_incoming_umem_queue_quit(void)
> > +{
> > +    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
> > +    umem_daemon_quit(umemd.to_qemu->file);
> > +    umemd.state |= UMEM_STATE_QUIT_QUEUED;
> > +}
> > +
> > +static void postcopy_incoming_umem_send_eoc_req(void)
> > +{
> > +    struct qemu_umem_req req;
> > +
> > +    if (umemd.state & UMEM_STATE_EOC_SENT) {
> > +        return;
> > +    }
> > +
> > +    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
> > +    req.cmd = QEMU_UMEM_REQ_EOC;
> > +    postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +    umemd.state |= UMEM_STATE_EOC_SENT;
> > +    qemu_fclose(umemd.mig_write->file);
> > +    umemd.mig_write = NULL;
> > +    fd_close(&umemd.mig_write_fd);
> > +}
> > +
> > +static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
> > +{
> > +    struct qemu_umem_req req;
> > +    int bit;
> > +    uint64_t target_pgoff;
> > +    int i;
> > +
> > +    umemd.page_request.nr = MAX_REQUESTS;
> > +    umem_get_page_request(block->umem, &umemd.page_request);
> > +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> > +            block->idstr, umemd.page_request.nr,
> > +            (uint64_t)umemd.page_request.pgoffs[0],
> > +            (uint64_t)umemd.page_request.pgoffs[1]);
> > +
> > +    if (umemd.last_block_write != block) {
> > +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
> > +        req.idstr = block->idstr;
> > +    } else {
> > +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> > +    }
> > +
> > +    req.nr = 0;
> > +    req.pgoffs = umemd.target_pgoffs;
> > +    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> > +        for (i = 0; i < umemd.page_request.nr; i++) {
> > +            target_pgoff =
> > +                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
> > +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> > +
> > +            if (!test_and_set_bit(bit, umemd.phys_requested)) {
> > +                req.pgoffs[req.nr] = target_pgoff;
> > +                req.nr++;
> > +            }
> > +        }
> > +    } else {
> > +        for (i = 0; i < umemd.page_request.nr; i++) {
> > +            int j;
> > +            target_pgoff =
> > +                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
> > +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> > +
> > +            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
> > +                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
> > +                    req.pgoffs[req.nr] = target_pgoff + j;
> > +                    req.nr++;
> > +                }
> > +            }
> > +        }
> > +    }
> > +
> > +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> > +            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
> > +    if (req.nr > 0 && umemd.mig_write != NULL) {
> > +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +        umemd.last_block_write = block;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_umem_send_pages_present(void)
> > +{
> > +    if (umemd.present_request->nr > 0) {
> > +        umem_daemon_send_pages_present(umemd.to_qemu->file,
> > +                                       umemd.present_request);
> > +        umemd.present_request->nr = 0;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_umem_pages_present_one(
> > +    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
> > +{
> > +    uint32_t i;
> > +    assert(nr <= MAX_PRESENT_REQUESTS);
> > +
> > +    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
> > +        postcopy_incoming_umem_send_pages_present();
> > +    }
> > +
> > +    for (i = 0; i < nr; i++) {
> > +        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
> > +            pgoffs[i] + ramblock_pgoffset;
> > +    }
> > +    umemd.present_request->nr += nr;
> > +}
> > +
> > +static void postcopy_incoming_umem_pages_present(
> > +    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
> > +{
> > +    uint32_t left = page_cached->nr;
> > +    uint32_t offset = 0;
> > +
> > +    while (left > 0) {
> > +        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
> > +        postcopy_incoming_umem_pages_present_one(
> > +            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
> > +
> > +        left -= nr;
> > +        offset += nr;
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_umem_ram_load(void)
> > +{
> > +    ram_addr_t offset;
> > +    int flags;
> > +    int error;
> > +    void *shmem;
> > +    int i;
> > +    int bit;
> > +
> > +    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    offset = qemu_get_be64(umemd.mig_read);
> > +
> > +    flags = offset & ~TARGET_PAGE_MASK;
> > +    offset &= TARGET_PAGE_MASK;
> > +
> > +    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
> > +
> > +    if (flags & RAM_SAVE_FLAG_EOS) {
> > +        DPRINTF("RAM_SAVE_FLAG_EOS\n");
> > +        postcopy_incoming_umem_send_eoc_req();
> > +
> > +        qemu_fclose(umemd.mig_read);
> > +        umemd.mig_read = NULL;
> > +        fd_close(&umemd.mig_read_fd);
> > +        umemd.state |= UMEM_STATE_EOS_RECEIVED;
> > +
> > +        postcopy_incoming_umem_queue_quit();
> > +        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
> > +        return 0;
> > +    }
> > +
> > +    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
> > +                                             &umemd.last_block_read);
> > +    if (!shmem) {
> > +        DPRINTF("shmem == NULL\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (flags & RAM_SAVE_FLAG_COMPRESS) {
> > +        uint8_t ch = qemu_get_byte(umemd.mig_read);
> > +        memset(shmem, ch, TARGET_PAGE_SIZE);
> > +    } else if (flags & RAM_SAVE_FLAG_PAGE) {
> > +        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
> > +    }
> > +
> > +    error = qemu_file_get_error(umemd.mig_read);
> > +    if (error) {
> > +        DPRINTF("error %d\n", error);
> > +        return error;
> > +    }
> > +
> > +    umemd.page_cached.nr = 0;
> > +    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
> > +    if (!test_and_set_bit(bit, umemd.phys_received)) {
> > +        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> > +            __u64 pgoff = offset >> umemd.host_page_shift;
> > +            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
> > +                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
> > +                umemd.page_cached.nr++;
> > +            }
> > +        } else {
> > +            bool mark_cache = true;
> > +            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
> > +                if (!test_bit(bit + i, umemd.phys_received)) {
> > +                    mark_cache = false;
> > +                    break;
> > +                }
> > +            }
> > +            if (mark_cache) {
> > +                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
> > +                umemd.page_cached.nr = 1;
> > +            }
> > +        }
> > +    }
> > +
> > +    if (umemd.page_cached.nr > 0) {
> > +        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
> > +
> > +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
> > +            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
> > +            uint64_t ramblock_pgoffset;
> > +
> > +            ramblock_pgoffset =
> > +                umemd.last_block_read->offset >> umemd.host_page_shift;
> > +            postcopy_incoming_umem_pages_present(&umemd.page_cached,
> > +                                                 ramblock_pgoffset);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static bool postcopy_incoming_umem_check_umem_done(void)
> > +{
> > +    bool all_done = true;
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        UMem *umem = block->umem;
> > +        if (umem != NULL && umem->nsets == umem->nbits) {
> > +            umem_unmap_shmem(umem);
> > +            umem_destroy(umem);
> > +            block->umem = NULL;
> > +        }
> > +        if (block->umem != NULL) {
> > +            all_done = false;
> > +        }
> > +    }
> > +    return all_done;
> > +}
> > +
> > +static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < pages->nr; i++) {
> > +        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
> > +        RAMBlock *block = qemu_get_ram_block(addr);
> > +        addr -= block->offset;
> > +        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
> > +    }
> > +    return postcopy_incoming_umem_check_umem_done();
> > +}
> > +
> > +static bool
> > +postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
> > +{
> > +    RAMBlock *block;
> > +    ram_addr_t addr;
> > +    int i;
> > +
> > +    struct qemu_umem_req req = {
> > +        .cmd = QEMU_UMEM_REQ_REMOVE,
> > +        .nr = 0,
> > +        .pgoffs = (uint64_t*)pages->pgoffs,
> > +    };
> > +
> > +    addr = pages->pgoffs[0] << umemd.host_page_shift;
> > +    block = qemu_get_ram_block(addr);
> > +
> > +    for (i = 0; i < pages->nr; i++)  {
> > +        int pgoff;
> > +
> > +        addr = pages->pgoffs[i] << umemd.host_page_shift;
> > +        pgoff = addr >> TARGET_PAGE_BITS;
> > +        if (!test_bit(pgoff, umemd.phys_received) &&
> > +            !test_bit(pgoff, umemd.phys_requested)) {
> > +            req.pgoffs[req.nr] = pgoff;
> > +            req.nr++;
> > +        }
> > +        set_bit(pgoff, umemd.phys_received);
> > +        set_bit(pgoff, umemd.phys_requested);
> > +
> > +        umem_remove_shmem(block->umem,
> > +                          addr - block->offset, umemd.host_page_size);
> > +    }
> > +    if (req.nr > 0 && umemd.mig_write != NULL) {
> > +        req.idstr = block->idstr;
> > +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +    }
> > +
> > +    return postcopy_incoming_umem_check_umem_done();
> > +}
> > +
> > +static void postcopy_incoming_umem_done(void)
> > +{
> > +    postcopy_incoming_umem_send_eoc_req();
> > +    postcopy_incoming_umem_queue_quit();
> > +}
> > +
> > +static int postcopy_incoming_umem_handle_qemu(void)
> > +{
> > +    int ret;
> > +    int offset = 0;
> > +    uint8_t cmd;
> > +
> > +    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
> > +    offset += sizeof(cmd);
> > +    if (ret != sizeof(cmd)) {
> > +        return -EAGAIN;
> > +    }
> > +    DPRINTF("cmd %c\n", cmd);
> > +    switch (cmd) {
> > +    case UMEM_QEMU_QUIT:
> > +        postcopy_incoming_umem_recv_quit();
> > +        postcopy_incoming_umem_done();
> > +        break;
> > +    case UMEM_QEMU_PAGE_FAULTED: {
> > +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> > +                                                   &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (postcopy_incoming_umem_page_faulted(pages)) {
> > +            postcopy_incoming_umem_done();
> > +        }
> > +        g_free(pages);
> > +        break;
> > +    }
> > +    case UMEM_QEMU_PAGE_UNMAPPED: {
> > +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> > +                                                   &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (postcopy_incoming_umem_page_unmapped(pages)) {
> > +            postcopy_incoming_umem_done();
> > +        }
> > +        g_free(pages);
> > +        break;
> > +    }
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +    if (umemd.from_qemu != NULL) {
> > +        qemu_file_skip(umemd.from_qemu, offset);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void set_fd(int fd, fd_set *fds, int *nfds)
> > +{
> > +    FD_SET(fd, fds);
> > +    if (fd > *nfds) {
> > +        *nfds = fd;
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_umemd_main_loop(void)
> > +{
> > +    fd_set writefds;
> > +    fd_set readfds;
> > +    int nfds;
> > +    RAMBlock *block;
> > +    int ret;
> > +
> > +    int pending_size;
> > +    bool get_page_request;
> > +
> > +    nfds = -1;
> > +    FD_ZERO(&writefds);
> > +    FD_ZERO(&readfds);
> > +
> > +    if (umemd.mig_write != NULL) {
> > +        pending_size = nonblock_pending_size(umemd.mig_write);
> > +        if (pending_size > 0) {
> > +            set_fd(umemd.mig_write_fd, &writefds, &nfds);
> > +        }
> > +    } else {
> > +        pending_size = 0;
> > +    }
> > +
> > +#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
> > +    /* If page requests to the migration source have accumulated,
> > +       stop accepting new page fault requests for now. */
> > +    get_page_request = (pending_size <= PENDING_SIZE_MAX);
> > +
> > +    if (get_page_request) {
> > +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +            if (block->umem != NULL) {
> > +                set_fd(block->umem->fd, &readfds, &nfds);
> > +            }
> > +        }
> > +    }
> > +
> > +    if (umemd.mig_read_fd >= 0) {
> > +        set_fd(umemd.mig_read_fd, &readfds, &nfds);
> > +    }
> > +
> > +    if (umemd.to_qemu != NULL &&
> > +        nonblock_pending_size(umemd.to_qemu) > 0) {
> > +        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
> > +    }
> > +    if (umemd.from_qemu_fd >= 0) {
> > +        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
> > +    }
> > +
> > +    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
> > +    if (ret == -1) {
> > +        if (errno == EINTR) {
> > +            return 0;
> > +        }
> > +        return ret;
> > +    }
> > +
> > +    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
> > +        nonblock_fflush(umemd.mig_write);
> > +    }
> > +    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
> > +        nonblock_fflush(umemd.to_qemu);
> > +    }
> > +    if (get_page_request) {
> > +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
> > +                postcopy_incoming_umem_send_page_req(block);
> > +            }
> > +        }
> > +    }
> > +    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
> > +        do {
> > +            ret = postcopy_incoming_umem_ram_load();
> > +            if (ret < 0) {
> > +                return ret;
> > +            }
> > +        } while (umemd.mig_read != NULL &&
> > +                 qemu_pending_size(umemd.mig_read) > 0);
> > +    }
> > +    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
> > +        do {
> > +            ret = postcopy_incoming_umem_handle_qemu();
> > +            if (ret == -EAGAIN) {
> > +                break;
> > +            }
> > +        } while (umemd.from_qemu != NULL &&
> > +                 qemu_pending_size(umemd.from_qemu) > 0);
> > +    }
> > +
> > +    if (umemd.mig_write != NULL) {
> > +        nonblock_fflush(umemd.mig_write);
> > +    }
> > +    if (umemd.to_qemu != NULL) {
> > +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
> > +            postcopy_incoming_umem_send_pages_present();
> > +        }
> > +        nonblock_fflush(umemd.to_qemu);
> > +        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
> > +            nonblock_pending_size(umemd.to_qemu) == 0) {
> > +            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
> > +            qemu_fclose(umemd.to_qemu->file);
> > +            umemd.to_qemu = NULL;
> > +            fd_close(&umemd.to_qemu_fd);
> > +            umemd.state |= UMEM_STATE_QUIT_SENT;
> > +        }
> > +    }
> > +
> > +    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
> > +}
> > +
> > +static void postcopy_incoming_umemd(void)
> > +{
> > +    ram_addr_t last_ram_offset;
> > +    int nbits;
> > +    RAMBlock *block;
> > +    int ret;
> > +
> > +    qemu_daemon(1, 1);
> > +    signal(SIGPIPE, SIG_IGN);
> > +    DPRINTF("daemon pid: %d\n", getpid());
> > +
> > +    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
> > +    umemd.page_cached.pgoffs =
> > +        g_new(__u64, MAX_REQUESTS *
> > +              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
> > +               1: umemd.nr_host_pages_per_target_page));
> > +    umemd.target_pgoffs =
> > +        g_new(uint64_t, MAX_REQUESTS *
> > +              MAX(umemd.nr_host_pages_per_target_page,
> > +                  umemd.nr_target_pages_per_host_page));
> > +    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
> > +    umemd.present_request->nr = 0;
> > +
> > +    last_ram_offset = qemu_last_ram_offset();
> > +    nbits = last_ram_offset >> TARGET_PAGE_BITS;
> > +    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> > +    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> > +    umemd.last_block_read = NULL;
> > +    umemd.last_block_write = NULL;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        UMem *umem = block->umem;
> > +        umem->umem = NULL;      /* the umem mapping area has VM_DONTCOPY set,
> > +                                   so those mappings were lost across fork() */
> > +        block->host = umem_map_shmem(umem);
> > +        umem_close_shmem(umem);
> > +    }
> > +    umem_daemon_ready(umemd.to_qemu_fd);
> > +    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
> > +
> > +    /* wait for qemu to disown migration_fd */
> > +    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
> > +    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
> > +
> > +    DPRINTF("entering umemd main loop\n");
> > +    for (;;) {
> > +        ret = postcopy_incoming_umemd_main_loop();
> > +        if (ret != 0) {
> > +            break;
> > +        }
> > +    }
> > +    DPRINTF("exiting umemd main loop\n");
> > +
> > +    /* This daemon forked from qemu and the parent qemu is still running.
> > +     * Cleanups of linked libraries like SDL should not be triggered,
> > +     * otherwise the parent qemu may use resources which were already freed.
> > +     */
> > +    fflush(stdout);
> > +    fflush(stderr);
> > +    _exit(ret < 0 ? EXIT_FAILURE : 0);
> > +}
> > diff --git a/migration-tcp.c b/migration-tcp.c
> > index cf6a9b8..aa35050 100644
> > --- a/migration-tcp.c
> > +++ b/migration-tcp.c
> > @@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
> >      } while (ret == -1 && (socket_error()) == EINTR);
> >  
> >      if (ret < 0) {
> > -        migrate_fd_error(s);
> > -        return;
> > +        goto error_out;
> >      }
> >  
> >      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
> >  
> > -    if (val == 0)
> > +    if (val == 0) {
> > +        ret = postcopy_outgoing_create_read_socket(s);
> > +        if (ret < 0) {
> > +            goto error_out;
> > +        }
> >          migrate_fd_connect(s);
> > -    else {
> > +    } else {
> >          DPRINTF("error connecting %d\n", val);
> > -        migrate_fd_error(s);
> > +        goto error_out;
> >      }
> > +    return;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> >  }
> >  
> >  int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> > @@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> >  
> >      if (ret < 0) {
> >          DPRINTF("connect failed\n");
> > -        migrate_fd_error(s);
> > -        return ret;
> > +        goto error_out;
> > +    }
> > +
> > +    ret = postcopy_outgoing_create_read_socket(s);
> > +    if (ret < 0) {
> > +        goto error_out;
> >      }
> >      migrate_fd_connect(s);
> >      return 0;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> > +    return ret;
> >  }
> >  
> >  static void tcp_accept_incoming_migration(void *opaque)
> > @@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
> >      }
> >  
> >      process_incoming_migration(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(c, f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        /* The socket is now disowned, so tell the umem daemon
> > +           that it is safe to use it. */
> > +        postcopy_incoming_qemu_ready();
> > +    }
> >  out:
> >      close(c);
> >  out2:
> > diff --git a/migration-unix.c b/migration-unix.c
> > index dfcf203..3707505 100644
> > --- a/migration-unix.c
> > +++ b/migration-unix.c
> > @@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
> >  
> >      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
> >  
> > -    if (val == 0)
> > +    if (val == 0) {
> > +        ret = postcopy_outgoing_create_read_socket(s);
> > +        if (ret < 0) {
> > +            goto error_out;
> > +        }
> >          migrate_fd_connect(s);
> > -    else {
> > +    } else {
> >          DPRINTF("error connecting %d\n", val);
> > -        migrate_fd_error(s);
> > +        goto error_out;
> >      }
> > +    return;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> >  }
> >  
> >  int unix_start_outgoing_migration(MigrationState *s, const char *path)
> > @@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
> >  
> >      if (ret < 0) {
> >          DPRINTF("connect failed\n");
> > -        migrate_fd_error(s);
> > -        return ret;
> > +        goto error_out;
> > +    }
> > +
> > +    ret = postcopy_outgoing_create_read_socket(s);
> > +    if (ret < 0) {
> > +        goto error_out;
> >      }
> >      migrate_fd_connect(s);
> >      return 0;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> > +    return ret;
> >  }
> >  
> >  static void unix_accept_incoming_migration(void *opaque)
> > @@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
> >      }
> >  
> >      process_incoming_migration(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(c, f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_ready();
> > +    }
> >  out:
> >      close(c);
> >  out2:
> > diff --git a/migration.c b/migration.c
> > index 0149ab3..51efe44 100644
> > --- a/migration.c
> > +++ b/migration.c
> > @@ -39,6 +39,11 @@ enum {
> >      MIG_STATE_COMPLETED,
> >  };
> >  
> > +enum {
> > +    MIG_SUBSTATE_PRECOPY,
> > +    MIG_SUBSTATE_POSTCOPY,
> > +};
> > +
> >  #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
> >  
> >  static NotifierList migration_state_notifiers =
> > @@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
> >          return;
> >      }
> >  
> > +    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
> > +        /* PRINTF("postcopy background\n"); */
> > +        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
> > +                                                    s->postcopy);
> > +        if (ret > 0) {
> > +            migrate_fd_completed(s);
> > +        } else if (ret < 0) {
> > +            migrate_fd_error(s);
> > +        }
> > +        return;
> > +    }
> > +
> >      DPRINTF("iterate\n");
> >      ret = qemu_savevm_state_iterate(s->mon, s->file);
> >      if (ret < 0) {
> > @@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
> >          DPRINTF("done iterating\n");
> >          vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> >  
> > +        if (s->params.postcopy) {
> > +            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> > +                migrate_fd_error(s);
> > +                if (old_vm_running) {
> > +                    vm_start();
> > +                }
> > +                return;
> > +            }
> > +            s->substate = MIG_SUBSTATE_POSTCOPY;
> > +            s->postcopy = postcopy_outgoing_begin(s);
> > +            return;
> > +        }
> > +
> >          if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> >              migrate_fd_error(s);
> >          } else {
> > @@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
> >      int ret;
> >  
> >      s->state = MIG_STATE_ACTIVE;
> > +    s->substate = MIG_SUBSTATE_PRECOPY;
> >      s->file = qemu_fopen_ops_buffered(s,
> >                                        s->bandwidth_limit,
> >                                        migrate_fd_put_buffer,
> > diff --git a/migration.h b/migration.h
> > index 90ae362..2809e99 100644
> > --- a/migration.h
> > +++ b/migration.h
> > @@ -40,6 +40,12 @@ struct MigrationState
> >      int (*write)(MigrationState *s, const void *buff, size_t size);
> >      void *opaque;
> >      MigrationParams params;
> > +
> > +    /* for postcopy */
> > +    int substate;              /* precopy or postcopy */
> > +    int fd_read;
> > +    QEMUFile *file_read;        /* connection from the destination */
> > +    void *postcopy;
> >  };
> >  
> >  void process_incoming_migration(QEMUFile *f);
> > @@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
> >  uint64_t ram_bytes_transferred(void);
> >  uint64_t ram_bytes_total(void);
> >  
> > +void ram_save_set_params(const MigrationParams *params, void *opaque);
> >  void sort_ram_list(void);
> >  int ram_save_block(QEMUFile *f);
> >  void ram_save_memory_set_dirty(void);
> > @@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
> >   */
> >  void migrate_del_blocker(Error *reason);
> >  
> > +/* For outgoing postcopy */
> > +int postcopy_outgoing_create_read_socket(MigrationState *s);
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque);
> > +void *postcopy_outgoing_begin(MigrationState *s);
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy);
> > +
> > +/* For incoming postcopy */
> >  extern bool incoming_postcopy;
> >  extern unsigned long incoming_postcopy_flags;
> >  
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
> > +void postcopy_incoming_ram_alloc(const char *name,
> > +                                 size_t size, uint8_t **hostp, UMem **umemp);
> > +void postcopy_incoming_ram_free(UMem *umem);
> > +void postcopy_incoming_prepare(void);
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
> > +void postcopy_incoming_qemu_ready(void);
> > +void postcopy_incoming_qemu_cleanup(void);
> > +#ifdef NEED_CPU_H
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
> > +#endif
> > +
> >  #endif
> > diff --git a/qemu-common.h b/qemu-common.h
> > index 725922b..d74a8c9 100644
> > --- a/qemu-common.h
> > +++ b/qemu-common.h
> > @@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
> >  
> >  struct Monitor;
> >  typedef struct Monitor Monitor;
> > +typedef struct UMem UMem;
> >  
> >  /* we put basic includes here to avoid repeating them in device drivers */
> >  #include <stdlib.h>
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 5c5b8f3..19e20f9 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
> >      "-postcopy-flags unsigned-int(flags)\n"
> >      "	                flags for postcopy incoming migration\n"
> >      "                   when -incoming and -postcopy are specified.\n"
> > -    "                   This is for benchmark/debug purpose (default: 0)\n",
> > +    "                   This is for benchmark/debug purpose (default: 0)\n"
> > +    "                   Currently supported flags are\n"
> > +    "                   1: enable fault request from umemd to qemu\n"
> > +    "                      (default: disabled)\n",
> >      QEMU_ARCH_ALL)
> >  STEXI
> >  @item -postcopy-flags int
> 
> Can you move umem.c and umem.h to a separate patch please;
> this patch
> > diff --git a/umem.c b/umem.c
> > new file mode 100644
> > index 0000000..b7be006
> > --- /dev/null
> > +++ b/umem.c
> > @@ -0,0 +1,379 @@
> > +/*
> > + * umem.c: user process backed memory module for postcopy livemigration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +
> > +#include <linux/umem.h>
> > +
> > +#include "bitops.h"
> > +#include "sysemu.h"
> > +#include "hw/hw.h"
> > +#include "umem.h"
> > +
> > +//#define DEBUG_UMEM
> > +#ifdef DEBUG_UMEM
> > +#include <sys/syscall.h>
> > +#define DPRINTF(format, ...)                                            \
> > +    do {                                                                \
> > +        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
> > +               __func__, __LINE__, ## __VA_ARGS__);                     \
> > +    } while (0)
> > +#else
> > +#define DPRINTF(format, ...)    do { } while (0)
> > +#endif
> > +
> > +#define DEV_UMEM        "/dev/umem"
> > +
> > +struct UMemDev {
> > +    int fd;
> > +    int page_shift;
> > +};
> > +
> > +UMemDev *umem_dev_new(void)
> > +{
> > +    UMemDev *umem_dev;
> > +    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
> > +    if (umem_dev_fd < 0) {
> > +        perror("can't open "DEV_UMEM);
> > +        abort();
> > +    }
> > +
> > +    umem_dev = g_new(UMemDev, 1);
> > +    umem_dev->fd = umem_dev_fd;
> > +    umem_dev->page_shift = ffs(getpagesize()) - 1;
> > +    return umem_dev;
> > +}
> > +
> > +void umem_dev_destroy(UMemDev *dev)
> > +{
> > +    close(dev->fd);
> > +    g_free(dev);
> > +}
> > +
> > +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
> > +{
> > +    struct umem_create create = {
> > +        .size = size,
> > +        .async_req_max = 0,
> > +        .sync_req_max = 0,
> > +    };
> > +    UMem *umem;
> > +
> > +    snprintf(create.name.id, sizeof(create.name.id),
> > +             "pid-%"PRIu64, (uint64_t)getpid());
> > +    create.name.id[UMEM_ID_MAX - 1] = 0;
> > +    strncpy(create.name.name, name, sizeof(create.name.name));
> > +    create.name.name[UMEM_NAME_MAX - 1] = 0;
> > +
> > +    assert((size % getpagesize()) == 0);
> > +    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
> > +        perror("UMEM_DEV_CREATE_UMEM");
> > +        abort();
> > +    }
> > +    if (ftruncate(create.shmem_fd, create.size) < 0) {
> > +        perror("ftruncate(shmem_fd)");
> > +        abort();
> > +    }
> > +
> > +    umem = g_new(UMem, 1);
> > +    umem->nbits = 0;
> > +    umem->nsets = 0;
> > +    umem->faulted = NULL;
> > +    umem->page_shift = dev->page_shift;
> > +    umem->fd = create.umem_fd;
> > +    umem->shmem_fd = create.shmem_fd;
> > +    umem->size = create.size;
> > +    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
> > +                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +    if (umem->umem == MAP_FAILED) {
> > +        perror("mmap(UMem) failed");
> > +        abort();
> > +    }
> > +    return umem;
> > +}
> > +
> > +void umem_mmap(UMem *umem)
> > +{
> > +    void *ret = mmap(umem->umem, umem->size,
> > +                     PROT_EXEC | PROT_READ | PROT_WRITE,
> > +                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
> > +    if (ret == MAP_FAILED) {
> > +        perror("umem_mmap(UMem) failed");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_destroy(UMem *umem)
> > +{
> > +    if (umem->fd != -1) {
> > +        close(umem->fd);
> > +    }
> > +    if (umem->shmem_fd != -1) {
> > +        close(umem->shmem_fd);
> > +    }
> > +    g_free(umem->faulted);
> > +    g_free(umem);
> > +}
> > +
> > +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
> > +{
> > +    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
> > +        perror("daemon: UMEM_GET_PAGE_REQUEST");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
> > +{
> > +    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
> > +        perror("daemon: UMEM_MARK_PAGE_CACHED");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_unmap(UMem *umem)
> > +{
> > +    munmap(umem->umem, umem->size);
> > +    umem->umem = NULL;
> > +}
> > +
> > +void umem_close(UMem *umem)
> > +{
> > +    close(umem->fd);
> > +    umem->fd = -1;
> > +}
> > +
> > +void *umem_map_shmem(UMem *umem)
> > +{
> > +    umem->nbits = umem->size >> umem->page_shift;
> > +    umem->nsets = 0;
> > +    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
> > +
> > +    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
> > +                       umem->shmem_fd, 0);
> > +    if (umem->shmem == MAP_FAILED) {
> > +        perror("daemon: mmap(\"shmem\")");
> > +        abort();
> > +    }
> > +    return umem->shmem;
> > +}
> > +
> > +void umem_unmap_shmem(UMem *umem)
> > +{
> > +    munmap(umem->shmem, umem->size);
> > +    umem->shmem = NULL;
> > +}
> > +
> > +void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
> > +{
> > +    int s = offset >> umem->page_shift;
> > +    int e = (offset + size) >> umem->page_shift;
> > +    int i;
> > +
> > +    for (i = s; i < e; i++) {
> > +        if (!test_and_set_bit(i, umem->faulted)) {
> > +            umem->nsets++;
> > +#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
> > +            madvise(umem->shmem + ((size_t)i << umem->page_shift),
> > +                    (size_t)1 << umem->page_shift, MADV_REMOVE);
> > +#endif
> > +        }
> > +    }
> > +}
> > +
> > +void umem_close_shmem(UMem *umem)
> > +{
> > +    close(umem->shmem_fd);
> > +    umem->shmem_fd = -1;
> > +}
> > +
> > +/***************************************************************************/
> > +/* qemu <-> umem daemon communication */
> > +
> > +size_t umem_pages_size(uint64_t nr)
> > +{
> > +    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
> > +}
> > +
> > +static void umem_write_cmd(int fd, uint8_t cmd)
> > +{
> > +    DPRINTF("write cmd %c\n", cmd);
> > +
> > +    for (;;) {
> > +        ssize_t ret = write(fd, &cmd, 1);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EPIPE) {
> > +                perror("pipe");
> > +                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
> > +                        cmd, ret, errno);
> > +                break;
> > +            }
> > +
> > +            perror("pipe");
> > +            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
> > +            abort();
> > +        }
> > +
> > +        break;
> > +    }
> > +}
> > +
> > +static void umem_read_cmd(int fd, uint8_t expect)
> > +{
> > +    uint8_t cmd;
> > +    for (;;) {
> > +        ssize_t ret = read(fd, &cmd, 1);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            }
> > +            perror("pipe");
> > +            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
> > +            abort();
> > +        }
> > +
> > +        if (ret == 0) {
> > +            DPRINTF("read cmd: pipe is closed\n");
> > +            abort();
> > +        }
> > +
> > +        break;
> > +    }
> > +
> > +    DPRINTF("read cmd %c\n", cmd);
> > +    if (cmd != expect) {
> > +        DPRINTF("cmd %c expect %c\n", cmd, expect);
> > +        abort();
> > +    }
> > +}
> > +
> > +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
> > +{
> > +    int ret;
> > +    uint64_t nr;
> > +    size_t size;
> > +    struct umem_pages *pages;
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
> > +    *offset += sizeof(nr);
> > +    DPRINTF("ret %d nr %"PRIu64"\n", ret, nr);
> > +    if (ret != sizeof(nr) || nr == 0) {
> > +        return NULL;
> > +    }
> > +
> > +    size = umem_pages_size(nr);
> > +    pages = g_malloc(size);
> > +    pages->nr = nr;
> > +    size -= sizeof(pages->nr);
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
> > +    *offset += size;
> > +    if (ret != size) {
> > +        g_free(pages);
> > +        return NULL;
> > +    }
> > +    return pages;
> > +}
> > +
> > +static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
> > +{
> > +    size_t len = umem_pages_size(pages->nr);
> > +    qemu_put_buffer(f, (const uint8_t*)pages, len);
> > +}
> > +
> > +/* umem daemon -> qemu */
> > +void umem_daemon_ready(int to_qemu_fd)
> > +{
> > +    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
> > +}
> > +
> > +void umem_daemon_quit(QEMUFile *to_qemu)
> > +{
> > +    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
> > +}
> > +
> > +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> > +                                    struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
> > +    umem_send_pages(to_qemu, pages);
> > +}
> > +
> > +void umem_daemon_wait_for_qemu(int from_qemu_fd)
> > +{
> > +    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
> > +}
> > +
> > +/* qemu -> umem daemon */
> > +void umem_qemu_wait_for_daemon(int from_umemd_fd)
> > +{
> > +    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
> > +}
> > +
> > +void umem_qemu_ready(int to_umemd_fd)
> > +{
> > +    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
> > +}
> > +
> > +void umem_qemu_quit(QEMUFile *to_umemd)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
> > +}
> > +
> > +/* qemu side handler */
> > +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> > +                                                int *offset)
> > +{
> > +    uint64_t i;
> > +    int page_shift = ffs(getpagesize()) - 1;
> > +    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
> > +    if (pages == NULL) {
> > +        return NULL;
> > +    }
> > +
> > +    for (i = 0; i < pages->nr; i++) {
> > +        ram_addr_t addr = pages->pgoffs[i] << page_shift;
> > +
> > +        /* make pages present by forcibly triggering page fault. */
> > +        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
> > +        uint8_t dummy_read = ram[0];
> > +        (void)dummy_read;   /* suppress unused variable warning */
> > +    }
> > +
> > +    return pages;
> > +}
> > +
> > +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> > +                                  const struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
> > +    umem_send_pages(to_umemd, pages);
> > +}
> > +
> > +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> > +                                   const struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
> > +    umem_send_pages(to_umemd, pages);
> > +}
> > diff --git a/umem.h b/umem.h
> > new file mode 100644
> > index 0000000..5ca19ef
> > --- /dev/null
> > +++ b/umem.h
> > @@ -0,0 +1,105 @@
> > +/*
> > + * umem.h: user process backed memory module for postcopy livemigration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef QEMU_UMEM_H
> > +#define QEMU_UMEM_H
> > +
> > +#include <linux/umem.h>
> > +
> > +#include "qemu-common.h"
> > +
> > +typedef struct UMemDev UMemDev;
> > +
> > +struct UMem {
> > +    void *umem;
> > +    int fd;
> > +    void *shmem;
> > +    int shmem_fd;
> > +    uint64_t size;
> > +
> > +    /* bookkeeping below is in units of host pages */
> > +    int page_shift;
> > +    int nbits;
> > +    int nsets;
> > +    unsigned long *faulted;
> > +};
> > +
> > +UMemDev *umem_dev_new(void);
> > +void umem_dev_destroy(UMemDev *dev);
> > +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
> > +void umem_mmap(UMem *umem);
> > +
> > +void umem_destroy(UMem *umem);
> > +
> > +/* umem device operations */
> > +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
> > +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
> > +void umem_unmap(UMem *umem);
> > +void umem_close(UMem *umem);
> > +
> > +/* umem shmem operations */
> > +void *umem_map_shmem(UMem *umem);
> > +void umem_unmap_shmem(UMem *umem);
> > +void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
> > +void umem_close_shmem(UMem *umem);
> > +
> > +/* qemu on source <-> umem daemon communication */
> > +
> > +struct umem_pages {
> > +    uint64_t nr;        /* nr = 0 means completed */
> > +    uint64_t pgoffs[0];
> > +};
> > +
> > +/* daemon -> qemu */
> > +#define UMEM_DAEMON_READY               'R'
> > +#define UMEM_DAEMON_QUIT                'Q'
> > +#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
> > +#define UMEM_DAEMON_ERROR               'E'
> > +
> > +/* qemu -> daemon */
> > +#define UMEM_QEMU_READY                 'r'
> > +#define UMEM_QEMU_QUIT                  'q'
> > +#define UMEM_QEMU_PAGE_FAULTED          't'
> > +#define UMEM_QEMU_PAGE_UNMAPPED         'u'
> > +
> > +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
> > +size_t umem_pages_size(uint64_t nr);
> > +
> > +/* for umem daemon */
> > +void umem_daemon_ready(int to_qemu_fd);
> > +void umem_daemon_wait_for_qemu(int from_qemu_fd);
> > +void umem_daemon_quit(QEMUFile *to_qemu);
> > +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> > +                                    struct umem_pages *pages);
> > +
> > +/* for qemu */
> > +void umem_qemu_wait_for_daemon(int from_umemd_fd);
> > +void umem_qemu_ready(int to_umemd_fd);
> > +void umem_qemu_quit(QEMUFile *to_umemd);
> > +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> > +                                                int *offset);
> > +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> > +                                  const struct umem_pages *pages);
> > +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> > +                                   const struct umem_pages *pages);
> > +
> > +#endif /* QEMU_UMEM_H */
> > diff --git a/vl.c b/vl.c
> > index 5430b8c..17427a0 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
> >      default_drive(default_sdcard, snapshot, machine->use_scsi,
> >                    IF_SD, 0, SD_OPTS);
> >  
> > -    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
> > -                         ram_save_live, NULL, ram_load, NULL);
> > +    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
> > +        exit(1);
> > +    }
> > +    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
> > +                         ram_save_set_params, ram_save_live, NULL,
> > +                         ram_load, NULL);
> >  
> >      if (nb_numa_nodes > 0) {
> >          int i;
> > @@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
> >  
> >      if (incoming) {
> >          runstate_set(RUN_STATE_INMIGRATE);
> > +        if (incoming_postcopy) {
> > +            postcopy_incoming_prepare();
> > +        }
> 
> How about moving postcopy_incoming_prepare() into qemu_start_incoming_migration()?
> 
> >          int ret = qemu_start_incoming_migration(incoming);
> >          if (ret < 0) {
> >              fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
> > @@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
> >      bdrv_close_all();
> >      pause_all_vcpus();
> >      net_cleanup();
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_cleanup();
> > +    }
> >      res_free();
> >  
> >      return 0;
> 
> Orit
> 

-- 
yamahata

* Re: [Qemu-devel] [PATCH 21/21] postcopy: implement postcopy livemigration
@ 2012-01-04  3:34       ` Isaku Yamahata
From: Isaku Yamahata @ 2012-01-04  3:34 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 05:51:36PM +0200, Orit Wasserman wrote:
> Hi,

Thank you for the review.

> A general comment: this patch is a bit too long, which makes it hard to review.
> Can you split it please?

Will do. I'll probably split it into the umem.[hc] part, the incoming part, and the outgoing part.


> On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > This patch implements postcopy livemigration.
> > 
> > Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
> > ---
> >  Makefile.target           |    4 +
> >  arch_init.c               |   26 +-
> >  cpu-all.h                 |    7 +
> >  exec.c                    |   20 +-
> >  migration-exec.c          |    8 +
> >  migration-fd.c            |   30 +
> >  migration-postcopy-stub.c |   77 ++
> >  migration-postcopy.c      | 1891 +++++++++++++++++++++++++++++++++++++++++++++
> >  migration-tcp.c           |   37 +-
> >  migration-unix.c          |   32 +-
> >  migration.c               |   31 +
> >  migration.h               |   30 +
> >  qemu-common.h             |    1 +
> >  qemu-options.hx           |    5 +-
> >  umem.c                    |  379 +++++++++
> >  umem.h                    |  105 +++
> >  vl.c                      |   14 +-
> >  17 files changed, 2677 insertions(+), 20 deletions(-)
> >  create mode 100644 migration-postcopy-stub.c
> >  create mode 100644 migration-postcopy.c
> >  create mode 100644 umem.c
> >  create mode 100644 umem.h
> > 
> > diff --git a/Makefile.target b/Makefile.target
> > index 3261383..d94c53f 100644
> > --- a/Makefile.target
> > +++ b/Makefile.target
> > @@ -4,6 +4,7 @@ GENERATED_HEADERS = config-target.h
> >  CONFIG_NO_PCI = $(if $(subst n,,$(CONFIG_PCI)),n,y)
> >  CONFIG_NO_KVM = $(if $(subst n,,$(CONFIG_KVM)),n,y)
> >  CONFIG_NO_XEN = $(if $(subst n,,$(CONFIG_XEN)),n,y)
> > +CONFIG_NO_POSTCOPY = $(if $(subst n,,$(CONFIG_POSTCOPY)),n,y)
> >  
> >  include ../config-host.mak
> >  include config-devices.mak
> > @@ -199,6 +200,9 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o
> >  obj-y += memory.o
> >  LIBS+=-lz
> >  
> > +common-obj-$(CONFIG_POSTCOPY) += migration-postcopy.o umem.o
> > +common-obj-$(CONFIG_NO_POSTCOPY) += migration-postcopy-stub.o
> > +
> >  QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
> >  QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
> >  QEMU_CFLAGS += $(VNC_JPEG_CFLAGS)
> > diff --git a/arch_init.c b/arch_init.c
> > index bc53092..8b3130d 100644
> > --- a/arch_init.c
> > +++ b/arch_init.c
> > @@ -102,6 +102,13 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
> >      return 1;
> >  }
> >  
> > +static bool outgoing_postcopy = false;
> > +
> > +void ram_save_set_params(const MigrationParams *params, void *opaque)
> > +{
> > +    outgoing_postcopy = params->postcopy;
> > +}
> > +
> >  static RAMBlock *last_block_sent = NULL;
> >  
> >  int ram_save_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
> > @@ -284,6 +291,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >      uint64_t expected_time = 0;
> >      int ret;
> >  
> > +    if (stage == 1) {
> > +        last_block_sent = NULL;
> > +
> > +        bytes_transferred = 0;
> > +        last_block = NULL;
> > +        last_offset = 0;
> 
> This changes the order of these lines and adds a new empty line.
> 
> > +    }
> > +    if (outgoing_postcopy) {
> > +        return postcopy_outgoing_ram_save_live(mon, f, stage, opaque);
> > +    }
> > +
> 
> I would just call unregister_savevm_live() and then
> register_savevm_live(..., postcopy_outgoing_ram_save_live, ...)
> when starting outgoing postcopy migration.
> 
> >      if (stage < 0) {
> >          cpu_physical_memory_set_dirty_tracking(0);
> >          return 0;
> > @@ -295,10 +313,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >      }
> >  
> >      if (stage == 1) {
> > -        bytes_transferred = 0;
> > -        last_block_sent = NULL;
> > -        last_block = NULL;
> > -        last_offset = 0;
> >          sort_ram_list();
> >  
> >          /* Make sure all dirty bits are set */
> > @@ -436,6 +450,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
> >      int flags;
> >      int error;
> >  
> > +    if (incoming_postcopy) {
> > +        return postcopy_incoming_ram_load(f, opaque, version_id);
> > +    }
> > +
> Why not call register_savevm_live(..., postcopy_incoming_ram_load, ...) when starting the guest with postcopy incoming?
> 
> >      if (version_id < 3 || version_id > RAM_SAVE_VERSION_ID) {
> >          return -EINVAL;
> >      }
> > diff --git a/cpu-all.h b/cpu-all.h
> > index 0244f7a..2e9d8a7 100644
> > --- a/cpu-all.h
> > +++ b/cpu-all.h
> > @@ -475,6 +475,9 @@ extern ram_addr_t ram_size;
> >  /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
> >  #define RAM_PREALLOC_MASK   (1 << 0)
> >  
> > +/* RAM is allocated via umem for postcopy incoming mode */
> > +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> > +
> >  typedef struct RAMBlock {
> >      uint8_t *host;
> >      ram_addr_t offset;
> > @@ -485,6 +488,10 @@ typedef struct RAMBlock {
> >  #if defined(__linux__) && !defined(TARGET_S390X)
> >      int fd;
> >  #endif
> > +
> > +#ifdef CONFIG_POSTCOPY
> > +    UMem *umem;    /* for incoming postcopy mode */
> > +#endif
> >  } RAMBlock;
> >  
> >  typedef struct RAMList {
> > diff --git a/exec.c b/exec.c
> > index c8c6692..90b0491 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -35,6 +35,7 @@
> >  #include "qemu-timer.h"
> >  #include "memory.h"
> >  #include "exec-memory.h"
> > +#include "migration.h"
> >  #if defined(CONFIG_USER_ONLY)
> >  #include <qemu.h>
> >  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> > @@ -2949,6 +2950,13 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
> >          new_block->host = host;
> >          new_block->flags |= RAM_PREALLOC_MASK;
> >      } else {
> > +#ifdef CONFIG_POSTCOPY
> > +        if (incoming_postcopy) {
> > +            postcopy_incoming_ram_alloc(name, size,
> > +                                        &new_block->host, &new_block->umem);
> > +            new_block->flags |= RAM_POSTCOPY_UMEM_MASK;
> > +        } else
> > +#endif
> >          if (mem_path) {
> >  #if defined (__linux__) && !defined(TARGET_S390X)
> >              new_block->host = file_ram_alloc(new_block, size, mem_path);
> > @@ -3027,7 +3035,13 @@ void qemu_ram_free(ram_addr_t addr)
> >              QLIST_REMOVE(block, next);
> >              if (block->flags & RAM_PREALLOC_MASK) {
> >                  ;
> > -            } else if (mem_path) {
> > +            }
> > +#ifdef CONFIG_POSTCOPY
> > +            else if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> > +                postcopy_incoming_ram_free(block->umem);
> > +            }
> > +#endif
> > +            else if (mem_path) {
> >  #if defined (__linux__) && !defined(TARGET_S390X)
> >                  if (block->fd) {
> >                      munmap(block->host, block->length);
> > @@ -3073,6 +3087,10 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
> >              } else {
> >                  flags = MAP_FIXED;
> >                  munmap(vaddr, length);
> > +                if (block->flags & RAM_POSTCOPY_UMEM_MASK) {
> > +                    postcopy_incoming_qemu_pages_unmapped(addr, length);
> > +                    block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> > +                }
> >                  if (mem_path) {
> >  #if defined(__linux__) && !defined(TARGET_S390X)
> >                      if (block->fd) {
> > diff --git a/migration-exec.c b/migration-exec.c
> > index e14552e..2bd0c3b 100644
> > --- a/migration-exec.c
> > +++ b/migration-exec.c
> > @@ -62,6 +62,10 @@ int exec_start_outgoing_migration(MigrationState *s, const char *command)
> >  {
> >      FILE *f;
> >  
> > +    if (s->params.postcopy) {
> > +        return -ENOSYS;
> > +    }
> > +
> >      f = popen(command, "w");
> >      if (f == NULL) {
> >          DPRINTF("Unable to popen exec target\n");
> > @@ -104,6 +108,10 @@ int exec_start_incoming_migration(const char *command)
> >  {
> >      QEMUFile *f;
> >  
> > +    if (incoming_postcopy) {
> > +        return -ENOSYS;
> > +    }
> > +
> >      DPRINTF("Attempting to start an incoming migration\n");
> >      f = qemu_popen_cmd(command, "r");
> >      if(f == NULL) {
> > diff --git a/migration-fd.c b/migration-fd.c
> > index 6211124..5a62ab9 100644
> > --- a/migration-fd.c
> > +++ b/migration-fd.c
> > @@ -88,6 +88,23 @@ int fd_start_outgoing_migration(MigrationState *s, const char *fdname)
> >      s->write = fd_write;
> >      s->close = fd_close;
> >  
> > +    if (s->params.postcopy) {
> > +        int flags = fcntl(s->fd, F_GETFL);
> > +        if ((flags & O_ACCMODE) != O_RDWR) {
> > +            goto err_after_open;
> > +        }
> > +
> > +        s->fd_read = dup(s->fd);
> > +        if (s->fd_read == -1) {
> > +            goto err_after_open;
> > +        }
> > +        s->file_read = qemu_fdopen(s->fd_read, "r");
> > +        if (s->file_read == NULL) {
> > +            close(s->fd_read);
> > +            goto err_after_open;
> > +        }
> > +    }
> > +
> >      migrate_fd_connect(s);
> >      return 0;
> >  
> > @@ -103,7 +120,14 @@ static void fd_accept_incoming_migration(void *opaque)
> >  
> >      process_incoming_migration(f);
> >      qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(qemu_stdio_fd(f), f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_ready();
> > +    }
> > +    return;
> >  }
> >  
> >  int fd_start_incoming_migration(const char *infd)
> > @@ -114,6 +138,12 @@ int fd_start_incoming_migration(const char *infd)
> >      DPRINTF("Attempting to start an incoming migration via fd\n");
> >  
> >      fd = strtol(infd, NULL, 0);
> > +    if (incoming_postcopy) {
> > +        int flags = fcntl(fd, F_GETFL);
> > +        if ((flags & O_ACCMODE) != O_RDWR) {
> > +            return -EINVAL;
> > +        }
> > +    }
> >      f = qemu_fdopen(fd, "rb");
> >      if(f == NULL) {
> >          DPRINTF("Unable to apply qemu wrapper to file descriptor\n");
> > diff --git a/migration-postcopy-stub.c b/migration-postcopy-stub.c
> > new file mode 100644
> > index 0000000..0b78de7
> > --- /dev/null
> > +++ b/migration-postcopy-stub.c
> > @@ -0,0 +1,77 @@
> > +/*
> > + * migration-postcopy-stub.c: postcopy livemigration
> > + *                            stub functions for non-supported hosts
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "sysemu.h"
> > +#include "migration.h"
> > +
> > +int postcopy_outgoing_create_read_socket(MigrationState *s)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void *postcopy_outgoing_begin(MigrationState *ms)
> > +{
> > +    return NULL;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void postcopy_incoming_prepare(void)
> > +{
> > +}
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    return -ENOSYS;
> > +}
> > +
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_ready(void)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_cleanup(void)
> > +{
> > +}
> > +
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> > +{
> > +}
> > diff --git a/migration-postcopy.c b/migration-postcopy.c
> > new file mode 100644
> > index 0000000..ed0d574
> > --- /dev/null
> > +++ b/migration-postcopy.c
> > @@ -0,0 +1,1891 @@
> > +/*
> > + * migration-postcopy.c: postcopy livemigration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "bitmap.h"
> > +#include "sysemu.h"
> > +#include "hw/hw.h"
> > +#include "arch_init.h"
> > +#include "migration.h"
> > +#include "umem.h"
> > +
> > +#include "memory.h"
> > +#define WANT_EXEC_OBSOLETE
> > +#include "exec-obsolete.h"
> > +
> > +//#define DEBUG_POSTCOPY
> > +#ifdef DEBUG_POSTCOPY
> > +#include <sys/syscall.h>
> > +#define DPRINTF(fmt, ...)                                               \
> > +    do {                                                                \
> > +        printf("%d:%ld %s:%d: " fmt, getpid(), syscall(SYS_gettid),     \
> > +               __func__, __LINE__, ## __VA_ARGS__);                     \
> > +    } while (0)
> > +#else
> > +#define DPRINTF(fmt, ...)       do { } while (0)
> > +#endif
> > +
> > +#define ALIGN_UP(size, align)   (((size) + (align) - 1) & ~((align) - 1))
> > +
> > +static void fd_close(int *fd)
> > +{
> > +    if (*fd >= 0) {
> > +        close(*fd);
> > +        *fd = -1;
> > +    }
> > +}
> > +
> > +/***************************************************************************
> > + * QEMUFile for non blocking pipe
> > + */
> > +
> > +/* read only */
> > +struct QEMUFilePipe {
> > +    int fd;
> > +    QEMUFile *file;
> > +};
> 
> Why not use QEMUFileSocket?

Okay, I'll rename it to QEMUFile_FD (or similar) and share the struct.


> > +typedef struct QEMUFilePipe QEMUFilePipe;
> > +
> > +static int pipe_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> > +{
> > +    QEMUFilePipe *s = opaque;
> > +    ssize_t len = 0;
> > +
> > +    while (size > 0) {
> > +        ssize_t ret = read(s->fd, buf, size);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            }
> > +            if (len == 0) {
> > +                len = -errno;
> > +            }
> > +            break;
> > +        }
> > +
> > +        if (ret == 0) {
> > +            /* the write end of the pipe is closed */
> > +            break;
> > +        }
> > +        len += ret;
> > +        buf += ret;
> > +        size -= ret;
> > +    }
> > +
> > +    return len;
> > +}
> > +
> > +static int pipe_close(void *opaque)
> > +{
> > +    QEMUFilePipe *s = opaque;
> > +    g_free(s);
> > +    return 0;
> > +}
> > +
> > +static QEMUFile *qemu_fopen_pipe(int fd)
> > +{
> > +    QEMUFilePipe *s = g_malloc0(sizeof(*s));
> > +
> > +    s->fd = fd;
> > +    fcntl_setfl(fd, O_NONBLOCK);
> > +    s->file = qemu_fopen_ops(s, NULL, pipe_get_buffer, pipe_close,
> > +                             NULL, NULL, NULL);
> > +    return s->file;
> > +}
> > +
> > +/* write only */
> > +struct QEMUFileNonblock {
> > +    int fd;
> > +    QEMUFile *file;
> > +
> > +    /* for pipe-write nonblocking mode */
> > +#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE */
> > +    uint8_t *buffer;
> > +    size_t buffer_size;
> > +    size_t buffer_capacity;
> > +    bool freeze_output;
> > +};
> > +typedef struct QEMUFileNonblock QEMUFileNonblock;
> > +
> 
> Couldn't you use QEMUFileBuffered?

QEMUFileBuffered can be built on top of QEMUFileNonblock.
I'll refactor buffered_file.c accordingly.


> > +static void nonblock_flush_buffer(QEMUFileNonblock *s)
> > +{
> > +    size_t offset = 0;
> > +    ssize_t ret;
> > +
> > +    while (offset < s->buffer_size) {
> > +        ret = write(s->fd, s->buffer + offset, s->buffer_size - offset);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EAGAIN) {
> > +                s->freeze_output = true;
> > +            } else {
> > +                qemu_file_set_error(s->file, errno);
> > +            }
> > +            break;
> > +        }
> > +
> > +        if (ret == 0) {
> > +            DPRINTF("ret == 0\n");
> > +            break;
> > +        }
> > +
> > +        offset += ret;
> > +    }
> > +
> > +    if (offset > 0) {
> > +        assert(s->buffer_size >= offset);
> > +        memmove(s->buffer, s->buffer + offset, s->buffer_size - offset);
> > +        s->buffer_size -= offset;
> > +    }
> > +    if (s->buffer_size > 0) {
> > +        s->freeze_output = true;
> > +    }
> > +}
> > +
> > +static int nonblock_put_buffer(void *opaque,
> > +                               const uint8_t *buf, int64_t pos, int size)
> > +{
> > +    QEMUFileNonblock *s = opaque;
> > +    int error;
> > +    ssize_t len = 0;
> > +
> > +    error = qemu_file_get_error(s->file);
> > +    if (error) {
> > +        return error;
> > +    }
> > +
> > +    nonblock_flush_buffer(s);
> > +    error = qemu_file_get_error(s->file);
> > +    if (error) {
> > +        return error;
> > +    }
> > +
> > +    while (!s->freeze_output && size > 0) {
> > +        ssize_t ret;
> > +        assert(s->buffer_size == 0);
> > +
> > +        ret = write(s->fd, buf, size);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EAGAIN) {
> > +                s->freeze_output = true;
> > +            } else {
> > +                qemu_file_set_error(s->file, errno);
> > +            }
> > +            break;
> > +        }
> > +
> > +        len += ret;
> > +        buf += ret;
> > +        size -= ret;
> > +    }
> > +
> > +    if (size > 0) {
> > +        int inc = size - (s->buffer_capacity - s->buffer_size);
> > +        if (inc > 0) {
> > +            s->buffer_capacity +=
> > +                DIV_ROUND_UP(inc, BUF_SIZE_INC) * BUF_SIZE_INC;
> > +            s->buffer = g_realloc(s->buffer, s->buffer_capacity);
> > +        }
> > +        memcpy(s->buffer + s->buffer_size, buf, size);
> > +        s->buffer_size += size;
> > +
> > +        len += size;
> > +    }
> > +
> > +    return len;
> > +}
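When the fd would block, the tail of nonblock_put_buffer() grows the staging buffer in BUF_SIZE_INC units, just enough to absorb the unsent bytes. A standalone sketch of the capacity computation (grown_capacity() is a hypothetical helper, not in the patch):

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE_INC    (32 * 1024)     /* = IO_BUF_SIZE, as in the patch */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Grow-by-chunks policy: extend capacity in BUF_SIZE_INC units, by the
 * minimum number of chunks that fits `size` more bytes after `used`. */
static size_t grown_capacity(size_t capacity, size_t used, size_t size)
{
    size_t need = used + size;
    if (need <= capacity) {
        return capacity;        /* inc <= 0 in the patch: no realloc */
    }
    return capacity + DIV_ROUND_UP(need - capacity, BUF_SIZE_INC) * BUF_SIZE_INC;
}
```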
> > +
> > +static int nonblock_pending_size(QEMUFileNonblock *s)
> > +{
> > +    return qemu_pending_size(s->file) + s->buffer_size;
> > +}
> > +
> > +static void nonblock_fflush(QEMUFileNonblock *s)
> > +{
> > +    s->freeze_output = false;
> > +    nonblock_flush_buffer(s);
> > +    if (!s->freeze_output) {
> > +        qemu_fflush(s->file);
> > +    }
> > +}
> > +
> > +static void nonblock_wait_for_flush(QEMUFileNonblock *s)
> > +{
> > +    while (nonblock_pending_size(s) > 0) {
> > +        fd_set fds;
> > +        FD_ZERO(&fds);
> > +        FD_SET(s->fd, &fds);
> > +        select(s->fd + 1, NULL, &fds, NULL, NULL);
> > +
> > +        nonblock_fflush(s);
> > +    }
> > +}
> > +
> > +static int nonblock_close(void *opaque)
> > +{
> > +    QEMUFileNonblock *s = opaque;
> > +    nonblock_wait_for_flush(s);
> > +    g_free(s->buffer);
> > +    g_free(s);
> > +    return 0;
> > +}
> > +
> > +static QEMUFileNonblock *qemu_fopen_nonblock(int fd)
> > +{
> > +    QEMUFileNonblock *s = g_malloc0(sizeof(*s));
> > +
> > +    s->fd = fd;
> > +    fcntl_setfl(fd, O_NONBLOCK);
> > +    s->file = qemu_fopen_ops(s, nonblock_put_buffer, NULL, nonblock_close,
> > +                             NULL, NULL, NULL);
> > +    return s;
> > +}
> > +
> > +/***************************************************************************
> > + * umem daemon on destination <-> qemu on source protocol
> > + */
> > +
> > +#define QEMU_UMEM_REQ_INIT              0x00
> > +#define QEMU_UMEM_REQ_ON_DEMAND         0x01
> > +#define QEMU_UMEM_REQ_ON_DEMAND_CONT    0x02
> > +#define QEMU_UMEM_REQ_BACKGROUND        0x03
> > +#define QEMU_UMEM_REQ_BACKGROUND_CONT   0x04
> > +#define QEMU_UMEM_REQ_REMOVE            0x05
> > +#define QEMU_UMEM_REQ_EOC               0x06
> > +
> > +struct qemu_umem_req {
> > +    int8_t cmd;
> > +    uint8_t len;
> > +    char *idstr;        /* ON_DEMAND, BACKGROUND, REMOVE */
> > +    uint32_t nr;        /* ON_DEMAND, ON_DEMAND_CONT,
> > +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> > +
> > +    /* in target page size, as in the qemu migration protocol */
> > +    uint64_t *pgoffs;   /* ON_DEMAND, ON_DEMAND_CONT,
> > +                           BACKGROUND, BACKGROUND_CONT, REMOVE */
> > +};
> > +
> > +static void postcopy_incoming_send_req_idstr(QEMUFile *f, const char* idstr)
> > +{
> > +    qemu_put_byte(f, strlen(idstr));
> > +    qemu_put_buffer(f, (uint8_t *)idstr, strlen(idstr));
> > +}
> > +
> > +static void postcopy_incoming_send_req_pgoffs(QEMUFile *f, uint32_t nr,
> > +                                              const uint64_t *pgoffs)
> > +{
> > +    uint32_t i;
> > +
> > +    qemu_put_be32(f, nr);
> > +    for (i = 0; i < nr; i++) {
> > +        qemu_put_be64(f, pgoffs[i]);
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_send_req_one(QEMUFile *f,
> > +                                           const struct qemu_umem_req *req)
> > +{
> > +    DPRINTF("cmd %d\n", req->cmd);
> > +    qemu_put_byte(f, req->cmd);
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        postcopy_incoming_send_req_idstr(f, req->idstr);
> > +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        postcopy_incoming_send_req_pgoffs(f, req->nr, req->pgoffs);
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +}
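For reference, the framing produced by postcopy_incoming_send_req_idstr() and ..._send_req_pgoffs() is: a cmd byte, then optionally (len byte + idstr), then nr as be32 followed by nr page offsets as be64. A minimal standalone encoder for the pgoffs part — put_be32()/encode_pgoffs() are illustrative names, not patch code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian encoders mirroring qemu_put_be32()/qemu_put_be64(). */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = (uint8_t)v;
}

/* Encode "nr (be32), then nr offsets (be64 each)" into `out`;
 * returns the number of bytes written. */
static size_t encode_pgoffs(uint8_t *out, uint32_t nr, const uint64_t *pgoffs)
{
    size_t off = 0;
    uint32_t i;

    put_be32(out + off, nr);
    off += 4;
    for (i = 0; i < nr; i++) {
        int b;
        for (b = 0; b < 8; b++) {
            out[off + b] = (uint8_t)(pgoffs[i] >> (56 - 8 * b));
        }
        off += 8;
    }
    return off;
}
```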
> > +
> > +/* QEMUFile can buffer up to IO_BUF_SIZE = 32 * 1024 bytes,
> > + * so one message must be <= IO_BUF_SIZE:
> > + * cmd: 1
> > + * id len: 1
> > + * id: 256
> > + * nr: 4 (be32)
> > + */
> > +#define MAX_PAGE_NR     ((32 * 1024 - 1 - 1 - 256 - 4) / sizeof(uint64_t))
> > +static void postcopy_incoming_send_req(QEMUFile *f,
> > +                                       const struct qemu_umem_req *req)
> > +{
> > +    uint32_t nr = req->nr;
> > +    struct qemu_umem_req tmp = *req;
> > +
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        postcopy_incoming_send_req_one(f, &tmp);
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +        tmp.nr = MIN(nr, MAX_PAGE_NR);
> > +        postcopy_incoming_send_req_one(f, &tmp);
> > +
> > +        nr -= tmp.nr;
> > +        tmp.pgoffs += tmp.nr;
> > +        if (tmp.cmd == QEMU_UMEM_REQ_ON_DEMAND) {
> > +            tmp.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> > +        } else {
> > +            tmp.cmd = QEMU_UMEM_REQ_BACKGROUND_CONT;
> > +        }
> > +        /* fall through */
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        while (nr > 0) {
> > +            tmp.nr = MIN(nr, MAX_PAGE_NR);
> > +            postcopy_incoming_send_req_one(f, &tmp);
> > +
> > +            nr -= tmp.nr;
> > +            tmp.pgoffs += tmp.nr;
> > +        }
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +}
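So a large request fans out into ceil(nr / MAX_PAGE_NR) messages: the first carries the idstr, and the *_CONT messages carry the remaining offsets. A quick arithmetic sketch (the `- 4` below assumes nr is written as be32; the patch's `- 2` floors to the same value, 4063):

```c
#include <assert.h>
#include <stdint.h>

/* Message budget: 32 KiB QEMUFile buffer minus cmd(1) + id len(1) +
 * idstr(256) + nr(4), divided by 8 bytes per be64 page offset. */
#define MAX_PAGE_NR ((32 * 1024 - 1 - 1 - 256 - 4) / sizeof(uint64_t))

/* How many protocol messages a request for `nr` (> 0) page offsets
 * becomes after the splitting loop in postcopy_incoming_send_req(). */
static unsigned messages_for(uint64_t nr)
{
    return (unsigned)((nr + MAX_PAGE_NR - 1) / MAX_PAGE_NR);
}
```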
> > +
> > +static int postcopy_outgoing_recv_req_idstr(QEMUFile *f,
> > +                                            struct qemu_umem_req *req,
> > +                                            size_t *offset)
> > +{
> > +    int ret;
> > +
> > +    req->len = qemu_peek_byte(f, *offset);
> > +    *offset += 1;
> > +    if (req->len == 0) {
> > +        return -EAGAIN;
> > +    }
> > +    req->idstr = g_malloc((int)req->len + 1);
> > +    ret = qemu_peek_buffer(f, (uint8_t*)req->idstr, req->len, *offset);
> > +    *offset += ret;
> > +    if (ret != req->len) {
> > +        g_free(req->idstr);
> > +        req->idstr = NULL;
> > +        return -EAGAIN;
> > +    }
> > +    req->idstr[req->len] = 0;
> > +    return 0;
> > +}
> > +
> > +static int postcopy_outgoing_recv_req_pgoffs(QEMUFile *f,
> > +                                             struct qemu_umem_req *req,
> > +                                             size_t *offset)
> > +{
> > +    int ret;
> > +    uint32_t be32;
> > +    uint32_t i;
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)&be32, sizeof(be32), *offset);
> > +    *offset += sizeof(be32);
> > +    if (ret != sizeof(be32)) {
> > +        return -EAGAIN;
> > +    }
> > +
> > +    req->nr = be32_to_cpu(be32);
> > +    req->pgoffs = g_new(uint64_t, req->nr);
> > +    for (i = 0; i < req->nr; i++) {
> > +        uint64_t be64;
> > +        ret = qemu_peek_buffer(f, (uint8_t*)&be64, sizeof(be64), *offset);
> > +        *offset += sizeof(be64);
> > +        if (ret != sizeof(be64)) {
> > +            g_free(req->pgoffs);
> > +            req->pgoffs = NULL;
> > +            return -EAGAIN;
> > +        }
> > +        req->pgoffs[i] = be64_to_cpu(be64);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int postcopy_outgoing_recv_req(QEMUFile *f, struct qemu_umem_req *req)
> > +{
> > +    int size;
> > +    int ret;
> > +    size_t offset = 0;
> > +
> > +    size = qemu_peek_buffer(f, (uint8_t*)&req->cmd, 1, offset);
> > +    if (size <= 0) {
> > +        return -EAGAIN;
> > +    }
> > +    offset += 1;
> > +
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        ret = postcopy_outgoing_recv_req_idstr(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        break;
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        ret = postcopy_outgoing_recv_req_pgoffs(f, req, &offset);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +    qemu_file_skip(f, offset);
> > +    DPRINTF("cmd %d\n", req->cmd);
> > +    return 0;
> > +}
> > +
> > +static void postcopy_outgoing_free_req(struct qemu_umem_req *req)
> > +{
> > +    g_free(req->idstr);
> > +    g_free(req->pgoffs);
> > +}
> > +
> > +/***************************************************************************
> > + * outgoing part
> > + */
> > +
> > +#define QEMU_SAVE_LIVE_STAGE_START      0x01    /* = QEMU_VM_SECTION_START */
> > +#define QEMU_SAVE_LIVE_STAGE_PART       0x02    /* = QEMU_VM_SECTION_PART */
> > +#define QEMU_SAVE_LIVE_STAGE_END        0x03    /* = QEMU_VM_SECTION_END */
> > +
> > +enum POState {
> > +    PO_STATE_ERROR_RECEIVE,
> > +    PO_STATE_ACTIVE,
> > +    PO_STATE_EOC_RECEIVED,
> > +    PO_STATE_ALL_PAGES_SENT,
> > +    PO_STATE_COMPLETED,
> > +};
> > +typedef enum POState POState;
> > +
> > +struct PostcopyOutgoingState {
> > +    POState state;
> > +    QEMUFile *mig_read;
> > +    int fd_read;
> > +    RAMBlock *last_block_read;
> > +
> > +    QEMUFile *mig_buffered_write;
> > +    MigrationState *ms;
> > +
> > +    /* for nobg mode: check whether all pages have been sent */
> > +    RAMBlock *block;
> > +    ram_addr_t addr;
> > +};
> > +typedef struct PostcopyOutgoingState PostcopyOutgoingState;
> > +
> > +int postcopy_outgoing_create_read_socket(MigrationState *s)
> > +{
> > +    if (!s->params.postcopy) {
> > +        return 0;
> > +    }
> > +
> > +    s->fd_read = dup(s->fd);
> > +    if (s->fd_read == -1) {
> > +        int ret = -errno;
> > +        perror("dup");
> > +        return ret;
> > +    }
> > +    s->file_read = qemu_fopen_socket(s->fd_read);
> > +    if (s->file_read == NULL) {
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque)
> > +{
> > +    int ret = 0;
> > +    DPRINTF("stage %d\n", stage);
> > +    if (stage == QEMU_SAVE_LIVE_STAGE_START) {
> > +        sort_ram_list();
> > +        ram_save_live_mem_size(f);
> > +    }
> > +    if (stage == QEMU_SAVE_LIVE_STAGE_PART) {
> > +        ret = 1;
> > +    }
> > +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +    return ret;
> > +}
> > +
> > +static RAMBlock *postcopy_outgoing_find_block(const char *idstr)
> > +{
> > +    RAMBlock *block;
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (!strncmp(idstr, block->idstr, strlen(idstr))) {
> > +            return block;
> > +        }
> > +    }
> > +    return NULL;
> > +}
> > +
> > +/*
> > + * return value
> > + *   0: continue postcopy mode
> > + * > 0: completed postcopy mode.
> > + * < 0: error
> > + */
> > +static int postcopy_outgoing_handle_req(PostcopyOutgoingState *s,
> > +                                        const struct qemu_umem_req *req,
> > +                                        bool *written)
> > +{
> > +    int i;
> > +    RAMBlock *block;
> > +
> > +    DPRINTF("cmd %d state %d\n", req->cmd, s->state);
> > +    switch (req->cmd) {
> > +    case QEMU_UMEM_REQ_INIT:
> > +        /* nothing */
> > +        break;
> > +    case QEMU_UMEM_REQ_EOC:
> > +        /* the destination tells us to finish migration */
> > +        if (s->state == PO_STATE_ALL_PAGES_SENT) {
> > +            s->state = PO_STATE_COMPLETED;
> > +            DPRINTF("-> PO_STATE_COMPLETED\n");
> > +        } else {
> > +            s->state = PO_STATE_EOC_RECEIVED;
> > +            DPRINTF("-> PO_STATE_EOC_RECEIVED\n");
> > +        }
> > +        return 1;
> > +    case QEMU_UMEM_REQ_ON_DEMAND:
> > +    case QEMU_UMEM_REQ_BACKGROUND:
> > +        DPRINTF("idstr: %s\n", req->idstr);
> > +        block = postcopy_outgoing_find_block(req->idstr);
> > +        if (block == NULL) {
> > +            return -EINVAL;
> > +        }
> > +        s->last_block_read = block;
> > +        /* fall through */
> > +    case QEMU_UMEM_REQ_ON_DEMAND_CONT:
> > +    case QEMU_UMEM_REQ_BACKGROUND_CONT:
> > +        DPRINTF("nr %d\n", req->nr);
> > +        for (i = 0; i < req->nr; i++) {
> > +            DPRINTF("offs[%d] 0x%"PRIx64"\n", i, req->pgoffs[i]);
> > +            int ret = ram_save_page(s->mig_buffered_write, s->last_block_read,
> > +                                    req->pgoffs[i] << TARGET_PAGE_BITS);
> > +            if (ret > 0) {
> > +                *written = true;
> > +            }
> > +        }
> > +        break;
> > +    case QEMU_UMEM_REQ_REMOVE:
> > +        block = postcopy_outgoing_find_block(req->idstr);
> > +        if (block == NULL) {
> > +            return -EINVAL;
> > +        }
> > +        for (i = 0; i < req->nr; i++) {
> > +            ram_addr_t addr = block->offset +
> > +                (req->pgoffs[i] << TARGET_PAGE_BITS);
> > +            cpu_physical_memory_reset_dirty(addr,
> > +                                            addr + TARGET_PAGE_SIZE,
> > +                                            MIGRATION_DIRTY_FLAG);
> > +        }
> > +        break;
> > +    default:
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void postcopy_outgoing_close_mig_read(PostcopyOutgoingState *s)
> > +{
> > +    if (s->mig_read != NULL) {
> > +        qemu_set_fd_handler(s->fd_read, NULL, NULL, NULL);
> > +        qemu_fclose(s->mig_read);
> > +        s->mig_read = NULL;
> > +        fd_close(&s->fd_read);
> > +
> > +        s->ms->file_read = NULL;
> > +        s->ms->fd_read = -1;
> > +    }
> > +}
> > +
> > +static void postcopy_outgoing_completed(PostcopyOutgoingState *s)
> > +{
> > +    postcopy_outgoing_close_mig_read(s);
> > +    s->ms->postcopy = NULL;
> > +    g_free(s);
> > +}
> > +
> > +static void postcopy_outgoing_recv_handler(void *opaque)
> > +{
> > +    PostcopyOutgoingState *s = opaque;
> > +    bool written = false;
> > +    int ret = 0;
> > +
> > +    assert(s->state == PO_STATE_ACTIVE ||
> > +           s->state == PO_STATE_ALL_PAGES_SENT);
> > +
> > +    do {
> > +        struct qemu_umem_req req = {.idstr = NULL,
> > +                                    .pgoffs = NULL};
> > +
> > +        ret = postcopy_outgoing_recv_req(s->mig_read, &req);
> > +        if (ret < 0) {
> > +            if (ret == -EAGAIN) {
> > +                ret = 0;
> > +            }
> > +            break;
> > +        }
> > +        if (s->state == PO_STATE_ACTIVE) {
> > +            ret = postcopy_outgoing_handle_req(s, &req, &written);
> > +        }
> > +        postcopy_outgoing_free_req(&req);
> > +    } while (ret == 0);
> > +
> > +    /*
> > +     * Flush buffered_file.
> > +     * Although mig_buffered_write is a rate-limited buffered file, the
> > +     * pages just written were requested on demand by the destination,
> > +     * so push them out forcibly, ignoring the rate limit.
> > +     */
> > +    if (written) {
> > +        qemu_fflush(s->mig_buffered_write);
> > +        /* qemu_buffered_file_drain(s->mig_buffered_write); */
> > +    }
> > +
> > +    if (ret < 0) {
> > +        switch (s->state) {
> > +        case PO_STATE_ACTIVE:
> > +            s->state = PO_STATE_ERROR_RECEIVE;
> > +            DPRINTF("-> PO_STATE_ERROR_RECEIVE\n");
> > +            break;
> > +        case PO_STATE_ALL_PAGES_SENT:
> > +            s->state = PO_STATE_COMPLETED;
> > +            DPRINTF("-> PO_STATE_COMPLETED\n");
> > +            break;
> > +        default:
> > +            abort();
> > +        }
> > +    }
> > +    if (s->state == PO_STATE_ERROR_RECEIVE || s->state == PO_STATE_COMPLETED) {
> > +        postcopy_outgoing_close_mig_read(s);
> > +    }
> > +    if (s->state == PO_STATE_COMPLETED) {
> > +        DPRINTF("PO_STATE_COMPLETED\n");
> > +        MigrationState *ms = s->ms;
> > +        postcopy_outgoing_completed(s);
> > +        migrate_fd_completed(ms);
> > +    }
> > +}
> > +
> > +void *postcopy_outgoing_begin(MigrationState *ms)
> > +{
> > +    PostcopyOutgoingState *s = g_new(PostcopyOutgoingState, 1);
> > +    DPRINTF("outgoing begin\n");
> > +    qemu_fflush(ms->file);
> > +
> > +    s->ms = ms;
> > +    s->state = PO_STATE_ACTIVE;
> > +    s->fd_read = ms->fd_read;
> > +    s->mig_read = ms->file_read;
> > +    s->mig_buffered_write = ms->file;
> > +    s->block = NULL;
> > +    s->addr = 0;
> > +
> > +    /* Make sure all dirty bits are set */
> > +    ram_save_memory_set_dirty();
> > +
> > +    qemu_set_fd_handler(s->fd_read,
> > +                        &postcopy_outgoing_recv_handler, NULL, s);
> > +    return s;
> > +}
> > +
> > +static void postcopy_outgoing_ram_all_sent(QEMUFile *f,
> > +                                           PostcopyOutgoingState *s)
> > +{
> > +    assert(s->state == PO_STATE_ACTIVE);
> > +
> > +    s->state = PO_STATE_ALL_PAGES_SENT;
> > +    /* tell incoming side that all pages are sent */
> > +    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +    qemu_fflush(f);
> > +    qemu_buffered_file_drain(f);
> > +    DPRINTF("sent RAM_SAVE_FLAG_EOS\n");
> > +    migrate_fd_cleanup(s->ms);
> > +
> > +    /* migrate_fd_completed() will be called later, which calls
> > +     * migrate_fd_cleanup() again. So a dummy file is created to keep
> > +     * the qemu monitor working.
> > +     */
> > +    s->ms->file = qemu_fopen_ops(NULL, NULL, NULL, NULL, NULL,
> > +                                 NULL, NULL);
> > +}
> > +
> > +static int postcopy_outgoing_check_all_ram_sent(PostcopyOutgoingState *s,
> > +                                                RAMBlock *block,
> > +                                                ram_addr_t addr)
> > +{
> > +    if (block == NULL) {
> > +        block = QLIST_FIRST(&ram_list.blocks);
> > +        addr = block->offset;
> > +    }
> > +
> > +    for (; block != NULL;
> > +         block = QLIST_NEXT(block, next), addr = block ? block->offset : 0) {
> > +        for (; addr < block->offset + block->length;
> > +             addr += TARGET_PAGE_SIZE) {
> > +            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
> > +                s->block = block;
> > +                s->addr = addr;
> > +                return 0;
> > +            }
> > +        }
> > +    }
> > +
> > +    return 1;
> > +}
> > +
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy)
> > +{
> > +    PostcopyOutgoingState *s = postcopy;
> > +
> > +    assert(s->state == PO_STATE_ACTIVE ||
> > +           s->state == PO_STATE_EOC_RECEIVED ||
> > +           s->state == PO_STATE_ERROR_RECEIVE);
> > +
> > +    switch (s->state) {
> > +    case PO_STATE_ACTIVE:
> > +        /* nothing. processed below */
> > +        break;
> > +    case PO_STATE_EOC_RECEIVED:
> > +        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> > +        s->state = PO_STATE_COMPLETED;
> > +        postcopy_outgoing_completed(s);
> > +        DPRINTF("PO_STATE_COMPLETED\n");
> > +        return 1;
> > +    case PO_STATE_ERROR_RECEIVE:
> > +        postcopy_outgoing_completed(s);
> > +        DPRINTF("PO_STATE_ERROR_RECEIVE\n");
> > +        return -1;
> > +    default:
> > +        abort();
> > +    }
> > +
> > +    if (s->ms->params.nobg) {
> > +        /* See if all pages are sent. */
> > +        if (postcopy_outgoing_check_all_ram_sent(s, s->block, s->addr) == 0) {
> > +            return 0;
> > +        }
> > +        /* ram_list can be reordered (though that doesn't seem to happen
> > +           during migration), so the whole list needs to be checked again */
> > +        if (postcopy_outgoing_check_all_ram_sent(s, NULL, 0) == 0) {
> > +            return 0;
> > +        }
> > +
> > +        postcopy_outgoing_ram_all_sent(f, s);
> > +        return 0;
> > +    }
> > +
> > +    DPRINTF("outgoing background state: %d\n", s->state);
> > +
> > +    while (qemu_file_rate_limit(f) == 0) {
> > +        if (ram_save_block(f) == 0) { /* no more blocks */
> > +            assert(s->state == PO_STATE_ACTIVE);
> > +            postcopy_outgoing_ram_all_sent(f, s);
> > +            return 0;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/***************************************************************************
> > + * incoming part
> > + */
> > +
> > +/* flags for the incoming mode that modify its behavior,
> > +   for benchmark/debug purposes */
> > +#define INCOMING_FLAGS_FAULT_REQUEST 0x01
> > +
> > +
> > +static void postcopy_incoming_umemd(void);
> > +
> > +#define PIS_STATE_QUIT_RECEIVED         0x01
> > +#define PIS_STATE_QUIT_QUEUED           0x02
> > +#define PIS_STATE_QUIT_SENT             0x04
> > +
> > +#define PIS_STATE_QUIT_MASK             (PIS_STATE_QUIT_RECEIVED | \
> > +                                         PIS_STATE_QUIT_QUEUED | \
> > +                                         PIS_STATE_QUIT_SENT)
> > +
> > +struct PostcopyIncomingState {
> > +    /* dest qemu state */
> > +    uint32_t    state;
> > +
> > +    UMemDev *dev;
> > +    int host_page_size;
> > +    int host_page_shift;
> > +
> > +    /* qemu side */
> > +    int to_umemd_fd;
> > +    QEMUFileNonblock *to_umemd;
> > +#define MAX_FAULTED_PAGES       256
> > +    struct umem_pages *faulted_pages;
> > +
> > +    int from_umemd_fd;
> > +    QEMUFile *from_umemd;
> > +    int version_id;     /* save/load format version id */
> > +};
> > +typedef struct PostcopyIncomingState PostcopyIncomingState;
> > +
> > +
> > +#define UMEM_STATE_EOS_RECEIVED         0x01    /* umem daemon <-> src qemu */
> > +#define UMEM_STATE_EOC_SENT             0x02    /* umem daemon <-> src qemu */
> > +#define UMEM_STATE_QUIT_RECEIVED        0x04    /* umem daemon <-> dst qemu */
> > +#define UMEM_STATE_QUIT_QUEUED          0x08    /* umem daemon <-> dst qemu */
> > +#define UMEM_STATE_QUIT_SENT            0x10    /* umem daemon <-> dst qemu */
> > +
> > +#define UMEM_STATE_QUIT_MASK            (UMEM_STATE_QUIT_QUEUED | \
> > +                                         UMEM_STATE_QUIT_SENT | \
> > +                                         UMEM_STATE_QUIT_RECEIVED)
> > +#define UMEM_STATE_END_MASK             (UMEM_STATE_EOS_RECEIVED | \
> > +                                         UMEM_STATE_EOC_SENT | \
> > +                                         UMEM_STATE_QUIT_MASK)
> > +
> > +struct PostcopyIncomingUMemDaemon {
> > +    /* umem daemon side */
> > +    uint32_t state;
> > +
> > +    int host_page_size;
> > +    int host_page_shift;
> > +    int nr_host_pages_per_target_page;
> > +    int host_to_target_page_shift;
> > +    int nr_target_pages_per_host_page;
> > +    int target_to_host_page_shift;
> > +    int version_id;     /* save/load format version id */
> > +
> > +    int to_qemu_fd;
> > +    QEMUFileNonblock *to_qemu;
> > +    int from_qemu_fd;
> > +    QEMUFile *from_qemu;
> > +
> > +    int mig_read_fd;
> > +    QEMUFile *mig_read;         /* qemu on source -> umem daemon */
> > +
> > +    int mig_write_fd;
> > +    QEMUFileNonblock *mig_write;        /* umem daemon -> qemu on source */
> > +
> > +    /* = KVM_MAX_VCPUS * (ASYNC_PF_PER_VCPUS + 1) */
> > +#define MAX_REQUESTS    (512 * (64 + 1))
> > +
> > +    struct umem_page_request page_request;
> > +    struct umem_page_cached page_cached;
> > +
> > +#define MAX_PRESENT_REQUESTS    MAX_FAULTED_PAGES
> > +    struct umem_pages *present_request;
> > +
> > +    uint64_t *target_pgoffs;
> > +
> > +    /* bitmap indexed by target page offset */
> > +    unsigned long *phys_requested;
> > +
> > +    /* bitmap indexed by target page offset */
> > +    unsigned long *phys_received;
> > +
> > +    RAMBlock *last_block_read;  /* qemu on source -> umem daemon */
> > +    RAMBlock *last_block_write; /* umem daemon -> qemu on source */
> > +};
> > +typedef struct PostcopyIncomingUMemDaemon PostcopyIncomingUMemDaemon;
> > +
> > +static PostcopyIncomingState state = {
> > +    .state = 0,
> > +    .dev = NULL,
> > +    .to_umemd_fd = -1,
> > +    .to_umemd = NULL,
> > +    .from_umemd_fd = -1,
> > +    .from_umemd = NULL,
> > +};
> > +
> > +static PostcopyIncomingUMemDaemon umemd = {
> > +    .state = 0,
> > +    .to_qemu_fd = -1,
> > +    .to_qemu = NULL,
> > +    .from_qemu_fd = -1,
> > +    .from_qemu = NULL,
> > +    .mig_read_fd = -1,
> > +    .mig_read = NULL,
> > +    .mig_write_fd = -1,
> > +    .mig_write = NULL,
> > +};
> > +
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy)
> > +{
> > +    /* incoming_postcopy makes sense only in incoming migration mode */
> > +    if (!incoming && incoming_postcopy) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!incoming_postcopy) {
> > +        return 0;
> > +    }
> > +
> > +    state.state = 0;
> > +    state.dev = umem_dev_new();
> > +    state.host_page_size = getpagesize();
> > +    state.host_page_shift = ffs(state.host_page_size) - 1;
> > +    state.version_id = RAM_SAVE_VERSION_ID; /* = save version of
> > +                                               ram_save_live() */
> > +    return 0;
> > +}
> > +
> > +void postcopy_incoming_ram_alloc(const char *name,
> > +                                 size_t size, uint8_t **hostp, UMem **umemp)
> > +{
> > +    UMem *umem;
> > +    size = ALIGN_UP(size, state.host_page_size);
> > +    umem = umem_dev_create(state.dev, size, name);
> > +
> > +    *umemp = umem;
> > +    *hostp = umem->umem;
> > +}
> > +
> > +void postcopy_incoming_ram_free(UMem *umem)
> > +{
> > +    umem_unmap(umem);
> > +    umem_close(umem);
> > +    umem_destroy(umem);
> > +}
> > +
> > +void postcopy_incoming_prepare(void)
> > +{
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (block->umem != NULL) {
> > +            umem_mmap(block->umem);
> > +        }
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_ram_load_get64(QEMUFile *f,
> > +                                             ram_addr_t *addr, int *flags)
> > +{
> > +    *addr = qemu_get_be64(f);
> > +    *flags = *addr & ~TARGET_PAGE_MASK;
> > +    *addr &= TARGET_PAGE_MASK;
> > +    return qemu_file_get_error(f);
> > +}
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    ram_addr_t addr;
> > +    int flags;
> > +    int error;
> > +
> > +    DPRINTF("incoming ram load\n");
> > +    /*
> > +     * RAM_SAVE_FLAG_EOS or
> > +     * RAM_SAVE_FLAG_MEM_SIZE + mem size + RAM_SAVE_FLAG_EOS
> > +     * see postcopy_outgoing_ram_save_live()
> > +     */
> > +
> > +    if (version_id != RAM_SAVE_VERSION_ID) {
> > +        DPRINTF("RAM_SAVE_VERSION_ID %d != %d\n",
> > +                version_id, RAM_SAVE_VERSION_ID);
> > +        return -EINVAL;
> > +    }
> > +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> > +    DPRINTF("addr 0x%lx flags 0x%x\n", addr, flags);
> > +    if (error) {
> > +        DPRINTF("error %d\n", error);
> > +        return error;
> > +    }
> > +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> > +        DPRINTF("EOS\n");
> > +        return 0;
> > +    }
> > +
> > +    if (flags != RAM_SAVE_FLAG_MEM_SIZE) {
> > +        DPRINTF("-EINVAL flags 0x%x\n", flags);
> > +        return -EINVAL;
> > +    }
> > +    error = ram_load_mem_size(f, addr);
> > +    if (error) {
> > +        DPRINTF("addr 0x%lx error %d\n", addr, error);
> > +        return error;
> > +    }
> > +
> > +    error = postcopy_incoming_ram_load_get64(f, &addr, &flags);
> > +    if (error) {
> > +        DPRINTF("addr 0x%lx flags 0x%x error %d\n", addr, flags, error);
> > +        return error;
> > +    }
> > +    if (flags == RAM_SAVE_FLAG_EOS && addr == 0) {
> > +        DPRINTF("done\n");
> > +        return 0;
> > +    }
> > +    DPRINTF("-EINVAL\n");
> > +    return -EINVAL;
> > +}
> > +
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read)
> > +{
> > +    int fds[2];
> > +    RAMBlock *block;
> > +
> > +    DPRINTF("fork\n");
> > +
> > +    /* socketpair(AF_UNIX)? */
> > +
> > +    if (qemu_pipe(fds) == -1) {
> > +        perror("qemu_pipe");
> > +        abort();
> > +    }
> > +    state.from_umemd_fd = fds[0];
> > +    umemd.to_qemu_fd = fds[1];
> > +
> > +    if (qemu_pipe(fds) == -1) {
> > +        perror("qemu_pipe");
> > +        abort();
> > +    }
> > +    umemd.from_qemu_fd = fds[0];
> > +    state.to_umemd_fd = fds[1];
> > +
> > +    pid_t child = fork();
> > +    if (child < 0) {
> > +        perror("fork");
> > +        abort();
> > +    }
> > +
> > +    if (child == 0) {
> > +        int mig_write_fd;
> > +
> > +        fd_close(&state.to_umemd_fd);
> > +        fd_close(&state.from_umemd_fd);
> > +        umemd.host_page_size = state.host_page_size;
> > +        umemd.host_page_shift = state.host_page_shift;
> > +
> > +        umemd.nr_host_pages_per_target_page =
> > +            TARGET_PAGE_SIZE / umemd.host_page_size;
> > +        umemd.nr_target_pages_per_host_page =
> > +            umemd.host_page_size / TARGET_PAGE_SIZE;
> > +
> > +        umemd.target_to_host_page_shift =
> > +            ffs(umemd.nr_host_pages_per_target_page) - 1;
> > +        umemd.host_to_target_page_shift =
> > +            ffs(umemd.nr_target_pages_per_host_page) - 1;
> > +
> > +        umemd.state = 0;
> > +        umemd.version_id = state.version_id;
> > +        umemd.mig_read_fd = mig_read_fd;
> > +        umemd.mig_read = mig_read;
> > +
> > +        mig_write_fd = dup(mig_read_fd);
> > +        if (mig_write_fd < 0) {
> > +            perror("could not dup for writable socket");
> > +            abort();
> > +        }
> > +        umemd.mig_write_fd = mig_write_fd;
> > +        umemd.mig_write = qemu_fopen_nonblock(mig_write_fd);
> > +
> > +        postcopy_incoming_umemd(); /* noreturn */
> > +    }
> > +
> > +    DPRINTF("qemu pid: %d daemon pid: %d\n", getpid(), child);
> > +    fd_close(&umemd.to_qemu_fd);
> > +    fd_close(&umemd.from_qemu_fd);
> > +    state.faulted_pages = g_malloc(umem_pages_size(MAX_FAULTED_PAGES));
> > +    state.faulted_pages->nr = 0;
> > +
> > +    /* close all UMem.shmem_fd */
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        umem_close_shmem(block->umem);
> > +    }
> > +    umem_qemu_wait_for_daemon(state.from_umemd_fd);
> > +}
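The page-geometry setup in the forked child relies on page sizes being powers of two, so ffs(x) - 1 is log2(x); one of the two ratios is >= 1 and the other collapses to 0 or 1 under integer division (only host page size == target page size has been tested, per the cover letter). Sketched standalone — compute_geom() and struct page_geom are illustrative names, not patch code:

```c
#include <assert.h>
#include <strings.h>    /* ffs() */

/* Hypothetical container for the shift fields set up in the child. */
struct page_geom {
    int nr_host_per_target;     /* 0 if the host page is larger */
    int nr_target_per_host;     /* 0 if the target page is larger */
    int t2h_shift;              /* target -> host page offset shift */
    int h2t_shift;              /* host -> target page offset shift */
};

/* Mirror of the arithmetic in postcopy_incoming_fork_umemd(): integer
 * division for the ratios, ffs() - 1 as log2 for power-of-two sizes. */
static struct page_geom compute_geom(int host_page_size, int target_page_size)
{
    struct page_geom g;

    g.nr_host_per_target = target_page_size / host_page_size;
    g.nr_target_per_host = host_page_size / target_page_size;
    g.t2h_shift = ffs(g.nr_host_per_target) - 1;
    g.h2t_shift = ffs(g.nr_target_per_host) - 1;
    return g;
}
```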
> > +
> > +static void postcopy_incoming_qemu_recv_quit(void)
> > +{
> > +    RAMBlock *block;
> > +    if (state.state & PIS_STATE_QUIT_RECEIVED) {
> > +        return;
> > +    }
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        if (block->umem != NULL) {
> > +            umem_destroy(block->umem);
> > +            block->umem = NULL;
> > +            block->flags &= ~RAM_POSTCOPY_UMEM_MASK;
> > +        }
> > +    }
> > +
> > +    DPRINTF("|= PIS_STATE_QUIT_RECEIVED\n");
> > +    state.state |= PIS_STATE_QUIT_RECEIVED;
> > +    qemu_set_fd_handler(state.from_umemd_fd, NULL, NULL, NULL);
> > +    qemu_fclose(state.from_umemd);
> > +    state.from_umemd = NULL;
> > +    fd_close(&state.from_umemd_fd);
> > +}
> > +
> > +static void postcopy_incoming_qemu_fflush_to_umemd_handler(void *opaque)
> > +{
> > +    assert(state.to_umemd != NULL);
> > +
> > +    nonblock_fflush(state.to_umemd);
> > +    if (nonblock_pending_size(state.to_umemd) > 0) {
> > +        return;
> > +    }
> > +
> > +    qemu_set_fd_handler(state.to_umemd->fd, NULL, NULL, NULL);
> > +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> > +        DPRINTF("|= PIS_STATE_QUIT_SENT\n");
> > +        state.state |= PIS_STATE_QUIT_SENT;
> > +        qemu_fclose(state.to_umemd->file);
> > +        state.to_umemd = NULL;
> > +        fd_close(&state.to_umemd_fd);
> > +        g_free(state.faulted_pages);
> > +        state.faulted_pages = NULL;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_qemu_fflush_to_umemd(void)
> > +{
> > +    qemu_set_fd_handler(state.to_umemd->fd, NULL,
> > +                        postcopy_incoming_qemu_fflush_to_umemd_handler, NULL);
> > +    postcopy_incoming_qemu_fflush_to_umemd_handler(NULL);
> > +}
> > +
> > +static void postcopy_incoming_qemu_queue_quit(void)
> > +{
> > +    if (state.state & PIS_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +
> > +    DPRINTF("|= PIS_STATE_QUIT_QUEUED\n");
> > +    umem_qemu_quit(state.to_umemd->file);
> > +    state.state |= PIS_STATE_QUIT_QUEUED;
> > +}
> > +
> > +static void postcopy_incoming_qemu_send_pages_present(void)
> > +{
> > +    if (state.faulted_pages->nr > 0) {
> > +        umem_qemu_send_pages_present(state.to_umemd->file,
> > +                                     state.faulted_pages);
> > +        state.faulted_pages->nr = 0;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_qemu_faulted_pages(
> > +    const struct umem_pages *pages)
> > +{
> > +    assert(pages->nr <= MAX_FAULTED_PAGES);
> > +    assert(state.faulted_pages != NULL);
> > +
> > +    if (state.faulted_pages->nr + pages->nr > MAX_FAULTED_PAGES) {
> > +        postcopy_incoming_qemu_send_pages_present();
> > +    }
> > +    memcpy(&state.faulted_pages->pgoffs[state.faulted_pages->nr],
> > +           &pages->pgoffs[0], sizeof(pages->pgoffs[0]) * pages->nr);
> > +    state.faulted_pages->nr += pages->nr;
> > +}
> > +
> > +static void postcopy_incoming_qemu_cleanup_umem(void);
> > +
> > +static int postcopy_incoming_qemu_handle_req_one(void)
> > +{
> > +    int offset = 0;
> > +    int ret;
> > +    uint8_t cmd;
> > +
> > +    ret = qemu_peek_buffer(state.from_umemd, &cmd, sizeof(cmd), offset);
> > +    offset += sizeof(cmd);
> > +    if (ret != sizeof(cmd)) {
> > +        return -EAGAIN;
> > +    }
> > +    DPRINTF("cmd %c\n", cmd);
> > +
> > +    switch (cmd) {
> > +    case UMEM_DAEMON_QUIT:
> > +        postcopy_incoming_qemu_recv_quit();
> > +        postcopy_incoming_qemu_queue_quit();
> > +        postcopy_incoming_qemu_cleanup_umem();
> > +        break;
> > +    case UMEM_DAEMON_TRIGGER_PAGE_FAULT: {
> > +        struct umem_pages *pages =
> > +            umem_qemu_trigger_page_fault(state.from_umemd, &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (state.to_umemd_fd >= 0 && !(state.state & PIS_STATE_QUIT_QUEUED)) {
> > +            postcopy_incoming_qemu_faulted_pages(pages);
> > +            g_free(pages);
> > +        }
> > +        break;
> > +    }
> > +    case UMEM_DAEMON_ERROR:
> > +        /* umem daemon hit troubles, so it warned us to stop vm execution */
> > +        vm_stop(RUN_STATE_IO_ERROR); /* or RUN_STATE_INTERNAL_ERROR */
> > +        break;
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +
> > +    if (state.from_umemd != NULL) {
> > +        qemu_file_skip(state.from_umemd, offset);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void postcopy_incoming_qemu_handle_req(void *opaque)
> > +{
> > +    do {
> > +        int ret = postcopy_incoming_qemu_handle_req_one();
> > +        if (ret == -EAGAIN) {
> > +            break;
> > +        }
> > +    } while (state.from_umemd != NULL &&
> > +             qemu_pending_size(state.from_umemd) > 0);
> > +
> > +    if (state.to_umemd != NULL) {
> > +        if (state.faulted_pages->nr > 0) {
> > +            postcopy_incoming_qemu_send_pages_present();
> > +        }
> > +        postcopy_incoming_qemu_fflush_to_umemd();
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_ready(void)
> > +{
> > +    umem_qemu_ready(state.to_umemd_fd);
> > +
> > +    state.from_umemd = qemu_fopen_pipe(state.from_umemd_fd);
> > +    state.to_umemd = qemu_fopen_nonblock(state.to_umemd_fd);
> > +    qemu_set_fd_handler(state.from_umemd_fd,
> > +                        postcopy_incoming_qemu_handle_req, NULL, NULL);
> > +}
> > +
> > +static void postcopy_incoming_qemu_cleanup_umem(void)
> > +{
> > +    /* If qemu quits before postcopy completes, tell the umem daemon
> > +       to tear down the umem device and exit. */
> > +    if (state.to_umemd_fd >= 0) {
> > +        postcopy_incoming_qemu_queue_quit();
> > +        postcopy_incoming_qemu_fflush_to_umemd();
> > +    }
> > +
> > +    if (state.dev) {
> > +        umem_dev_destroy(state.dev);
> > +        state.dev = NULL;
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_cleanup(void)
> > +{
> > +    postcopy_incoming_qemu_cleanup_umem();
> > +    if (state.to_umemd != NULL) {
> > +        nonblock_wait_for_flush(state.to_umemd);
> > +    }
> > +}
> > +
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size)
> > +{
> > +    uint64_t nr = DIV_ROUND_UP(size, state.host_page_size);
> > +    size_t len = umem_pages_size(nr);
> > +    ram_addr_t end = addr + size;
> > +    struct umem_pages *pages;
> > +    int i;
> > +
> > +    if (state.to_umemd_fd < 0 || state.state & PIS_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +    pages = g_malloc(len);
> > +    pages->nr = nr;
> > +    for (i = 0; addr < end; addr += state.host_page_size, i++) {
> > +        pages->pgoffs[i] = addr >> state.host_page_shift;
> > +    }
> > +    assert(state.to_umemd != NULL);
> > +    umem_qemu_send_pages_unmapped(state.to_umemd->file, pages);
> > +    g_free(pages);
> > +    postcopy_incoming_qemu_fflush_to_umemd();
> > +}
> > +
> > +/**************************************************************************
> > + * incoming umem daemon
> > + */
> > +
> > +static void postcopy_incoming_umem_recv_quit(void)
> > +{
> > +    if (umemd.state & UMEM_STATE_QUIT_RECEIVED) {
> > +        return;
> > +    }
> > +    DPRINTF("|= UMEM_STATE_QUIT_RECEIVED\n");
> > +    umemd.state |= UMEM_STATE_QUIT_RECEIVED;
> > +    qemu_fclose(umemd.from_qemu);
> > +    umemd.from_qemu = NULL;
> > +    fd_close(&umemd.from_qemu_fd);
> > +}
> > +
> > +static void postcopy_incoming_umem_queue_quit(void)
> > +{
> > +    if (umemd.state & UMEM_STATE_QUIT_QUEUED) {
> > +        return;
> > +    }
> > +    DPRINTF("|= UMEM_STATE_QUIT_QUEUED\n");
> > +    umem_daemon_quit(umemd.to_qemu->file);
> > +    umemd.state |= UMEM_STATE_QUIT_QUEUED;
> > +}
> > +
> > +static void postcopy_incoming_umem_send_eoc_req(void)
> > +{
> > +    struct qemu_umem_req req;
> > +
> > +    if (umemd.state & UMEM_STATE_EOC_SENT) {
> > +        return;
> > +    }
> > +
> > +    DPRINTF("|= UMEM_STATE_EOC_SENT\n");
> > +    req.cmd = QEMU_UMEM_REQ_EOC;
> > +    postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +    umemd.state |= UMEM_STATE_EOC_SENT;
> > +    qemu_fclose(umemd.mig_write->file);
> > +    umemd.mig_write = NULL;
> > +    fd_close(&umemd.mig_write_fd);
> > +}
> > +
> > +static void postcopy_incoming_umem_send_page_req(RAMBlock *block)
> > +{
> > +    struct qemu_umem_req req;
> > +    int bit;
> > +    uint64_t target_pgoff;
> > +    int i;
> > +
> > +    umemd.page_request.nr = MAX_REQUESTS;
> > +    umem_get_page_request(block->umem, &umemd.page_request);
> > +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> > +            block->idstr, umemd.page_request.nr,
> > +            (uint64_t)umemd.page_request.pgoffs[0],
> > +            (uint64_t)umemd.page_request.pgoffs[1]);
> > +
> > +    if (umemd.last_block_write != block) {
> > +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND;
> > +        req.idstr = block->idstr;
> > +    } else {
> > +        req.cmd = QEMU_UMEM_REQ_ON_DEMAND_CONT;
> > +    }
> > +
> > +    req.nr = 0;
> > +    req.pgoffs = umemd.target_pgoffs;
> > +    if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> > +        for (i = 0; i < umemd.page_request.nr; i++) {
> > +            target_pgoff =
> > +                umemd.page_request.pgoffs[i] >> umemd.host_to_target_page_shift;
> > +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> > +
> > +            if (!test_and_set_bit(bit, umemd.phys_requested)) {
> > +                req.pgoffs[req.nr] = target_pgoff;
> > +                req.nr++;
> > +            }
> > +        }
> > +    } else {
> > +        for (i = 0; i < umemd.page_request.nr; i++) {
> > +            int j;
> > +            target_pgoff =
> > +                umemd.page_request.pgoffs[i] << umemd.host_to_target_page_shift;
> > +            bit = (block->offset >> TARGET_PAGE_BITS) + target_pgoff;
> > +
> > +            for (j = 0; j < umemd.nr_target_pages_per_host_page; j++) {
> > +                if (!test_and_set_bit(bit + j, umemd.phys_requested)) {
> > +                    req.pgoffs[req.nr] = target_pgoff + j;
> > +                    req.nr++;
> > +                }
> > +            }
> > +        }
> > +    }
> > +
> > +    DPRINTF("id %s nr %d offs 0x%"PRIx64" 0x%"PRIx64"\n",
> > +            block->idstr, req.nr, req.pgoffs[0], req.pgoffs[1]);
> > +    if (req.nr > 0 && umemd.mig_write != NULL) {
> > +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +        umemd.last_block_write = block;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_umem_send_pages_present(void)
> > +{
> > +    if (umemd.present_request->nr > 0) {
> > +        umem_daemon_send_pages_present(umemd.to_qemu->file,
> > +                                       umemd.present_request);
> > +        umemd.present_request->nr = 0;
> > +    }
> > +}
> > +
> > +static void postcopy_incoming_umem_pages_present_one(
> > +    uint32_t nr, const __u64 *pgoffs, uint64_t ramblock_pgoffset)
> > +{
> > +    uint32_t i;
> > +    assert(nr <= MAX_PRESENT_REQUESTS);
> > +
> > +    if (umemd.present_request->nr + nr > MAX_PRESENT_REQUESTS) {
> > +        postcopy_incoming_umem_send_pages_present();
> > +    }
> > +
> > +    for (i = 0; i < nr; i++) {
> > +        umemd.present_request->pgoffs[umemd.present_request->nr + i] =
> > +            pgoffs[i] + ramblock_pgoffset;
> > +    }
> > +    umemd.present_request->nr += nr;
> > +}
> > +
> > +static void postcopy_incoming_umem_pages_present(
> > +    const struct umem_page_cached *page_cached, uint64_t ramblock_pgoffset)
> > +{
> > +    uint32_t left = page_cached->nr;
> > +    uint32_t offset = 0;
> > +
> > +    while (left > 0) {
> > +        uint32_t nr = MIN(left, MAX_PRESENT_REQUESTS);
> > +        postcopy_incoming_umem_pages_present_one(
> > +            nr, &page_cached->pgoffs[offset], ramblock_pgoffset);
> > +
> > +        left -= nr;
> > +        offset += nr;
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_umem_ram_load(void)
> > +{
> > +    ram_addr_t offset;
> > +    int flags;
> > +    int error;
> > +    void *shmem;
> > +    int i;
> > +    int bit;
> > +
> > +    if (umemd.version_id != RAM_SAVE_VERSION_ID) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    offset = qemu_get_be64(umemd.mig_read);
> > +
> > +    flags = offset & ~TARGET_PAGE_MASK;
> > +    offset &= TARGET_PAGE_MASK;
> > +
> > +    assert(!(flags & RAM_SAVE_FLAG_MEM_SIZE));
> > +
> > +    if (flags & RAM_SAVE_FLAG_EOS) {
> > +        DPRINTF("RAM_SAVE_FLAG_EOS\n");
> > +        postcopy_incoming_umem_send_eoc_req();
> > +
> > +        qemu_fclose(umemd.mig_read);
> > +        umemd.mig_read = NULL;
> > +        fd_close(&umemd.mig_read_fd);
> > +        umemd.state |= UMEM_STATE_EOS_RECEIVED;
> > +
> > +        postcopy_incoming_umem_queue_quit();
> > +        DPRINTF("|= UMEM_STATE_EOS_RECEIVED\n");
> > +        return 0;
> > +    }
> > +
> > +    shmem = ram_load_host_from_stream_offset(umemd.mig_read, offset, flags,
> > +                                             &umemd.last_block_read);
> > +    if (!shmem) {
> > +        DPRINTF("shmem == NULL\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (flags & RAM_SAVE_FLAG_COMPRESS) {
> > +        uint8_t ch = qemu_get_byte(umemd.mig_read);
> > +        memset(shmem, ch, TARGET_PAGE_SIZE);
> > +    } else if (flags & RAM_SAVE_FLAG_PAGE) {
> > +        qemu_get_buffer(umemd.mig_read, shmem, TARGET_PAGE_SIZE);
> > +    }
> > +
> > +    error = qemu_file_get_error(umemd.mig_read);
> > +    if (error) {
> > +        DPRINTF("error %d\n", error);
> > +        return error;
> > +    }
> > +
> > +    umemd.page_cached.nr = 0;
> > +    bit = (umemd.last_block_read->offset + offset) >> TARGET_PAGE_BITS;
> > +    if (!test_and_set_bit(bit, umemd.phys_received)) {
> > +        if (TARGET_PAGE_SIZE >= umemd.host_page_size) {
> > +            __u64 pgoff = offset >> umemd.host_page_shift;
> > +            for (i = 0; i < umemd.nr_host_pages_per_target_page; i++) {
> > +                umemd.page_cached.pgoffs[umemd.page_cached.nr] = pgoff + i;
> > +                umemd.page_cached.nr++;
> > +            }
> > +        } else {
> > +            bool mark_cache = true;
> > +            for (i = 0; i < umemd.nr_target_pages_per_host_page; i++) {
> > +                if (!test_bit(bit + i, umemd.phys_received)) {
> > +                    mark_cache = false;
> > +                    break;
> > +                }
> > +            }
> > +            if (mark_cache) {
> > +                umemd.page_cached.pgoffs[0] = offset >> umemd.host_page_shift;
> > +                umemd.page_cached.nr = 1;
> > +            }
> > +        }
> > +    }
> > +
> > +    if (umemd.page_cached.nr > 0) {
> > +        umem_mark_page_cached(umemd.last_block_read->umem, &umemd.page_cached);
> > +
> > +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED) && umemd.to_qemu_fd >= 0 &&
> > +            (incoming_postcopy_flags & INCOMING_FLAGS_FAULT_REQUEST)) {
> > +            uint64_t ramblock_pgoffset;
> > +
> > +            ramblock_pgoffset =
> > +                umemd.last_block_read->offset >> umemd.host_page_shift;
> > +            postcopy_incoming_umem_pages_present(&umemd.page_cached,
> > +                                                 ramblock_pgoffset);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static bool postcopy_incoming_umem_check_umem_done(void)
> > +{
> > +    bool all_done = true;
> > +    RAMBlock *block;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        UMem *umem = block->umem;
> > +        if (umem != NULL && umem->nsets == umem->nbits) {
> > +            umem_unmap_shmem(umem);
> > +            umem_destroy(umem);
> > +            block->umem = NULL;
> > +        }
> > +        if (block->umem != NULL) {
> > +            all_done = false;
> > +        }
> > +    }
> > +    return all_done;
> > +}
> > +
> > +static bool postcopy_incoming_umem_page_faulted(const struct umem_pages *pages)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < pages->nr; i++) {
> > +        ram_addr_t addr = pages->pgoffs[i] << umemd.host_page_shift;
> > +        RAMBlock *block = qemu_get_ram_block(addr);
> > +        addr -= block->offset;
> > +        umem_remove_shmem(block->umem, addr, umemd.host_page_size);
> > +    }
> > +    return postcopy_incoming_umem_check_umem_done();
> > +}
> > +
> > +static bool
> > +postcopy_incoming_umem_page_unmapped(const struct umem_pages *pages)
> > +{
> > +    RAMBlock *block;
> > +    ram_addr_t addr;
> > +    int i;
> > +
> > +    struct qemu_umem_req req = {
> > +        .cmd = QEMU_UMEM_REQ_REMOVE,
> > +        .nr = 0,
> > +        .pgoffs = (uint64_t *)pages->pgoffs,
> > +    };
> > +
> > +    addr = pages->pgoffs[0] << umemd.host_page_shift;
> > +    block = qemu_get_ram_block(addr);
> > +
> > +    for (i = 0; i < pages->nr; i++) {
> > +        int pgoff;
> > +
> > +        addr = pages->pgoffs[i] << umemd.host_page_shift;
> > +        pgoff = addr >> TARGET_PAGE_BITS;
> > +        if (!test_bit(pgoff, umemd.phys_received) &&
> > +            !test_bit(pgoff, umemd.phys_requested)) {
> > +            req.pgoffs[req.nr] = pgoff;
> > +            req.nr++;
> > +        }
> > +        set_bit(pgoff, umemd.phys_received);
> > +        set_bit(pgoff, umemd.phys_requested);
> > +
> > +        umem_remove_shmem(block->umem,
> > +                          addr - block->offset, umemd.host_page_size);
> > +    }
> > +    if (req.nr > 0 && umemd.mig_write != NULL) {
> > +        req.idstr = block->idstr;
> > +        postcopy_incoming_send_req(umemd.mig_write->file, &req);
> > +    }
> > +
> > +    return postcopy_incoming_umem_check_umem_done();
> > +}
> > +
> > +static void postcopy_incoming_umem_done(void)
> > +{
> > +    postcopy_incoming_umem_send_eoc_req();
> > +    postcopy_incoming_umem_queue_quit();
> > +}
> > +
> > +static int postcopy_incoming_umem_handle_qemu(void)
> > +{
> > +    int ret;
> > +    int offset = 0;
> > +    uint8_t cmd;
> > +
> > +    ret = qemu_peek_buffer(umemd.from_qemu, &cmd, sizeof(cmd), offset);
> > +    offset += sizeof(cmd);
> > +    if (ret != sizeof(cmd)) {
> > +        return -EAGAIN;
> > +    }
> > +    DPRINTF("cmd %c\n", cmd);
> > +    switch (cmd) {
> > +    case UMEM_QEMU_QUIT:
> > +        postcopy_incoming_umem_recv_quit();
> > +        postcopy_incoming_umem_done();
> > +        break;
> > +    case UMEM_QEMU_PAGE_FAULTED: {
> > +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> > +                                                   &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (postcopy_incoming_umem_page_faulted(pages)) {
> > +            postcopy_incoming_umem_done();
> > +        }
> > +        g_free(pages);
> > +        break;
> > +    }
> > +    case UMEM_QEMU_PAGE_UNMAPPED: {
> > +        struct umem_pages *pages = umem_recv_pages(umemd.from_qemu,
> > +                                                   &offset);
> > +        if (pages == NULL) {
> > +            return -EAGAIN;
> > +        }
> > +        if (postcopy_incoming_umem_page_unmapped(pages)) {
> > +            postcopy_incoming_umem_done();
> > +        }
> > +        g_free(pages);
> > +        break;
> > +    }
> > +    default:
> > +        abort();
> > +        break;
> > +    }
> > +    if (umemd.from_qemu != NULL) {
> > +        qemu_file_skip(umemd.from_qemu, offset);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static void set_fd(int fd, fd_set *fds, int *nfds)
> > +{
> > +    FD_SET(fd, fds);
> > +    if (fd > *nfds) {
> > +        *nfds = fd;
> > +    }
> > +}
> > +
> > +static int postcopy_incoming_umemd_main_loop(void)
> > +{
> > +    fd_set writefds;
> > +    fd_set readfds;
> > +    int nfds;
> > +    RAMBlock *block;
> > +    int ret;
> > +
> > +    int pending_size;
> > +    bool get_page_request;
> > +
> > +    nfds = -1;
> > +    FD_ZERO(&writefds);
> > +    FD_ZERO(&readfds);
> > +
> > +    if (umemd.mig_write != NULL) {
> > +        pending_size = nonblock_pending_size(umemd.mig_write);
> > +        if (pending_size > 0) {
> > +            set_fd(umemd.mig_write_fd, &writefds, &nfds);
> > +        }
> > +    } else {
> > +        pending_size = 0;
> > +    }
> > +
> > +#define PENDING_SIZE_MAX (MAX_REQUESTS * sizeof(uint64_t) * 2)
> > +    /* If page requests to the migration source have accumulated,
> > +       stop accepting new page fault requests for now. */
> > +    get_page_request = (pending_size <= PENDING_SIZE_MAX);
> > +
> > +    if (get_page_request) {
> > +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +            if (block->umem != NULL) {
> > +                set_fd(block->umem->fd, &readfds, &nfds);
> > +            }
> > +        }
> > +    }
> > +
> > +    if (umemd.mig_read_fd >= 0) {
> > +        set_fd(umemd.mig_read_fd, &readfds, &nfds);
> > +    }
> > +
> > +    if (umemd.to_qemu != NULL &&
> > +        nonblock_pending_size(umemd.to_qemu) > 0) {
> > +        set_fd(umemd.to_qemu_fd, &writefds, &nfds);
> > +    }
> > +    if (umemd.from_qemu_fd >= 0) {
> > +        set_fd(umemd.from_qemu_fd, &readfds, &nfds);
> > +    }
> > +
> > +    ret = select(nfds + 1, &readfds, &writefds, NULL, NULL);
> > +    if (ret == -1) {
> > +        if (errno == EINTR) {
> > +            return 0;
> > +        }
> > +        return ret;
> > +    }
> > +
> > +    if (umemd.mig_write_fd >= 0 && FD_ISSET(umemd.mig_write_fd, &writefds)) {
> > +        nonblock_fflush(umemd.mig_write);
> > +    }
> > +    if (umemd.to_qemu_fd >= 0 && FD_ISSET(umemd.to_qemu_fd, &writefds)) {
> > +        nonblock_fflush(umemd.to_qemu);
> > +    }
> > +    if (get_page_request) {
> > +        QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +            if (block->umem != NULL && FD_ISSET(block->umem->fd, &readfds)) {
> > +                postcopy_incoming_umem_send_page_req(block);
> > +            }
> > +        }
> > +    }
> > +    if (umemd.mig_read_fd >= 0 && FD_ISSET(umemd.mig_read_fd, &readfds)) {
> > +        do {
> > +            ret = postcopy_incoming_umem_ram_load();
> > +            if (ret < 0) {
> > +                return ret;
> > +            }
> > +        } while (umemd.mig_read != NULL &&
> > +                 qemu_pending_size(umemd.mig_read) > 0);
> > +    }
> > +    if (umemd.from_qemu_fd >= 0 && FD_ISSET(umemd.from_qemu_fd, &readfds)) {
> > +        do {
> > +            ret = postcopy_incoming_umem_handle_qemu();
> > +            if (ret == -EAGAIN) {
> > +                break;
> > +            }
> > +        } while (umemd.from_qemu != NULL &&
> > +                 qemu_pending_size(umemd.from_qemu) > 0);
> > +    }
> > +
> > +    if (umemd.mig_write != NULL) {
> > +        nonblock_fflush(umemd.mig_write);
> > +    }
> > +    if (umemd.to_qemu != NULL) {
> > +        if (!(umemd.state & UMEM_STATE_QUIT_QUEUED)) {
> > +            postcopy_incoming_umem_send_pages_present();
> > +        }
> > +        nonblock_fflush(umemd.to_qemu);
> > +        if ((umemd.state & UMEM_STATE_QUIT_QUEUED) &&
> > +            nonblock_pending_size(umemd.to_qemu) == 0) {
> > +            DPRINTF("|= UMEM_STATE_QUIT_SENT\n");
> > +            qemu_fclose(umemd.to_qemu->file);
> > +            umemd.to_qemu = NULL;
> > +            fd_close(&umemd.to_qemu_fd);
> > +            umemd.state |= UMEM_STATE_QUIT_SENT;
> > +        }
> > +    }
> > +
> > +    return (umemd.state & UMEM_STATE_END_MASK) == UMEM_STATE_END_MASK;
> > +}
> > +
> > +static void postcopy_incoming_umemd(void)
> > +{
> > +    ram_addr_t last_ram_offset;
> > +    int nbits;
> > +    RAMBlock *block;
> > +    int ret;
> > +
> > +    qemu_daemon(1, 1);
> > +    signal(SIGPIPE, SIG_IGN);
> > +    DPRINTF("daemon pid: %d\n", getpid());
> > +
> > +    umemd.page_request.pgoffs = g_new(__u64, MAX_REQUESTS);
> > +    umemd.page_cached.pgoffs =
> > +        g_new(__u64, MAX_REQUESTS *
> > +              (TARGET_PAGE_SIZE >= umemd.host_page_size ?
> > +               1: umemd.nr_host_pages_per_target_page));
> > +    umemd.target_pgoffs =
> > +        g_new(uint64_t, MAX_REQUESTS *
> > +              MAX(umemd.nr_host_pages_per_target_page,
> > +                  umemd.nr_target_pages_per_host_page));
> > +    umemd.present_request = g_malloc(umem_pages_size(MAX_PRESENT_REQUESTS));
> > +    umemd.present_request->nr = 0;
> > +
> > +    last_ram_offset = qemu_last_ram_offset();
> > +    nbits = last_ram_offset >> TARGET_PAGE_BITS;
> > +    umemd.phys_requested = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> > +    umemd.phys_received = g_new0(unsigned long, BITS_TO_LONGS(nbits));
> > +    umemd.last_block_read = NULL;
> > +    umemd.last_block_write = NULL;
> > +
> > +    QLIST_FOREACH(block, &ram_list.blocks, next) {
> > +        UMem *umem = block->umem;
> > +        umem->umem = NULL;      /* the umem mapping area is marked
> > +                                   VM_DONTCOPY, so those mappings
> > +                                   were lost across fork() */
> > +        block->host = umem_map_shmem(umem);
> > +        umem_close_shmem(umem);
> > +    }
> > +    umem_daemon_ready(umemd.to_qemu_fd);
> > +    umemd.to_qemu = qemu_fopen_nonblock(umemd.to_qemu_fd);
> > +
> > +    /* wait for qemu to disown migration_fd */
> > +    umem_daemon_wait_for_qemu(umemd.from_qemu_fd);
> > +    umemd.from_qemu = qemu_fopen_pipe(umemd.from_qemu_fd);
> > +
> > +    DPRINTF("entering umemd main loop\n");
> > +    for (;;) {
> > +        ret = postcopy_incoming_umemd_main_loop();
> > +        if (ret != 0) {
> > +            break;
> > +        }
> > +    }
> > +    DPRINTF("exiting umemd main loop\n");
> > +
> > +    /* This daemon was forked from qemu and the parent qemu is still
> > +     * running.  Cleanup routines of linked libraries such as SDL must not
> > +     * be triggered, otherwise the parent qemu may use resources which
> > +     * were already freed.
> > +     */
> > +    fflush(stdout);
> > +    fflush(stderr);
> > +    _exit(ret < 0 ? EXIT_FAILURE : 0);
> > +}
> > diff --git a/migration-tcp.c b/migration-tcp.c
> > index cf6a9b8..aa35050 100644
> > --- a/migration-tcp.c
> > +++ b/migration-tcp.c
> > @@ -63,18 +63,25 @@ static void tcp_wait_for_connect(void *opaque)
> >      } while (ret == -1 && (socket_error()) == EINTR);
> >  
> >      if (ret < 0) {
> > -        migrate_fd_error(s);
> > -        return;
> > +        goto error_out;
> >      }
> >  
> >      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
> >  
> > -    if (val == 0)
> > +    if (val == 0) {
> > +        ret = postcopy_outgoing_create_read_socket(s);
> > +        if (ret < 0) {
> > +            goto error_out;
> > +        }
> >          migrate_fd_connect(s);
> > -    else {
> > +    } else {
> >          DPRINTF("error connecting %d\n", val);
> > -        migrate_fd_error(s);
> > +        goto error_out;
> >      }
> > +    return;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> >  }
> >  
> >  int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> > @@ -112,11 +119,19 @@ int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
> >  
> >      if (ret < 0) {
> >          DPRINTF("connect failed\n");
> > -        migrate_fd_error(s);
> > -        return ret;
> > +        goto error_out;
> > +    }
> > +
> > +    ret = postcopy_outgoing_create_read_socket(s);
> > +    if (ret < 0) {
> > +        goto error_out;
> >      }
> >      migrate_fd_connect(s);
> >      return 0;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> > +    return ret;
> >  }
> >  
> >  static void tcp_accept_incoming_migration(void *opaque)
> > @@ -145,7 +160,15 @@ static void tcp_accept_incoming_migration(void *opaque)
> >      }
> >  
> >      process_incoming_migration(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(c, f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        /* The socket is now disowned, so tell the umem daemon
> > +           that it is safe to use it. */
> > +        postcopy_incoming_qemu_ready();
> > +    }
> >  out:
> >      close(c);
> >  out2:
> > diff --git a/migration-unix.c b/migration-unix.c
> > index dfcf203..3707505 100644
> > --- a/migration-unix.c
> > +++ b/migration-unix.c
> > @@ -69,12 +69,20 @@ static void unix_wait_for_connect(void *opaque)
> >  
> >      qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
> >  
> > -    if (val == 0)
> > +    if (val == 0) {
> > +        ret = postcopy_outgoing_create_read_socket(s);
> > +        if (ret < 0) {
> > +            goto error_out;
> > +        }
> >          migrate_fd_connect(s);
> > -    else {
> > +    } else {
> >          DPRINTF("error connecting %d\n", val);
> > -        migrate_fd_error(s);
> > +        goto error_out;
> >      }
> > +    return;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> >  }
> >  
> >  int unix_start_outgoing_migration(MigrationState *s, const char *path)
> > @@ -109,11 +117,19 @@ int unix_start_outgoing_migration(MigrationState *s, const char *path)
> >  
> >      if (ret < 0) {
> >          DPRINTF("connect failed\n");
> > -        migrate_fd_error(s);
> > -        return ret;
> > +        goto error_out;
> > +    }
> > +
> > +    ret = postcopy_outgoing_create_read_socket(s);
> > +    if (ret < 0) {
> > +        goto error_out;
> >      }
> >      migrate_fd_connect(s);
> >      return 0;
> > +
> > +error_out:
> > +    migrate_fd_error(s);
> > +    return ret;
> >  }
> >  
> >  static void unix_accept_incoming_migration(void *opaque)
> > @@ -142,7 +158,13 @@ static void unix_accept_incoming_migration(void *opaque)
> >      }
> >  
> >      process_incoming_migration(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_fork_umemd(c, f);
> > +    }
> >      qemu_fclose(f);
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_ready();
> > +    }
> >  out:
> >      close(c);
> >  out2:
> > diff --git a/migration.c b/migration.c
> > index 0149ab3..51efe44 100644
> > --- a/migration.c
> > +++ b/migration.c
> > @@ -39,6 +39,11 @@ enum {
> >      MIG_STATE_COMPLETED,
> >  };
> >  
> > +enum {
> > +    MIG_SUBSTATE_PRECOPY,
> > +    MIG_SUBSTATE_POSTCOPY,
> > +};
> > +
> >  #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
> >  
> >  static NotifierList migration_state_notifiers =
> > @@ -255,6 +260,18 @@ static void migrate_fd_put_ready(void *opaque)
> >          return;
> >      }
> >  
> > +    if (s->substate == MIG_SUBSTATE_POSTCOPY) {
> > +        /* PRINTF("postcopy background\n"); */
> > +        ret = postcopy_outgoing_ram_save_background(s->mon, s->file,
> > +                                                    s->postcopy);
> > +        if (ret > 0) {
> > +            migrate_fd_completed(s);
> > +        } else if (ret < 0) {
> > +            migrate_fd_error(s);
> > +        }
> > +        return;
> > +    }
> > +
> >      DPRINTF("iterate\n");
> >      ret = qemu_savevm_state_iterate(s->mon, s->file);
> >      if (ret < 0) {
> > @@ -265,6 +282,19 @@ static void migrate_fd_put_ready(void *opaque)
> >          DPRINTF("done iterating\n");
> >          vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
> >  
> > +        if (s->params.postcopy) {
> > +            if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> > +                migrate_fd_error(s);
> > +                if (old_vm_running) {
> > +                    vm_start();
> > +                }
> > +                return;
> > +            }
> > +            s->substate = MIG_SUBSTATE_POSTCOPY;
> > +            s->postcopy = postcopy_outgoing_begin(s);
> > +            return;
> > +        }
> > +
> >          if (qemu_savevm_state_complete(s->mon, s->file) < 0) {
> >              migrate_fd_error(s);
> >          } else {
> > @@ -357,6 +387,7 @@ void migrate_fd_connect(MigrationState *s)
> >      int ret;
> >  
> >      s->state = MIG_STATE_ACTIVE;
> > +    s->substate = MIG_SUBSTATE_PRECOPY;
> >      s->file = qemu_fopen_ops_buffered(s,
> >                                        s->bandwidth_limit,
> >                                        migrate_fd_put_buffer,
> > diff --git a/migration.h b/migration.h
> > index 90ae362..2809e99 100644
> > --- a/migration.h
> > +++ b/migration.h
> > @@ -40,6 +40,12 @@ struct MigrationState
> >      int (*write)(MigrationState *s, const void *buff, size_t size);
> >      void *opaque;
> >      MigrationParams params;
> > +
> > +    /* for postcopy */
> > +    int substate;              /* precopy or postcopy */
> > +    int fd_read;
> > +    QEMUFile *file_read;        /* connection from the destination */
> > +    void *postcopy;
> >  };
> >  
> >  void process_incoming_migration(QEMUFile *f);
> > @@ -86,6 +92,7 @@ uint64_t ram_bytes_remaining(void);
> >  uint64_t ram_bytes_transferred(void);
> >  uint64_t ram_bytes_total(void);
> >  
> > +void ram_save_set_params(const MigrationParams *params, void *opaque);
> >  void sort_ram_list(void);
> >  int ram_save_block(QEMUFile *f);
> >  void ram_save_memory_set_dirty(void);
> > @@ -107,7 +114,30 @@ void migrate_add_blocker(Error *reason);
> >   */
> >  void migrate_del_blocker(Error *reason);
> >  
> > +/* For outgoing postcopy */
> > +int postcopy_outgoing_create_read_socket(MigrationState *s);
> > +int postcopy_outgoing_ram_save_live(Monitor *mon,
> > +                                    QEMUFile *f, int stage, void *opaque);
> > +void *postcopy_outgoing_begin(MigrationState *s);
> > +int postcopy_outgoing_ram_save_background(Monitor *mon, QEMUFile *f,
> > +                                          void *postcopy);
> > +
> > +/* For incoming postcopy */
> >  extern bool incoming_postcopy;
> >  extern unsigned long incoming_postcopy_flags;
> >  
> > +int postcopy_incoming_init(const char *incoming, bool incoming_postcopy);
> > +void postcopy_incoming_ram_alloc(const char *name,
> > +                                 size_t size, uint8_t **hostp, UMem **umemp);
> > +void postcopy_incoming_ram_free(UMem *umem);
> > +void postcopy_incoming_prepare(void);
> > +
> > +int postcopy_incoming_ram_load(QEMUFile *f, void *opaque, int version_id);
> > +void postcopy_incoming_fork_umemd(int mig_read_fd, QEMUFile *mig_read);
> > +void postcopy_incoming_qemu_ready(void);
> > +void postcopy_incoming_qemu_cleanup(void);
> > +#ifdef NEED_CPU_H
> > +void postcopy_incoming_qemu_pages_unmapped(ram_addr_t addr, ram_addr_t size);
> > +#endif
> > +
> >  #endif
> > diff --git a/qemu-common.h b/qemu-common.h
> > index 725922b..d74a8c9 100644
> > --- a/qemu-common.h
> > +++ b/qemu-common.h
> > @@ -17,6 +17,7 @@ typedef struct DeviceState DeviceState;
> >  
> >  struct Monitor;
> >  typedef struct Monitor Monitor;
> > +typedef struct UMem UMem;
> >  
> >  /* we put basic includes here to avoid repeating them in device drivers */
> >  #include <stdlib.h>
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 5c5b8f3..19e20f9 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -2510,7 +2510,10 @@ DEF("postcopy-flags", HAS_ARG, QEMU_OPTION_postcopy_flags,
> >      "-postcopy-flags unsigned-int(flags)\n"
> >      "	                flags for postcopy incoming migration\n"
> >      "                   when -incoming and -postcopy are specified.\n"
> > -    "                   This is for benchmark/debug purpose (default: 0)\n",
> > +    "                   This is for benchmark/debug purpose (default: 0)\n"
> > +    "                   Currently supported flags are\n"
> > +    "                   1: enable fault request from umemd to qemu\n"
> > +    "                      (default: disabled)\n",
> >      QEMU_ARCH_ALL)
> >  STEXI
> >  @item -postcopy-flags int
> 
> Can you move umem.c and umem.h to a separate patch from this one,
> please?
> > diff --git a/umem.c b/umem.c
> > new file mode 100644
> > index 0000000..b7be006
> > --- /dev/null
> > +++ b/umem.c
> > @@ -0,0 +1,379 @@
> > +/*
> > + * umem.c: user-process-backed memory module for postcopy live migration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +
> > +#include <linux/umem.h>
> > +
> > +#include "bitops.h"
> > +#include "sysemu.h"
> > +#include "hw/hw.h"
> > +#include "umem.h"
> > +
> > +//#define DEBUG_UMEM
> > +#ifdef DEBUG_UMEM
> > +#include <sys/syscall.h>
> > +#define DPRINTF(format, ...)                                            \
> > +    do {                                                                \
> > +        printf("%d:%ld %s:%d "format, getpid(), syscall(SYS_gettid),    \
> > +               __func__, __LINE__, ## __VA_ARGS__);                     \
> > +    } while (0)
> > +#else
> > +#define DPRINTF(format, ...)    do { } while (0)
> > +#endif
> > +
> > +#define DEV_UMEM        "/dev/umem"
> > +
> > +struct UMemDev {
> > +    int fd;
> > +    int page_shift;
> > +};
> > +
> > +UMemDev *umem_dev_new(void)
> > +{
> > +    UMemDev *umem_dev;
> > +    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
> > +    if (umem_dev_fd < 0) {
> > +        perror("can't open "DEV_UMEM);
> > +        abort();
> > +    }
> > +
> > +    umem_dev = g_new(UMemDev, 1);
> > +    umem_dev->fd = umem_dev_fd;
> > +    umem_dev->page_shift = ffs(getpagesize()) - 1;
> > +    return umem_dev;
> > +}
> > +
> > +void umem_dev_destroy(UMemDev *dev)
> > +{
> > +    close(dev->fd);
> > +    g_free(dev);
> > +}
> > +
> > +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name)
> > +{
> > +    struct umem_create create = {
> > +        .size = size,
> > +        .async_req_max = 0,
> > +        .sync_req_max = 0,
> > +    };
> > +    UMem *umem;
> > +
> > +    snprintf(create.name.id, sizeof(create.name.id),
> > +             "pid-%"PRId64, (uint64_t)getpid());
> > +    create.name.id[UMEM_ID_MAX - 1] = 0;
> > +    strncpy(create.name.name, name, sizeof(create.name.name));
> > +    create.name.name[UMEM_NAME_MAX - 1] = 0;
> > +
> > +    assert((size % getpagesize()) == 0);
> > +    if (ioctl(dev->fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
> > +        perror("UMEM_DEV_CREATE_UMEM");
> > +        abort();
> > +    }
> > +    if (ftruncate(create.shmem_fd, create.size) < 0) {
> > +        perror("truncate(\"shmem_fd\")");
> > +        abort();
> > +    }
> > +
> > +    umem = g_new(UMem, 1);
> > +    umem->nbits = 0;
> > +    umem->nsets = 0;
> > +    umem->faulted = NULL;
> > +    umem->page_shift = dev->page_shift;
> > +    umem->fd = create.umem_fd;
> > +    umem->shmem_fd = create.shmem_fd;
> > +    umem->size = create.size;
> > +    umem->umem = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
> > +                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +    if (umem->umem == MAP_FAILED) {
> > +        perror("mmap(UMem) failed");
> > +        abort();
> > +    }
> > +    return umem;
> > +}
> > +
> > +void umem_mmap(UMem *umem)
> > +{
> > +    void *ret = mmap(umem->umem, umem->size,
> > +                     PROT_EXEC | PROT_READ | PROT_WRITE,
> > +                     MAP_PRIVATE | MAP_FIXED, umem->fd, 0);
> > +    if (ret == MAP_FAILED) {
> > +        perror("umem_mmap(UMem) failed");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_destroy(UMem *umem)
> > +{
> > +    if (umem->fd != -1) {
> > +        close(umem->fd);
> > +    }
> > +    if (umem->shmem_fd != -1) {
> > +        close(umem->shmem_fd);
> > +    }
> > +    g_free(umem->faulted);
> > +    g_free(umem);
> > +}
> > +
> > +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request)
> > +{
> > +    if (ioctl(umem->fd, UMEM_GET_PAGE_REQUEST, page_request)) {
> > +        perror("daemon: UMEM_GET_PAGE_REQUEST");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached)
> > +{
> > +    if (ioctl(umem->fd, UMEM_MARK_PAGE_CACHED, page_cached)) {
> > +        perror("daemon: UMEM_MARK_PAGE_CACHED");
> > +        abort();
> > +    }
> > +}
> > +
> > +void umem_unmap(UMem *umem)
> > +{
> > +    munmap(umem->umem, umem->size);
> > +    umem->umem = NULL;
> > +}
> > +
> > +void umem_close(UMem *umem)
> > +{
> > +    close(umem->fd);
> > +    umem->fd = -1;
> > +}
> > +
> > +void *umem_map_shmem(UMem *umem)
> > +{
> > +    umem->nbits = umem->size >> umem->page_shift;
> > +    umem->nsets = 0;
> > +    umem->faulted = g_new0(unsigned long, BITS_TO_LONGS(umem->nbits));
> > +
> > +    umem->shmem = mmap(NULL, umem->size, PROT_READ | PROT_WRITE, MAP_SHARED,
> > +                       umem->shmem_fd, 0);
> > +    if (umem->shmem == MAP_FAILED) {
> > +        perror("daemon: mmap(\"shmem\")");
> > +        abort();
> > +    }
> > +    return umem->shmem;
> > +}
> > +
> > +void umem_unmap_shmem(UMem *umem)
> > +{
> > +    munmap(umem->shmem, umem->size);
> > +    umem->shmem = NULL;
> > +}
> > +
> > +void umem_remove_shmem(UMem *umem, size_t offset, size_t size)
> > +{
> > +    int s = offset >> umem->page_shift;
> > +    int e = (offset + size) >> umem->page_shift;
> > +    int i;
> > +
> > +    for (i = s; i < e; i++) {
> > +        if (!test_and_set_bit(i, umem->faulted)) {
> > +            umem->nsets++;
> > +#if defined(CONFIG_MADVISE) && defined(MADV_REMOVE)
> > +            madvise(umem->shmem + offset, size, MADV_REMOVE);
> > +#endif
> > +        }
> > +    }
> > +}
> > +
> > +void umem_close_shmem(UMem *umem)
> > +{
> > +    close(umem->shmem_fd);
> > +    umem->shmem_fd = -1;
> > +}
> > +
> > +/***************************************************************************/
> > +/* qemu <-> umem daemon communication */
> > +
> > +size_t umem_pages_size(uint64_t nr)
> > +{
> > +    return sizeof(struct umem_pages) + nr * sizeof(uint64_t);
> > +}
> > +
> > +static void umem_write_cmd(int fd, uint8_t cmd)
> > +{
> > +    DPRINTF("write cmd %c\n", cmd);
> > +
> > +    for (;;) {
> > +        ssize_t ret = write(fd, &cmd, 1);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            } else if (errno == EPIPE) {
> > +                perror("pipe");
> > +                DPRINTF("write cmd %c %zd %d: pipe is closed\n",
> > +                        cmd, ret, errno);
> > +                break;
> > +            }
> > +
> > +            perror("pipe");
> > +            DPRINTF("write cmd %c %zd %d\n", cmd, ret, errno);
> > +            abort();
> > +        }
> > +
> > +        break;
> > +    }
> > +}
> > +
> > +static void umem_read_cmd(int fd, uint8_t expect)
> > +{
> > +    uint8_t cmd;
> > +    for (;;) {
> > +        ssize_t ret = read(fd, &cmd, 1);
> > +        if (ret == -1) {
> > +            if (errno == EINTR) {
> > +                continue;
> > +            }
> > +            perror("pipe");
> > +            DPRINTF("read error cmd %c %zd %d\n", cmd, ret, errno);
> > +            abort();
> > +        }
> > +
> > +        if (ret == 0) {
> > +            DPRINTF("read cmd %c %zd: pipe is closed\n", cmd, ret);
> > +            abort();
> > +        }
> > +
> > +        break;
> > +    }
> > +
> > +    DPRINTF("read cmd %c\n", cmd);
> > +    if (cmd != expect) {
> > +        DPRINTF("cmd %c expect %d\n", cmd, expect);
> > +        abort();
> > +    }
> > +}
> > +
> > +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset)
> > +{
> > +    int ret;
> > +    uint64_t nr;
> > +    size_t size;
> > +    struct umem_pages *pages;
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)&nr, sizeof(nr), *offset);
> > +    *offset += sizeof(nr);
> > +    DPRINTF("ret %d nr %ld\n", ret, nr);
> > +    if (ret != sizeof(nr) || nr == 0) {
> > +        return NULL;
> > +    }
> > +
> > +    size = umem_pages_size(nr);
> > +    pages = g_malloc(size);
> > +    pages->nr = nr;
> > +    size -= sizeof(pages->nr);
> > +
> > +    ret = qemu_peek_buffer(f, (uint8_t*)pages->pgoffs, size, *offset);
> > +    *offset += size;
> > +    if (ret != size) {
> > +        g_free(pages);
> > +        return NULL;
> > +    }
> > +    return pages;
> > +}
> > +
> > +static void umem_send_pages(QEMUFile *f, const struct umem_pages *pages)
> > +{
> > +    size_t len = umem_pages_size(pages->nr);
> > +    qemu_put_buffer(f, (const uint8_t*)pages, len);
> > +}
> > +
> > +/* umem daemon -> qemu */
> > +void umem_daemon_ready(int to_qemu_fd)
> > +{
> > +    umem_write_cmd(to_qemu_fd, UMEM_DAEMON_READY);
> > +}
> > +
> > +void umem_daemon_quit(QEMUFile *to_qemu)
> > +{
> > +    qemu_put_byte(to_qemu, UMEM_DAEMON_QUIT);
> > +}
> > +
> > +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> > +                                    struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_qemu, UMEM_DAEMON_TRIGGER_PAGE_FAULT);
> > +    umem_send_pages(to_qemu, pages);
> > +}
> > +
> > +void umem_daemon_wait_for_qemu(int from_qemu_fd)
> > +{
> > +    umem_read_cmd(from_qemu_fd, UMEM_QEMU_READY);
> > +}
> > +
> > +/* qemu -> umem daemon */
> > +void umem_qemu_wait_for_daemon(int from_umemd_fd)
> > +{
> > +    umem_read_cmd(from_umemd_fd, UMEM_DAEMON_READY);
> > +}
> > +
> > +void umem_qemu_ready(int to_umemd_fd)
> > +{
> > +    umem_write_cmd(to_umemd_fd, UMEM_QEMU_READY);
> > +}
> > +
> > +void umem_qemu_quit(QEMUFile *to_umemd)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_QUIT);
> > +}
> > +
> > +/* qemu side handler */
> > +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> > +                                                int *offset)
> > +{
> > +    uint64_t i;
> > +    int page_shift = ffs(getpagesize()) - 1;
> > +    struct umem_pages *pages = umem_recv_pages(from_umemd, offset);
> > +    if (pages == NULL) {
> > +        return NULL;
> > +    }
> > +
> > +    for (i = 0; i < pages->nr; i++) {
> > +        ram_addr_t addr = pages->pgoffs[i] << page_shift;
> > +
> > +        /* make pages present by forcibly triggering page fault. */
> > +        volatile uint8_t *ram = qemu_get_ram_ptr(addr);
> > +        uint8_t dummy_read = ram[0];
> > +        (void)dummy_read;   /* suppress unused variable warning */
> > +    }
> > +
> > +    return pages;
> > +}
> > +
> > +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> > +                                  const struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_FAULTED);
> > +    umem_send_pages(to_umemd, pages);
> > +}
> > +
> > +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> > +                                   const struct umem_pages *pages)
> > +{
> > +    qemu_put_byte(to_umemd, UMEM_QEMU_PAGE_UNMAPPED);
> > +    umem_send_pages(to_umemd, pages);
> > +}
> > diff --git a/umem.h b/umem.h
> > new file mode 100644
> > index 0000000..5ca19ef
> > --- /dev/null
> > +++ b/umem.h
> > @@ -0,0 +1,105 @@
> > +/*
> > + * umem.h: user-process-backed memory module for postcopy live migration
> > + *
> > + * Copyright (c) 2011
> > + * National Institute of Advanced Industrial Science and Technology
> > + *
> > + * https://sites.google.com/site/grivonhome/quick-kvm-migration
> > + * Author: Isaku Yamahata <yamahata at valinux co jp>
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef QEMU_UMEM_H
> > +#define QEMU_UMEM_H
> > +
> > +#include <linux/umem.h>
> > +
> > +#include "qemu-common.h"
> > +
> > +typedef struct UMemDev UMemDev;
> > +
> > +struct UMem {
> > +    void *umem;
> > +    int fd;
> > +    void *shmem;
> > +    int shmem_fd;
> > +    uint64_t size;
> > +
> > +    /* indexed by host page size */
> > +    int page_shift;
> > +    int nbits;
> > +    int nsets;
> > +    unsigned long *faulted;
> > +};
> > +
> > +UMemDev *umem_dev_new(void);
> > +void umem_dev_destroy(UMemDev *dev);
> > +UMem *umem_dev_create(UMemDev *dev, size_t size, const char *name);
> > +void umem_mmap(UMem *umem);
> > +
> > +void umem_destroy(UMem *umem);
> > +
> > +/* umem device operations */
> > +void umem_get_page_request(UMem *umem, struct umem_page_request *page_request);
> > +void umem_mark_page_cached(UMem *umem, struct umem_page_cached *page_cached);
> > +void umem_unmap(UMem *umem);
> > +void umem_close(UMem *umem);
> > +
> > +/* umem shmem operations */
> > +void *umem_map_shmem(UMem *umem);
> > +void umem_unmap_shmem(UMem *umem);
> > +void umem_remove_shmem(UMem *umem, size_t offset, size_t size);
> > +void umem_close_shmem(UMem *umem);
> > +
> > +/* qemu on source <-> umem daemon communication */
> > +
> > +struct umem_pages {
> > +    uint64_t nr;        /* nr = 0 means completed */
> > +    uint64_t pgoffs[0];
> > +};
> > +
> > +/* daemon -> qemu */
> > +#define UMEM_DAEMON_READY               'R'
> > +#define UMEM_DAEMON_QUIT                'Q'
> > +#define UMEM_DAEMON_TRIGGER_PAGE_FAULT  'T'
> > +#define UMEM_DAEMON_ERROR               'E'
> > +
> > +/* qemu -> daemon */
> > +#define UMEM_QEMU_READY                 'r'
> > +#define UMEM_QEMU_QUIT                  'q'
> > +#define UMEM_QEMU_PAGE_FAULTED          't'
> > +#define UMEM_QEMU_PAGE_UNMAPPED         'u'
> > +
> > +struct umem_pages *umem_recv_pages(QEMUFile *f, int *offset);
> > +size_t umem_pages_size(uint64_t nr);
> > +
> > +/* for umem daemon */
> > +void umem_daemon_ready(int to_qemu_fd);
> > +void umem_daemon_wait_for_qemu(int from_qemu_fd);
> > +void umem_daemon_quit(QEMUFile *to_qemu);
> > +void umem_daemon_send_pages_present(QEMUFile *to_qemu,
> > +                                    struct umem_pages *pages);
> > +
> > +/* for qemu */
> > +void umem_qemu_wait_for_daemon(int from_umemd_fd);
> > +void umem_qemu_ready(int to_umemd_fd);
> > +void umem_qemu_quit(QEMUFile *to_umemd);
> > +struct umem_pages *umem_qemu_trigger_page_fault(QEMUFile *from_umemd,
> > +                                                int *offset);
> > +void umem_qemu_send_pages_present(QEMUFile *to_umemd,
> > +                                  const struct umem_pages *pages);
> > +void umem_qemu_send_pages_unmapped(QEMUFile *to_umemd,
> > +                                   const struct umem_pages *pages);
> > +
> > +#endif /* QEMU_UMEM_H */
> > diff --git a/vl.c b/vl.c
> > index 5430b8c..17427a0 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -3274,8 +3274,12 @@ int main(int argc, char **argv, char **envp)
> >      default_drive(default_sdcard, snapshot, machine->use_scsi,
> >                    IF_SD, 0, SD_OPTS);
> >  
> > -    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID, NULL,
> > -                         ram_save_live, NULL, ram_load, NULL);
> > +    if (postcopy_incoming_init(incoming, incoming_postcopy) < 0) {
> > +        exit(1);
> > +    }
> > +    register_savevm_live(NULL, "ram", 0, RAM_SAVE_VERSION_ID,
> > +                         ram_save_set_params, ram_save_live, NULL,
> > +                         ram_load, NULL);
> >  
> >      if (nb_numa_nodes > 0) {
> >          int i;
> > @@ -3471,6 +3475,9 @@ int main(int argc, char **argv, char **envp)
> >  
> >      if (incoming) {
> >          runstate_set(RUN_STATE_INMIGRATE);
> > +        if (incoming_postcopy) {
> > +            postcopy_incoming_prepare();
> > +        }
> 
> How about moving postcopy_incoming_prepare() into qemu_start_incoming_migration()?
> 
> >          int ret = qemu_start_incoming_migration(incoming);
> >          if (ret < 0) {
> >              fprintf(stderr, "Migration failed. Exit code %s(%d), exiting.\n",
> > @@ -3488,6 +3495,9 @@ int main(int argc, char **argv, char **envp)
> >      bdrv_close_all();
> >      pause_all_vcpus();
> >      net_cleanup();
> > +    if (incoming_postcopy) {
> > +        postcopy_incoming_qemu_cleanup();
> > +    }
> >      res_free();
> >  
> >      return 0;
> 
> Orit
> 

-- 
yamahata


* Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-01  9:52     ` [Qemu-devel] " Dor Laor
@ 2012-01-04  3:48       ` Michael Roth
  -1 siblings, 0 replies; 88+ messages in thread
From: Michael Roth @ 2012-01-04  3:48 UTC (permalink / raw)
  To: dlaor
  Cc: kvm, Juan Quintela, t.hirofuchi, satoshi.itoh, qemu-devel,
	Isaku Yamahata, Umesh Deshpande

On 01/01/2012 03:52 AM, Dor Laor wrote:
> On 12/30/2011 12:39 AM, Anthony Liguori wrote:
>> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>>> Intro
>>> =====
>>> This patch series implements postcopy live migration.[1]
>>> As discussed at KVM forum 2011, dedicated character device is used for
>>> distributed shared memory between migration source and destination.
>>> Now we can discuss/benchmark/compare with precopy. I believe there are
>>> much rooms for improvement.
>>>
>>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>>
>>>
>>> Usage
>>> =====
>>> You need load umem character device on the host before starting
>>> migration.
>>> Postcopy can be used for tcg and kvm accelarator. The implementation
>>> depend
>>> on only linux umem character device. But the driver dependent code is
>>> split
>>> into a file.
>>> I tested only host page size == guest page size case, but the
>>> implementation
>>> allows host page size != guest page size case.
>>>
>>> The following options are added with this patch series.
>>> - incoming part
>>> command line options
>>> -postcopy [-postcopy-flags<flags>]
>>> where flags is for changing behavior for benchmark/debugging
>>> Currently the following flags are available
>>> 0: default
>>> 1: enable touching page request
>>>
>>> example:
>>> qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>>
>>> - outging part
>>> options for migrate command
>>> migrate [-p [-n]] URI
>>> -p: indicate postcopy migration
>>> -n: disable background transferring pages: This is for
>>> benchmark/debugging
>>>
>>> example:
>>> migrate -p -n tcp:<dest ip address>:4444
>>>
>>>
>>> TODO
>>> ====
>>> - benchmark/evaluation. Especially how async page fault affects the
>>> result.
>>
>> I'll review this series next week (Mike/Juan, please also review when
>> you can).
>>
>> But we really need to think hard about whether this is the right thing
>> to take into the tree. I worry a lot about the fact that we don't test
>> pre-copy migration nearly enough and adding a second form just
>> introduces more things to test.
>
> It is an issue, but it can't be a merge criterion; Isaku is not to blame
> for pre-copy live migration's lack of testing.
>
> I would say that 90% of live migration problems are not related to the
> pre/post stage but are rather issues of device model save state. So
> post-copy shouldn't add a significant regression here.
>
> Probably it will be good to ask every migration patch writer to write an
> additional unit test for migration.
>
>> It's also not clear to me why post-copy is better. If you were going to
>> sit down and explain to someone building a management tool when they
>> should use pre-copy and when they should use post-copy, what would you
>> tell them?
>
> Today, we have a default of max-downtime of 100ms.
> If either the guest work set size or the host networking throughput
> can't match the downtime, migration won't end.
> The mgmt user options are:
> - increase the downtime more and more to an actual stop
> - fail migrate
>
> W/ post-copy there is another option.
> Performance measurements will teach us (probably prior to commit) when
> this stage is valuable. Most likely, we better try first with pre-copy
> and if we can't meet the downtime we can optionally use post-copy.

Umesh's paper already seems to give strong indications that at least one 
iteration of pre-copy is optimal in terms of downtime, so I wonder if 
we're starting off on the wrong track with the all-or-nothing approach 
taken with this series?

I only have Umesh's paper to go off (which, granted, notes shadow paging 
(which I guess we effectively have with these patches) as a potential 
improvement to the pseudo swap device used there), but otherwise it 
seems like we'd just get more downtime as the target starts choking on 
network-based page faults.

It's probably not too useful to speculate on performance at this point, 
but I think it'll be easier to get the data (and hit closer to the mark 
suggested by the paper) if we started off assuming that pre-copy should 
still be in play. Maybe something like:

migrate -d tcp:host:123 -p[ostcopy] <after x iterations>

x=0 for post-copy only, no -p for pre-copy-only

Also seems a bit cleaner. And if post-copy proves to be optimal we just 
make -p 0 implied... minor details at this point, but the main thing is 
that integrating better with the pre-copy code will make it easier to 
determine what the sweet spot is and how much we stand to gain.

>
> Here's a paper by Umesh (the migration thread writer):
> http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
>
> Regards,
> Dor
>
>>
>> Regards,
>>
>> Anthony Liguori
>>
>
>



* Re: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
  2011-12-29 22:39   ` [Qemu-devel] " Anthony Liguori
@ 2012-01-04  3:51     ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-04  3:51 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: kvm, qemu-devel, t.hirofuchi, satoshi.itoh, Michael Roth, Juan Quintela

On Thu, Dec 29, 2011 at 04:39:52PM -0600, Anthony Liguori wrote:
>> TODO
>> ====
>> - benchmark/evaluation. Especially how async page fault affects the result.
>
> I'll review this series next week (Mike/Juan, please also review when you can).
>
> But we really need to think hard about whether this is the right thing to 
> take into the tree.  I worry a lot about the fact that we don't test 
> pre-copy migration nearly enough and adding a second form just introduces 
> more things to test.
>
> It's also not clear to me why post-copy is better.  If you were going to 
> sit down and explain to someone building a management tool when they 
> should use pre-copy and when they should use post-copy, what would you 
> tell them?

A concrete patch and its benchmark/evaluation results will help a lot in
having a better discussion and making a better decision (whatever decision
we end up making).

My answer is: follow the same policy as in the block device case.
It supports block migration/copy-on-read/image streaming/live block copy...
(some of them are still under development, though)

Seriously, we'll learn the best practice through evaluation and experience.

thanks,
-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: 回复:  [PATCH 00/21][RFC] postcopy live migration
       [not found] ` <BLU0-SMTP161AC380D472854F48E33A5BC9A0@phx.gbl>
@ 2012-01-11  2:45     ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-11  2:45 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> Hello all!

Hi, thank you for the detailed report. The procedure you've tried looks
basically sound. Some comments below.


> I got the qemu base version (03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> patched it correctly, but it still didn't work and I got the same scenario
> as before. Outgoing node: Intel x86_64; incoming node: AMD x86_64. The
> guest image is on NFS.
>  
> I think I should show what I did more clearly, and I hope somebody can
> figure out the problem.
> 
>  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> 
>        ./configure --target-list=x86_64-softmmu --enable-kvm --enable-postcopy
> --enable-debug
>        make
>        make install
> 
>  ・ 2, outgoing qemu:
> 
> qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> -machine accel=kvm
> incoming qemu:
> qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> 
>  ・ 3, outgoing node:
> 
> migrate -d -p -n tcp:(incoming node ip):8888
>  
> result:
> 
>  ・ outgoing qemu:
> 
> info status: VM-status: paused (finish-migrate);
> 
>  ・ incoming qemu:
> 
> can't type any more and can't kill the process(qemu-system-x86)
>  
> I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> 
>  ・ outgoing qemu:
> 
> (qemu) migration-tcp: connect completed
> migration: beginning savevm
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> migration: iterate
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> migration: done iterating
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> 
>  ・ incoming qemu:
> 
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 4872:4872 postcopy_incoming_ram_load:1057: done
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS

There should be only a single EOS line. Is the second one just a copy & paste mistake?
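For reference, the `flags` values in the quoted log can be decoded as below. The constants mirror the RAM_SAVE_FLAG_* values in arch_init.c of that era (0x04 = MEM_SIZE, 0x10 = EOS), but treat them as assumptions and verify against your tree:

```python
# Decoding the `flags` field from the incoming ram load log above.
RAM_SAVE_FLAG_MEM_SIZE = 0x04   # first packet: ram block name/size list
RAM_SAVE_FLAG_EOS = 0x10        # end of section marker

def decode_ram_flags(flags):
    names = []
    if flags & RAM_SAVE_FLAG_MEM_SIZE:
        names.append("MEM_SIZE")
    if flags & RAM_SAVE_FLAG_EOS:
        names.append("EOS")
    return names or ["other"]

print(decode_ram_flags(0x4))    # the "addr 0x10870000 flags 0x4" line
print(decode_ram_flags(0x10))   # the "flags 0x10 ... EOS" lines
```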


> from the result:
> It didn't get to the "successfully loaded vm state"
> So it still in the qemu_loadvm_state, and I found it's in
> cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> stuck

Can you please track it down one more step?
Which line does it get stuck on in kvm_put_msrs()? kvm_put_msrs() doesn't
seem to block. (A backtrace from a debugger would be best.)

If possible, can you please also test with a more simplified configuration,
i.e. drop as many devices as possible: no usbdevice, no disk, etc.
That will simplify the debugging.

thanks,

> Does anyone give some advises on the problem?
> Thanks very much~
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2011-12-29 09:25
> To: kvm; qemu-devel
> CC: yamahata; t.hirofuchi; satoshi.itoh
> Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> Intro
> =====
> This patch series implements postcopy live migration.[1]
> As discussed at KVM forum 2011, dedicated character device is used for
> distributed shared memory between migration source and destination.
> Now we can discuss/benchmark/compare it with precopy. I believe there is
> much room for improvement.
>  
> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>  
>  
> Usage
> =====
> You need to load the umem character device on the host before starting
> migration. Postcopy can be used with the tcg and kvm accelerators. The
> implementation depends only on the Linux umem character device, but the
> driver-dependent code is split into its own file.
> I tested only the host page size == guest page size case, but the
> implementation allows host page size != guest page size.
>  
> The following options are added with this patch series.
> - incoming part
>   command line options
>   -postcopy [-postcopy-flags <flags>]
>   where <flags> changes behavior for benchmarking/debugging.
>   Currently the following flags are available:
>   0: default
>   1: enable touching page request
>  
>   example:
>   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>  
> - outgoing part
>   options for migrate command 
>   migrate [-p [-n]] URI
>   -p: use postcopy migration
>   -n: disable transferring pages in the background (for benchmark/debugging)
>  
>   example:
>   migrate -p -n tcp:<dest ip address>:4444
>  
>  
> TODO
> ====
> - benchmark/evaluation. Especially how async page fault affects the result.
> - improve/optimization
>   At the moment at least what I'm aware of is
>   - touching pages in incoming qemu process by fd handler seems suboptimal.
>     creating dedicated thread?
>   - making incoming socket non-blocking
>   - outgoing handler seems suboptimal causing latency.
> - catch up memory API change
> - consider on FUSE/CUSE possibility
> - and more...
>  
> basic postcopy work flow
> ========================
>         qemu on the destination
>               |
>               V
>         open(/dev/umem)
>               |
>               V
>         UMEM_DEV_CREATE_UMEM
>               |
>               V
>         Here we have two file descriptors to
>         umem device and shmem file
>               |
>               |                                  umemd
>               |                                  daemon on the destination
>               |
>               V    create pipe to communicate
>         fork()---------------------------------------,
>               |                                      |
>               V                                      |
>         close(socket)                                V
>         close(shmem)                              mmap(shmem file)
>               |                                      |
>               V                                      V
>         mmap(umem device) for guest RAM           close(shmem file)
>               |                                      |
>         close(umem device)                           |
>               |                                      |
>               V                                      |
>         wait for ready from daemon <----pipe-----send ready message
>               |                                      |
>               |                                 Here the daemon takes over 
>         send ok------------pipe---------------> the owner of the socket    
>               |         to the source              
>               V                                      |
>         entering post copy stage                     |
>         start guest execution                        |
>               |                                      |
>               V                                      V
>         access guest RAM                          UMEM_GET_PAGE_REQUEST
>               |                                      |
>               V                                      V
>         page fault ------------------------------>page offset is returned
>         block                                        |
>                                                      V
>                                                   pull page from the source
>                                                   write the page contents
>                                                   to the shmem.
>                                                      |
>                                                      V
>         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
>         the fault handler returns the page
>         page fault is resolved
>               |
>               |                                   pages can be sent
>               |                                   in the background
>               |                                      |
>               |                                      V
>               |                                   UMEM_MARK_PAGE_CACHED
>               |                                      |
>               V                                      V
>         The specified pages<-----pipe------------request to touch pages
>         are made present by                          |
>         touching guest RAM.                          |
>               |                                      |
>               V                                      V
>              reply-------------pipe-------------> release the cached page
>               |                                   madvise(MADV_REMOVE)
>               |                                      |
>               V                                      V
>  
>                  all the pages are pulled from the source
>  
>               |                                      |
>               V                                      V
>         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
>        (note: I'm not sure if this can be implemented or not)
>               |                                      |
>               V                                      V
>         migration completes                        exit()
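The ready/ok pipe handshake between qemu and the umemd daemon in the quoted diagram can be modelled in a few lines. This is a simulation only: /dev/umem, the guest RAM mmap, and the real socket handover are all omitted.

```python
import os

# Minimal model of the qemu <-> umemd pipe handshake from the workflow
# above: the forked daemon sends "ready", qemu replies "ok", and only
# then does post-copy execution begin.

def handshake():
    d2q_r, d2q_w = os.pipe()   # daemon -> qemu pipe
    q2d_r, q2d_w = os.pipe()   # qemu -> daemon pipe
    pid = os.fork()
    if pid == 0:                        # child: plays the umemd daemon
        os.write(d2q_w, b"ready")       # send ready message
        ok = os.read(q2d_r, 2)          # wait for ok from qemu
        os._exit(0 if ok == b"ok" else 1)
    msg = os.read(d2q_r, 5)             # qemu waits for ready from daemon
    if msg != b"ready":
        return False
    os.write(q2d_w, b"ok")              # send ok; daemon would now own the socket
    _, status = os.waitpid(pid, 0)
    return status == 0

print(handshake())                      # True when the exchange completes
```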
>  
>  
>  
> Isaku Yamahata (21):
>   arch_init: export sort_ram_list() and ram_save_block()
>   arch_init: export RAM_SAVE_xxx flags for postcopy
>   arch_init/ram_save: introduce constant for ram save version = 4
>   arch_init: refactor host_from_stream_offset()
>   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
>   arch_init: refactor ram_save_block()
>   arch_init/ram_save_live: factor out ram_save_limit
>   arch_init/ram_load: refactor ram_load
>   exec.c: factor out qemu_get_ram_ptr()
>   exec.c: export last_ram_offset()
>   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
>   savevm: qemu_pending_size() to return pending buffered size
>   savevm, buffered_file: introduce method to drain buffer of buffered
>     file
>   migration: export migrate_fd_completed() and migrate_fd_cleanup()
>   migration: factor out parameters into MigrationParams
>   umem.h: import Linux umem.h
>   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
>   configure: add CONFIG_POSTCOPY option
>   postcopy: introduce -postcopy and -postcopy-flags option
>   postcopy outgoing: add -p and -n option to migrate command
>   postcopy: implement postcopy livemigration
>  
>  Makefile.target                 |    4 +
>  arch_init.c                     |  260 ++++---
>  arch_init.h                     |   20 +
>  block-migration.c               |    8 +-
>  buffered_file.c                 |   20 +-
>  buffered_file.h                 |    1 +
>  configure                       |   12 +
>  cpu-all.h                       |    9 +
>  exec-obsolete.h                 |    1 +
>  exec.c                          |   75 +-
>  hmp-commands.hx                 |   12 +-
>  hw/hw.h                         |    7 +-
>  linux-headers/linux/umem.h      |   83 ++
>  migration-exec.c                |    8 +
>  migration-fd.c                  |   30 +
>  migration-postcopy-stub.c       |   77 ++
>  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
>  migration-tcp.c                 |   37 +-
>  migration-unix.c                |   32 +-
>  migration.c                     |   53 +-
>  migration.h                     |   49 +-
>  qemu-common.h                   |    2 +
>  qemu-options.hx                 |   25 +
>  qmp-commands.hx                 |   10 +-
>  savevm.c                        |   31 +-
>  scripts/update-linux-headers.sh |    2 +-
>  sysemu.h                        |    4 +-
>  umem.c                          |  379 ++++++++
>  umem.h                          |  105 +++
>  vl.c                            |   20 +-
>  30 files changed, 3086 insertions(+), 181 deletions(-)
>  create mode 100644 linux-headers/linux/umem.h
>  create mode 100644 migration-postcopy-stub.c
>  create mode 100644 migration-postcopy.c
>  create mode 100644 umem.c
>  create mode 100644 umem.h
>  
>  
>  

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: 回复:  [PATCH 00/21][RFC] postcopy live migration
  2012-01-11  2:45     ` [Qemu-devel] " Isaku Yamahata
@ 2012-01-12  8:29       ` thfbjyddx
  -1 siblings, 0 replies; 88+ messages in thread
From: thfbjyddx @ 2012-01-12  8:29 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

[-- Attachment #1: Type: text/plain, Size: 15647 bytes --]

Hi, I've dug into this more these days.

> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 4872:4872 postcopy_incoming_ram_load:1057: done
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS

There should be only a single EOS line. Just a copy & paste mistake?

There must be two EOS lines: one comes from postcopy_outgoing_ram_save_live(... stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from postcopy_outgoing_ram_save_live(... stage == QEMU_SAVE_LIVE_STAGE_END).
I think in postcopy the ram_save_live in the iterate part can be ignored,
so why are the qemu_put_byte(f, QEMU_VM_SECTION_PART) and qemu_put_byte(f, QEMU_VM_SECTION_END) calls still in the procedure? Are they essential?


Can you please track it down one more step?
Which line does it get stuck on in kvm_put_msrs()? kvm_put_msrs() doesn't
seem to block. (A backtrace from a debugger would be best.)

It gets to kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never returns,
so it gets stuck.

While checking the EOS problem, I commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART) and qemu_put_be32(f, se->section_id) calls
(I think this is the wrong way to fix it, and I don't know how it gets through)
and left just se->save_live_state in qemu_savevm_state_iterate.
It no longer got stuck at kvm_put_msrs(),
but it hit some other errors:
(qemu) migration-tcp: Attempting to start an incoming migration
migration-tcp: accepted migration
2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
2126:2126 postcopy_incoming_ram_load:1057: done
migration: successfully loaded vm state
2126:2126 postcopy_incoming_fork_umemd:1069: fork
2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
Can't find block !
2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
And at the same time, the destination node didn't show the EOS.

So I still can't solve the stuck problem.

Thanks for your help~!



Tommy

From: Isaku Yamahata
Date: 2012-01-11 10:45
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel] 回复: [PATCH 00/21][RFC] postcopy live migration
On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> Hello all!

Hi, thank you for detailed report. The procedure you've tried looks
good basically. Some comments below.

> I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> patched it correctly
> but it still didn't make sense and I got the same scenario as before
> outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
>  
> I think I should show what I do more clearly and hope somebody can figure out
> the problem
> 
>  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> 
>        ./configure --target-list=x86_64-softmmu --enable-kvm --enable-postcopy
> --enable-debug
>        make
>        make install
> 
>  ・ 2, outgoing qemu:
> 
> qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> -machine accel=kvm
> incoming qemu:
> qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> 
>  ・ 3, outgoing node:
> 
> migrate -d -p -n tcp:(incoming node ip):8888
>  
> result:
> 
>  ・ outgoing qemu:
> 
> info status: VM-status: paused (finish-migrate);
> 
>  ・ incoming qemu:
> 
> can't type any more, and can't kill the process (qemu-system-x86)
>  
> I enabled the debug flags in migration.c, migration-tcp.c, and migration-postcopy.c:
> 
>  ・ outgoing qemu:
> 
> (qemu) migration-tcp: connect completed
> migration: beginning savevm
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> migration: iterate
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> migration: done iterating
> 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> 
>  ・ incoming qemu:
> 
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 4872:4872 postcopy_incoming_ram_load:1057: done
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS
> 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> 4872:4872 postcopy_incoming_ram_load:1037: EOS

There should be only a single EOS line. Just a copy & paste mistake?


> from the result:
> It didn't get to the "successfully loaded vm state"
> So it is still in qemu_loadvm_state, and I found it is in
> cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and gets
> stuck

Can you please track it down one more step?
Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
block. (A backtrace from the debugger would be best.)

If possible, can you please test with a more simplified configuration,
i.e. drop devices as much as possible: no usbdevice, no disk...
That will simplify the debugging.
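As a concrete illustration of the simplified setup being suggested, a hypothetical minimal incoming invocation might look like the following (the options mirror the ones used earlier in the thread; the port and VNC display are just the example values, and dropping the disk and usbdevice is the point of the exercise):

```shell
# Hypothetical minimal incoming invocation: no -hda, no -usbdevice,
# so a hang can only come from the migration path itself.
qemu-system-x86_64 -m 256 \
    -postcopy -incoming tcp:0:8888 \
    -monitor stdio -vnc :2 \
    -machine accel=kvm
```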

thanks,

> Does anyone give some advises on the problem?
> Thanks very much~
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2011-12-29 09:25
> To: kvm; qemu-devel
> CC: yamahata; t.hirofuchi; satoshi.itoh
> Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> Intro
> =====
> This patch series implements postcopy live migration.[1]
> As discussed at KVM Forum 2011, a dedicated character device is used for
> distributed shared memory between the migration source and destination.
> Now we can discuss/benchmark/compare with precopy. I believe there is
> much room for improvement.
>  
> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>  
>  
> Usage
> =====
> You need to load the umem character device driver on the host before starting migration.
> Postcopy can be used with the tcg and kvm accelerators. The implementation depends
> only on the Linux umem character device, and the driver-dependent code is split
> into a separate file.
> I tested only the host page size == guest page size case, but the implementation
> allows the host page size != guest page size case.
>  
> The following options are added with this patch series.
> - incoming part
>   command line options
>   -postcopy [-postcopy-flags <flags>]
>   where <flags> changes behavior for benchmark/debugging.
>   Currently the following flags are available:
>   0: default
>   1: enable touching page request
>  
>   example:
>   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>  
> - outgoing part
>   options for migrate command 
>   migrate [-p [-n]] URI
>   -p: indicate postcopy migration
>   -n: disable transferring pages in the background (for benchmark/debugging)
>  
>   example:
>   migrate -p -n tcp:<dest ip address>:4444
>  
>  
> TODO
> ====
> - benchmark/evaluation. Especially how async page fault affects the result.
> - improve/optimization
>   At the moment at least what I'm aware of is
>   - touching pages in the incoming qemu process from the fd handler seems suboptimal;
>     create a dedicated thread?
>   - making incoming socket non-blocking
>   - the outgoing handler seems suboptimal, causing latency.
> - catch up with memory API changes
> - consider a FUSE/CUSE possibility
> - and more...
>  
> basic postcopy work flow
> ========================
>         qemu on the destination
>               |
>               V
>         open(/dev/umem)
>               |
>               V
>         UMEM_DEV_CREATE_UMEM
>               |
>               V
>         Here we have two file descriptors to
>         umem device and shmem file
>               |
>               |                                  umemd
>               |                                  daemon on the destination
>               |
>               V    create pipe to communicate
>         fork()---------------------------------------,
>               |                                      |
>               V                                      |
>         close(socket)                                V
>         close(shmem)                              mmap(shmem file)
>               |                                      |
>               V                                      V
>         mmap(umem device) for guest RAM           close(shmem file)
>               |                                      |
>         close(umem device)                           |
>               |                                      |
>               V                                      |
>         wait for ready from daemon <----pipe-----send ready message
>               |                                      |
>               |                                 Here the daemon takes over
>         send ok------------pipe---------------> ownership of the socket
>               |         to the source
>               V                                      |
>         entering post copy stage                     |
>         start guest execution                        |
>               |                                      |
>               V                                      V
>         access guest RAM                          UMEM_GET_PAGE_REQUEST
>               |                                      |
>               V                                      V
>         page fault ------------------------------>page offset is returned
>         block                                        |
>                                                      V
>                                                   pull page from the source
>                                                   write the page contents
>                                                   to the shmem.
>                                                      |
>                                                      V
>         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
>         the fault handler returns the page
>         page fault is resolved
>               |
>               |                                   pages can be sent
>               |                                   in the background
>               |                                      |
>               |                                      V
>               |                                   UMEM_MARK_PAGE_CACHED
>               |                                      |
>               V                                      V
>         The specified pages<-----pipe------------request to touch pages
>         are made present by                          |
>         touching guest RAM.                          |
>               |                                      |
>               V                                      V
>              reply-------------pipe-------------> release the cached page
>               |                                   madvise(MADV_REMOVE)
>               |                                      |
>               V                                      V
>  
>                  all the pages are pulled from the source
>  
>               |                                      |
>               V                                      V
>         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
>        (note: I'm not sure if this can be implemented or not)
>               |                                      |
>               V                                      V
>         migration completes                        exit()
>  
>  
>  
> Isaku Yamahata (21):
>   arch_init: export sort_ram_list() and ram_save_block()
>   arch_init: export RAM_SAVE_xxx flags for postcopy
>   arch_init/ram_save: introduce constant for ram save version = 4
>   arch_init: refactor host_from_stream_offset()
>   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
>   arch_init: refactor ram_save_block()
>   arch_init/ram_save_live: factor out ram_save_limit
>   arch_init/ram_load: refactor ram_load
>   exec.c: factor out qemu_get_ram_ptr()
>   exec.c: export last_ram_offset()
>   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
>   savevm: qemu_pending_size() to return pending buffered size
>   savevm, buffered_file: introduce method to drain buffer of buffered
>     file
>   migration: export migrate_fd_completed() and migrate_fd_cleanup()
>   migration: factor out parameters into MigrationParams
>   umem.h: import Linux umem.h
>   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
>   configure: add CONFIG_POSTCOPY option
>   postcopy: introduce -postcopy and -postcopy-flags option
>   postcopy outgoing: add -p and -n option to migrate command
>   postcopy: implement postcopy livemigration
>  
>  Makefile.target                 |    4 +
>  arch_init.c                     |  260 ++++---
>  arch_init.h                     |   20 +
>  block-migration.c               |    8 +-
>  buffered_file.c                 |   20 +-
>  buffered_file.h                 |    1 +
>  configure                       |   12 +
>  cpu-all.h                       |    9 +
>  exec-obsolete.h                 |    1 +
>  exec.c                          |   75 +-
>  hmp-commands.hx                 |   12 +-
>  hw/hw.h                         |    7 +-
>  linux-headers/linux/umem.h      |   83 ++
>  migration-exec.c                |    8 +
>  migration-fd.c                  |   30 +
>  migration-postcopy-stub.c       |   77 ++
>  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
>  migration-tcp.c                 |   37 +-
>  migration-unix.c                |   32 +-
>  migration.c                     |   53 +-
>  migration.h                     |   49 +-
>  qemu-common.h                   |    2 +
>  qemu-options.hx                 |   25 +
>  qmp-commands.hx                 |   10 +-
>  savevm.c                        |   31 +-
>  scripts/update-linux-headers.sh |    2 +-
>  sysemu.h                        |    4 +-
>  umem.c                          |  379 ++++++++
>  umem.h                          |  105 +++
>  vl.c                            |   20 +-
>  30 files changed, 3086 insertions(+), 181 deletions(-)
>  create mode 100644 linux-headers/linux/umem.h
>  create mode 100644 migration-postcopy-stub.c
>  create mode 100644 migration-postcopy.c
>  create mode 100644 umem.c
>  create mode 100644 umem.h
>  
>  
>  

-- 
yamahata

[-- Attachment #2: Type: text/html, Size: 46327 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread


* Re: Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-12  8:29       ` [Qemu-devel] " thfbjyddx
@ 2012-01-12  8:54         ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-12  8:54 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> Hi, I've dug into it more these days
>  
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only single EOS line. Just copy & past miss?
>  
> There must be two EOS lines: one comes from postcopy_outgoing_ram_save_live
> (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> I think in postcopy the ram_save_live in the iterate part can be ignored,
> so why are the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> qemu_put_byte(f, QEMU_VM_SECTION_END) calls still in the procedure? Are they essential?

Not so essential.

> Can you please track it down one more step?
> Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> block.(backtrace by the debugger would be best.)
>
> it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) call and never returns,
> so it gets stuck

Do you know which wchan the process was blocked at?
kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
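For reference, a sketch of one way to gather this information; the background `sleep` below only stands in for the hung process, and the commented gdb line is what would be run against the real qemu:

```shell
# Demonstrate reading a blocked process's kernel wait channel (wchan) from /proc.
# A background 'sleep' stands in for the hung qemu-system-x86 process.
sleep 30 &
pid=$!
sleep 1                                   # give it time to block in the kernel
echo "wchan: $(cat /proc/$pid/wchan)"     # e.g. hrtimer_nanosleep while sleeping
kill $pid

# For the real case, attach to the hung qemu and dump all thread backtraces:
#   gdb -p "$(pidof qemu-system-x86_64)" -batch -ex 'thread apply all bt'
```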


> When I checked the EOS problem,
> I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART); and qemu_put_be32
> (f, se->section_id) calls
> (I think this is a wrong way to fix it and I don't know how it gets through)
> and left just the se->save_live_state call in qemu_savevm_state_iterate.
> It didn't get stuck at kvm_put_msrs(),
> but it hit some other error:
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 2126:2126 postcopy_incoming_ram_load:1057: done
> migration: successfully loaded vm state
> 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> Can't find block !
> 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> and at the same time , the destination node didn't show the EOS
>  
> so I still can't solve the stuck problem
> Thanks for your help~!
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-11 10:45
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > Hello all!
>  
> Hi, thank you for detailed report. The procedure you've tried looks
> good basically. Some comments below.
>  
> > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > patched it correctly
> > but it still didn't make sense and I got the same scenario as before
> > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> >  
> > I think I should show what I do more clearly and hope somebody can figure out
> > the problem
> > 
> >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > 
> >        ./configure --target-list=x86_64-softmmu --enable-kvm --enable-postcopy
> > --enable-debug
> >        make
> >        make install
> > 
> >  ・ 2, outgoing qemu:
> > 
> > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > -machine accel=kvm
> > incoming qemu:
> > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > 
> >  ・ 3, outgoing node:
> > 
> > migrate -d -p -n tcp:(incoming node ip):8888
> >  
> > result:
> > 
> >  ・ outgoing qemu:
> > 
> > info status: VM-status: paused (finish-migrate);
> > 
> >  ・ incoming qemu:
> > 
> > can't type any more and can't kill the process(qemu-system-x86)
> >  
> > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > 
> >  ・ outgoing qemu:
> > 
> > (qemu) migration-tcp: connect completed
> > migration: beginning savevm
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > migration: iterate
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > migration: done iterating
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > 
> >  ・ incoming qemu:
> > 
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only a single EOS line. Just a copy & paste mistake?
>  
>  
> > from the result:
> > It didn't get to the "successfully loaded vm state"
> > So it still in the qemu_loadvm_state, and I found it's in
> > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> > stuck
>  
> Can you please track it down one more step?
> Which line did it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem
> to block. (A backtrace from the debugger would be best.)
>  
> If possible, can you please test with a more simplified configuration,
> i.e. drop devices as much as possible: no usbdevice, no disk...
> That will simplify debugging.
>  
> thanks,
>  
> > Can anyone give some advice on the problem?
> > Thanks very much~
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2011-12-29 09:25
> > To: kvm; qemu-devel
> > CC: yamahata; t.hirofuchi; satoshi.itoh
> > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > Intro
> > =====
> > This patch series implements postcopy live migration.[1]
> > As discussed at KVM Forum 2011, a dedicated character device is used for
> > distributed shared memory between the migration source and destination.
> > Now we can discuss/benchmark/compare it with precopy. I believe there is
> > much room for improvement.
> >  
> > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> >  
> >  
> > Usage
> > =====
> > You need to load the umem character device on the host before starting migration.
> > Postcopy can be used with the tcg and kvm accelerators. The implementation
> > depends only on the Linux umem character device, but the driver-dependent code
> > is split into a separate file.
> > I tested only the host page size == guest page size case, but the implementation
> > allows the host page size != guest page size case.
> >  
> > The following options are added with this patch series.
> > - incoming part
> >   command line options
> >   -postcopy [-postcopy-flags <flags>]
> >   where <flags> changes behavior for benchmarking/debugging.
> >   Currently the following flags are available:
> >   0: default
> >   1: enable touching page request
> >  
> >   example:
> >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> >  
> > - outgoing part
> >   options for the migrate command
> >   migrate [-p [-n]] URI
> >   -p: use postcopy migration
> >   -n: disable transferring pages in the background (for benchmarking/debugging)
> >  
> >   example:
> >   migrate -p -n tcp:<dest ip address>:4444
> >  
> >  
> > TODO
> > ====
> > - benchmark/evaluation. Especially how async page fault affects the result.
> > - improve/optimization
> >   At the moment at least what I'm aware of is
> >   - touching pages in the incoming qemu process via the fd handler seems
> >     suboptimal; create a dedicated thread?
> >   - making the incoming socket non-blocking
> >   - the outgoing handler seems suboptimal, causing latency.
> > - catch up memory API change
> > - consider on FUSE/CUSE possibility
> > - and more...
> >  
> > basic postcopy work flow
> > ========================
> >         qemu on the destination
> >               |
> >               V
> >         open(/dev/umem)
> >               |
> >               V
> >         UMEM_DEV_CREATE_UMEM
> >               |
> >               V
> >         Here we have two file descriptors to
> >         umem device and shmem file
> >               |
> >               |                                  umemd
> >               |                                  daemon on the destination
> >               |
> >               V    create pipe to communicate
> >         fork()---------------------------------------,
> >               |                                      |
> >               V                                      |
> >         close(socket)                                V
> >         close(shmem)                              mmap(shmem file)
> >               |                                      |
> >               V                                      V
> >         mmap(umem device) for guest RAM           close(shmem file)
> >               |                                      |
> >         close(umem device)                           |
> >               |                                      |
> >               V                                      |
> >         wait for ready from daemon <----pipe-----send ready message
> >               |                                      |
> >               |                                 Here the daemon takes over 
> >         send ok------------pipe---------------> the owner of the socket    
> >               |         to the source              
> >               V                                      |
> >         entering post copy stage                     |
> >         start guest execution                        |
> >               |                                      |
> >               V                                      V
> >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> >               |                                      |
> >               V                                      V
> >         page fault ------------------------------>page offset is returned
> >         block                                        |
> >                                                      V
> >                                                   pull page from the source
> >                                                   write the page contents
> >                                                   to the shmem.
> >                                                      |
> >                                                      V
> >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> >         the fault handler returns the page
> >         page fault is resolved
> >               |
> >               |                                   pages can be sent
> >               |                                   in the background
> >               |                                      |
> >               |                                      V
> >               |                                   UMEM_MARK_PAGE_CACHED
> >               |                                      |
> >               V                                      V
> >         The specified pages<-----pipe------------request to touch pages
> >         are made present by                          |
> >         touching guest RAM.                          |
> >               |                                      |
> >               V                                      V
> >              reply-------------pipe-------------> release the cached page
> >               |                                   madvise(MADV_REMOVE)
> >               |                                      |
> >               V                                      V
> >  
> >                  all the pages are pulled from the source
> >  
> >               |                                      |
> >               V                                      V
> >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> >        (note: I'm not sure if this can be implemented or not)
> >               |                                      |
> >               V                                      V
> >         migration completes                        exit()
> >  
> >  
> >  
> > Isaku Yamahata (21):
> >   arch_init: export sort_ram_list() and ram_save_block()
> >   arch_init: export RAM_SAVE_xxx flags for postcopy
> >   arch_init/ram_save: introduce constant for ram save version = 4
> >   arch_init: refactor host_from_stream_offset()
> >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> >   arch_init: refactor ram_save_block()
> >   arch_init/ram_save_live: factor out ram_save_limit
> >   arch_init/ram_load: refactor ram_load
> >   exec.c: factor out qemu_get_ram_ptr()
> >   exec.c: export last_ram_offset()
> >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> >   savevm: qemu_pending_size() to return pending buffered size
> >   savevm, buffered_file: introduce method to drain buffer of buffered
> >     file
> >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> >   migration: factor out parameters into MigrationParams
> >   umem.h: import Linux umem.h
> >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> >   configure: add CONFIG_POSTCOPY option
> >   postcopy: introduce -postcopy and -postcopy-flags option
> >   postcopy outgoing: add -p and -n option to migrate command
> >   postcopy: implement postcopy livemigration
> >  
> >  Makefile.target                 |    4 +
> >  arch_init.c                     |  260 ++++---
> >  arch_init.h                     |   20 +
> >  block-migration.c               |    8 +-
> >  buffered_file.c                 |   20 +-
> >  buffered_file.h                 |    1 +
> >  configure                       |   12 +
> >  cpu-all.h                       |    9 +
> >  exec-obsolete.h                 |    1 +
> >  exec.c                          |   75 +-
> >  hmp-commands.hx                 |   12 +-
> >  hw/hw.h                         |    7 +-
> >  linux-headers/linux/umem.h      |   83 ++
> >  migration-exec.c                |    8 +
> >  migration-fd.c                  |   30 +
> >  migration-postcopy-stub.c       |   77 ++
> >  migration-postcopy.c            |
>  1891 +++++++++++++++++++++++++++++++++++++++
> >  migration-tcp.c                 |   37 +-
> >  migration-unix.c                |   32 +-
> >  migration.c                     |   53 +-
> >  migration.h                     |   49 +-
> >  qemu-common.h                   |    2 +
> >  qemu-options.hx                 |   25 +
> >  qmp-commands.hx                 |   10 +-
> >  savevm.c                        |   31 +-
> >  scripts/update-linux-headers.sh |    2 +-
> >  sysemu.h                        |    4 +-
> >  umem.c                          |  379 ++++++++
> >  umem.h                          |  105 +++
> >  vl.c                            |   20 +-
> >  30 files changed, 3086 insertions(+), 181 deletions(-)
> >  create mode 100644 linux-headers/linux/umem.h
> >  create mode 100644 migration-postcopy-stub.c
> >  create mode 100644 migration-postcopy.c
> >  create mode 100644 umem.c
> >  create mode 100644 umem.h
> >  
> >  
> >  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
@ 2012-01-12  8:54         ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-12  8:54 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> Hi, I've dug into this more these days
>  
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only a single EOS line. Just a copy & paste mistake?
>  
> There must be two EOS, for one comes from postcopy_outgoing_ram_save_live
> (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> I think in postcopy the ram_save_live in the iterate part can be ignored,
> so why are there still the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> qemu_put_byte(f, QEMU_VM_SECTION_END) calls in the procedure? Are they essential?

Not so essential.

> Can you please track it down one more step?
> Which line did it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem
> to block. (A backtrace from the debugger would be best.)
>
> it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never returns,
> so it gets stuck

Do you know what wchan the process was blocked at?
kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
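
A quick way to answer the wchan question from the host (a sketch, assuming Linux procfs; it uses the current shell's own pid as a placeholder for the stuck qemu process):

```shell
# Substitute the pid of the stuck qemu-system-x86_64 for $$ below.
pid=$$
ps -o pid,wchan:25,comm -p "$pid"   # kernel symbol the task is sleeping in
cat "/proc/$pid/wchan"; echo        # same info, straight from procfs
# Userspace backtraces of all threads (needs the --enable-debug build):
# gdb -p "$pid" -batch -ex 'thread apply all bt'
```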


> When I checked the EOS problem,
> I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART); and qemu_put_be32
> (f, se->section_id)
> (I think this is a wrong way to fix it and I don't know how it gets through)
> and left just the se->save_live_state in qemu_savevm_state_iterate.
> It didn't get stuck at kvm_put_msrs(),
> but it hit some other error:
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 2126:2126 postcopy_incoming_ram_load:1057: done
> migration: successfully loaded vm state
> 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> Can't find block !
> 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> and at the same time, the destination node didn't show the EOS
>  
> so I still can't solve the stuck problem
> Thanks for your help~!
> [snip: remainder of the quoted thread, identical to the quote in the message above]

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-12  8:54         ` [Qemu-devel] " Isaku Yamahata
@ 2012-01-12 13:26           ` thfbjyddx
  -1 siblings, 0 replies; 88+ messages in thread
From: thfbjyddx @ 2012-01-12 13:26 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

[-- Attachment #1: Type: text/plain, Size: 17124 bytes --]


> Do you know what wchan the process was blocked at?
> kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.

It's:
WCHAN           COMMAND
umem_fault      qemu-system-x86





Tommy

From: Isaku Yamahata
Date: 2012-01-12 16:54
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> Hi , I've dug more thess days
>  
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only single EOS line. Just copy & past miss?
>  
> There must be two EOS for one is coming from postcopy_outgoing_ram_save_live
> (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other is
> postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END)
> I think in postcopy the ram_save_live in the iterate part can be ignore
> so why there still have the qemu_put_byte(f, QEMU_VM_SECTON_PART) and
> qemu_put_byte(f, QEMU_VM_SECTON_END) in the procedure? Is it essential?

Not so essential.

> Can you please track it down one more step?
> Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> block.(backtrace by the debugger would be best.)
>
> it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never return
> so it gets stuck

Do you know what wchan the process was blocked at?
kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.


> when I check the EOS problem
> I just annotated the qemu_put_byte(f, QEMU_VM_SECTION_PART); and qemu_put_be32
> (f, se->section_id)
>  (I think this is a wrong way to fix it and I don't know how it get through)
> and leave just the se->save_live_state in the qemu_savevm_state_iterate
> it didn't get stuck at kvm_put_msrs()
> but it has some other error
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 2126:2126 postcopy_incoming_ram_load:1057: done
> migration: successfully loaded vm state
> 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> Can't find block !
> 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> and at the same time , the destination node didn't show the EOS
>  
> so I still can't solve the stuck problem
> Thanks for your help~!
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-11 10:45
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live migration
> On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > Hello all!
>  
> Hi, thank you for the detailed report. The procedure you've tried looks
> basically sound. Some comments below.
>  
> > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > patched it correctly
> > but it still didn't work and I got the same failure scenario as before
> > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> >  
> > I think I should show what I did more clearly and hope somebody can figure out
> > the problem
> > 
> >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > 
> >        ./configure --target-list=
> x86_64-softmmu --enable-kvm --enable-postcopy
> > --enable-debug
> >        make
> >        make install
> > 
> >  ・ 2, outgoing qemu:
> > 
> > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > -machine accel=kvm
> > incoming qemu:
> > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > 
> >  ・ 3, outgoing node:
> > 
> > migrate -d -p -n tcp:(incoming node ip):8888
> >  
> > result:
> > 
> >  ・ outgoing qemu:
> > 
> > info status: VM-status: paused (finish-migrate);
> > 
> >  ・ incoming qemu:
> > 
> > can't type anything more and can't kill the process (qemu-system-x86)
> >  
> > I enabled the debug flags in migration.c, migration-tcp.c and migration-postcopy.c:
> > 
> >  ・ outgoing qemu:
> > 
> > (qemu) migration-tcp: connect completed
> > migration: beginning savevm
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > migration: iterate
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > migration: done iterating
> > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > 
> >  ・ incoming qemu:
> > 
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only a single EOS line. Just a copy & paste mistake?
>  
>  
> > from the result:
> > It didn't get to the "successfully loaded vm state" line.
> > So it is still in qemu_loadvm_state, and I found it gets stuck in
> > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs
>  
> Can you please track it down one more step?
> Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> block. (A backtrace from the debugger would be best.)
>  
> If possible, can you please test with a more simplified configuration,
> i.e. drop as many devices as possible (no usbdevice, no disk...),
> so that the debugging is simplified.
>  
> thanks,
>  
> > Can anyone give some advice on the problem?
> > Thanks very much~
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2011-12-29 09:25
> > To: kvm; qemu-devel
> > CC: yamahata; t.hirofuchi; satoshi.itoh
> > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > Intro
> > =====
> > This patch series implements postcopy live migration.[1]
> > As discussed at KVM forum 2011, a dedicated character device is used for
> > distributed shared memory between the migration source and destination.
> > Now we can discuss/benchmark/compare with precopy. I believe there is
> > much room for improvement.
> >  
> > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> >  
> >  
> > Usage
> > =====
> > You need to load the umem character device driver on the host before starting
> > migration. Postcopy can be used with the TCG and KVM accelerators. The
> > implementation depends only on the Linux umem character device, and the
> > driver-dependent code is split into its own file.
> > I tested only the host page size == guest page size case, but the
> > implementation allows the host page size != guest page size case.
> >  
> > The following options are added with this patch series.
> > - incoming part
> >   command line options
> >   -postcopy [-postcopy-flags <flags>]
> >   where <flags> changes behavior for benchmarking/debugging.
> >   Currently the following flags are available:
> >   0: default
> >   1: enable touching page request
> >  
> >   example:
> >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> >  
> > - outgoing part
> >   options for migrate command 
> >   migrate [-p [-n]] URI
> >   -p: indicate postcopy migration
> >   -n: disable transferring pages in the background (for benchmarking/debugging)
> >  
> >   example:
> >   migrate -p -n tcp:<dest ip address>:4444
> >  
> >  
> > TODO
> > ====
> > - benchmark/evaluation. Especially how async page fault affects the result.
> > - improve/optimization
> >   At the moment at least what I'm aware of is
> >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> >     creating dedicated thread?
> >   - making incoming socket non-blocking
> >   - outgoing handler seems suboptimal causing latency.
> > - catch up memory API change
> > - consider on FUSE/CUSE possibility
> > - and more...
> >  
> > basic postcopy work flow
> > ========================
> >         qemu on the destination
> >               |
> >               V
> >         open(/dev/umem)
> >               |
> >               V
> >         UMEM_DEV_CREATE_UMEM
> >               |
> >               V
> >         Here we have two file descriptors to
> >         umem device and shmem file
> >               |
> >               |                                  umemd
> >               |                                  daemon on the destination
> >               |
> >               V    create pipe to communicate
> >         fork()---------------------------------------,
> >               |                                      |
> >               V                                      |
> >         close(socket)                                V
> >         close(shmem)                              mmap(shmem file)
> >               |                                      |
> >               V                                      V
> >         mmap(umem device) for guest RAM           close(shmem file)
> >               |                                      |
> >         close(umem device)                           |
> >               |                                      |
> >               V                                      |
> >         wait for ready from daemon <----pipe-----send ready message
> >               |                                      |
> >               |                                 Here the daemon takes over 
> >         send ok------------pipe---------------> the owner of the socket    
> >               |         to the source              
> >               V                                      |
> >         entering post copy stage                     |
> >         start guest execution                        |
> >               |                                      |
> >               V                                      V
> >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> >               |                                      |
> >               V                                      V
> >         page fault ------------------------------>page offset is returned
> >         block                                        |
> >                                                      V
> >                                                   pull page from the source
> >                                                   write the page contents
> >                                                   to the shmem.
> >                                                      |
> >                                                      V
> >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> >         the fault handler returns the page
> >         page fault is resolved
> >               |
> >               |                                   pages can be sent
> >               |                                   in the background
> >               |                                      |
> >               |                                      V
> >               |                                   UMEM_MARK_PAGE_CACHED
> >               |                                      |
> >               V                                      V
> >         The specified pages<-----pipe------------request to touch pages
> >         are made present by                          |
> >         touching guest RAM.                          |
> >               |                                      |
> >               V                                      V
> >              reply-------------pipe-------------> release the cached page
> >               |                                   madvise(MADV_REMOVE)
> >               |                                      |
> >               V                                      V
> >  
> >                  all the pages are pulled from the source
> >  
> >               |                                      |
> >               V                                      V
> >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> >        (note: I'm not sure if this can be implemented or not)
> >               |                                      |
> >               V                                      V
> >         migration completes                        exit()
> >  
> >  
> >  
> > Isaku Yamahata (21):
> >   arch_init: export sort_ram_list() and ram_save_block()
> >   arch_init: export RAM_SAVE_xxx flags for postcopy
> >   arch_init/ram_save: introduce constant for ram save version = 4
> >   arch_init: refactor host_from_stream_offset()
> >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> >   arch_init: refactor ram_save_block()
> >   arch_init/ram_save_live: factor out ram_save_limit
> >   arch_init/ram_load: refactor ram_load
> >   exec.c: factor out qemu_get_ram_ptr()
> >   exec.c: export last_ram_offset()
> >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> >   savevm: qemu_pending_size() to return pending buffered size
> >   savevm, buffered_file: introduce method to drain buffer of buffered
> >     file
> >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> >   migration: factor out parameters into MigrationParams
> >   umem.h: import Linux umem.h
> >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> >   configure: add CONFIG_POSTCOPY option
> >   postcopy: introduce -postcopy and -postcopy-flags option
> >   postcopy outgoing: add -p and -n option to migrate command
> >   postcopy: implement postcopy livemigration
> >  
> >  Makefile.target                 |    4 +
> >  arch_init.c                     |  260 ++++---
> >  arch_init.h                     |   20 +
> >  block-migration.c               |    8 +-
> >  buffered_file.c                 |   20 +-
> >  buffered_file.h                 |    1 +
> >  configure                       |   12 +
> >  cpu-all.h                       |    9 +
> >  exec-obsolete.h                 |    1 +
> >  exec.c                          |   75 +-
> >  hmp-commands.hx                 |   12 +-
> >  hw/hw.h                         |    7 +-
> >  linux-headers/linux/umem.h      |   83 ++
> >  migration-exec.c                |    8 +
> >  migration-fd.c                  |   30 +
> >  migration-postcopy-stub.c       |   77 ++
> >  migration-postcopy.c            |
>  1891 +++++++++++++++++++++++++++++++++++++++
> >  migration-tcp.c                 |   37 +-
> >  migration-unix.c                |   32 +-
> >  migration.c                     |   53 +-
> >  migration.h                     |   49 +-
> >  qemu-common.h                   |    2 +
> >  qemu-options.hx                 |   25 +
> >  qmp-commands.hx                 |   10 +-
> >  savevm.c                        |   31 +-
> >  scripts/update-linux-headers.sh |    2 +-
> >  sysemu.h                        |    4 +-
> >  umem.c                          |  379 ++++++++
> >  umem.h                          |  105 +++
> >  vl.c                            |   20 +-
> >  30 files changed, 3086 insertions(+), 181 deletions(-)
> >  create mode 100644 linux-headers/linux/umem.h
> >  create mode 100644 migration-postcopy-stub.c
> >  create mode 100644 migration-postcopy.c
> >  create mode 100644 umem.c
> >  create mode 100644 umem.h
> >  
> >  
> >  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
@ 2012-01-12 13:26           ` thfbjyddx
  0 siblings, 0 replies; 88+ messages in thread
From: thfbjyddx @ 2012-01-12 13:26 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

[-- Attachment #1: Type: text/plain, Size: 17124 bytes --]


Do you know what wchan the process was blocked at?
kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.

It's
WCHAN              COMMAND
umem_fault------qemu-system-x86





Tommy

From: Isaku Yamahata
Date: 2012-01-12 16:54
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> Hi, I've dug in more these days
>  
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 4872:4872 postcopy_incoming_ram_load:1057: done
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > 4872:4872 postcopy_incoming_ram_load:1037: EOS
>  
> There should be only a single EOS line. Just a copy & paste mistake?
>  
> There must be two EOS lines, for one comes from postcopy_outgoing_ram_save_live
> (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> I think in postcopy the ram_save_live in the iterate part can be ignored,
> so why are the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> qemu_put_byte(f, QEMU_VM_SECTION_END) calls still in the procedure? Are they essential?

Not so essential.

> Can you please track it down one more step?
> Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> block. (A backtrace from the debugger would be best.)
>
> it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) call and never returns,
> so it gets stuck

Do you know what wchan the process was blocked at?
kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.


> When I checked the EOS problem,
> I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART); and qemu_put_be32
> (f, se->section_id) calls
> (I think this is the wrong way to fix it and I don't know how it gets through)
> and left just the se->save_live_state call in qemu_savevm_state_iterate.
> It didn't get stuck at kvm_put_msrs(),
> but it hit some other error:
> (qemu) migration-tcp: Attempting to start an incoming migration
> migration-tcp: accepted migration
> 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> 2126:2126 postcopy_incoming_ram_load:1057: done
> migration: successfully loaded vm state
> 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> Can't find block !
> 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> and at the same time , the destination node didn't show the EOS
>  
> so I still can't solve the stuck problem
> Thanks for your help~!
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-11 10:45
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> [... quoted Jan-07 report and the original cover letter snipped; they are quoted in full in the preceding message ...]
>  
> -- 
> yamahata
>  
>  

-- 
yamahata



* Re: [PATCH 21/21] postcopy: implement postcopy livemigration
  2012-01-04  3:29       ` [Qemu-devel] " Isaku Yamahata
@ 2012-01-12 14:15         ` Avi Kivity
  -1 siblings, 0 replies; 88+ messages in thread
From: Avi Kivity @ 2012-01-12 14:15 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: kvm, qemu-devel, t.hirofuchi, satoshi.itoh

On 01/04/2012 05:29 AM, Isaku Yamahata wrote:
> On Thu, Dec 29, 2011 at 06:06:10PM +0200, Avi Kivity wrote:
> > On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > > This patch implements postcopy livemigration.
> > >
> > >  
> > > +/* RAM is allocated via umem for postcopy incoming mode */
> > > +#define RAM_POSTCOPY_UMEM_MASK  (1 << 1)
> > > +
> > >  typedef struct RAMBlock {
> > >      uint8_t *host;
> > >      ram_addr_t offset;
> > > @@ -485,6 +488,10 @@ typedef struct RAMBlock {
> > >  #if defined(__linux__) && !defined(TARGET_S390X)
> > >      int fd;
> > >  #endif
> > > +
> > > +#ifdef CONFIG_POSTCOPY
> > > +    UMem *umem;    /* for incoming postcopy mode */
> > > +#endif
> > >  } RAMBlock;
> > 
> > Is it possible to implement this via the MemoryListener API (which
> > replaces CPUPhysMemoryClient)?  This is how kvm, vhost, and xen manage
> > their memory tables.
>
> I'm afraid not. Those three you listed above are for the outgoing part,
> but this case is for the incoming part. The requirement is quite different
> from those three. What is needed is:
> - get the corresponding RAMBlock and UMem from (id, idlen)
> - hook ram_alloc/ram_free (or the corresponding RAM API)
>

Okay.  We'll need more hooks then.  Xen already hooks qemu_ram_alloc(),
so there's more than one user.

But don't spend time on this; this area is in flux due to the memory
API, so any effort will be wasted.  I'll look at adding those hooks,
either before or after postcopy is merged.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: 回复: [PATCH 00/21][RFC] postcopy live migration
  2012-01-12 13:26           ` [Qemu-devel] " thfbjyddx
@ 2012-01-16  6:51             ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-16  6:51 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

Thank you for your info.
I believe I found the cause: MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
Your kernel enables KVM paravirt_ops, right?

Although I'm preparing the next patch series including the fixes,
you can also try postcopy by disabling paravirt_ops or disabling kvm
(use tcg, i.e. -machine accel=tcg).

thanks,


On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
>  
> Do you know what wchan the process was blocked at?
> kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
>  
> It's
> WCHAN              COMMAND
> umem_fault------qemu-system-x86
>  
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-12 16:54
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] 回复: [PATCH 00/21][RFC] postcopy live migration
> On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > Hi, I've dug more these days
> >  
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> >  
> > There should be only a single EOS line. Just a copy & paste mistake?
> >  
> > There must be two EOS lines: one comes from postcopy_outgoing_ram_save_live
> > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> > I think in postcopy the ram_save_live in the iterate part can be ignored,
> > so why are the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > qemu_put_byte(f, QEMU_VM_SECTION_END) calls still in the procedure? Are they essential?
>  
> Not so essential.
>  
> > Can you please track it down one more step?
> > Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > block. (A backtrace from the debugger would be best.)
> >
> > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) call and never returns,
> > so it gets stuck
>  
> Do you know what wchan the process was blocked at?
> kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
>  
>  
> > When I checked the EOS problem,
> > I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART); and
> > qemu_put_be32(f, se->section_id)
> > (I think this is a wrong way to fix it and I don't know how it got through)
> > and left just the se->save_live_state in qemu_savevm_state_iterate.
> > It didn't get stuck at kvm_put_msrs(),
> > but it hit some other error:
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 2126:2126 postcopy_incoming_ram_load:1057: done
> > migration: successfully loaded vm state
> > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > Can't find block !
> > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > and at the same time, the destination node didn't show the EOS
> >  
> > so I still can't solve the stuck problem
> > Thanks for your help~!
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-11 10:45
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel] 回复: [PATCH 00/21][RFC] postcopy live migration
> > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > Hello all!
> >  
> > Hi, thank you for the detailed report. The procedure you've tried looks
> > basically good. Some comments below.
> >  
> > > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > patched it correctly
> > > but it still didn't make sense and I got the same scenario as before
> > > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> > >  
> > >
> > > I think I should show what I do more clearly and hope somebody can figure out
> > > the problem
> > > 
> > >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > > 
> > >        ./configure --target-list=x86_64-softmmu --enable-kvm --enable-postcopy
> > > --enable-debug
> > >        make
> > >        make install
> > > 
> > >  ・ 2, outgoing qemu:
> > > 
> > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc :2 -usbdevice tablet
> > > -machine accel=kvm
> > > incoming qemu:
> > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > stdio -vnc :2 -usbdevice tablet -machine accel=kvm
> > > 
> > >  ・ 3, outgoing node:
> > > 
> > > migrate -d -p -n tcp:(incoming node ip):8888
> > >  
> > > result:
> > > 
> > >  ・ outgoing qemu:
> > > 
> > > info status: VM-status: paused (finish-migrate);
> > > 
> > >  ・ incoming qemu:
> > > 
> > > can't type any more and can't kill the process(qemu-system-x86)
> > >  
> > > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > 
> > >  ・ outgoing qemu:
> > > 
> > > (qemu) migration-tcp: connect completed
> > > migration: beginning savevm
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > migration: iterate
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > migration: done iterating
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > 
> > >  ・ incoming qemu:
> > > 
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> >  
> > There should be only a single EOS line. Just a copy & paste mistake?
> >  
> >  
> > > from the result:
> > > It didn't get to the "successfully loaded vm state"
> > > So it is still in qemu_loadvm_state, and I found it's in
> > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> > > stuck
> >  
> > Can you please track it down one more step?
> > Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > block. (A backtrace from the debugger would be best.)
> >  
> > If possible, can you please test with a more simplified configuration,
> > i.e. drop devices as much as possible: no usbdevice, no disk...
> > That will simplify the debugging.
> >  
> > thanks,
> >  
> > > Can anyone give some advice on the problem?
> > > Thanks very much~
> > >  
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━
> > ━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2011-12-29 09:25
> > > To: kvm; qemu-devel
> > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > Intro
> > > =====
> > > This patch series implements postcopy live migration.[1]
> > > As discussed at KVM forum 2011, dedicated character device is used for
> > > distributed shared memory between migration source and destination.
> > > Now we can discuss/benchmark/compare with precopy. I believe there are
> > > much rooms for improvement.
> > >  
> > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > >  
> > >  
> > > Usage
> > > =====
> > > You need to load the umem character device on the host before starting migration.
> > > Postcopy can be used with the tcg and kvm accelerators. The implementation depends
> > > only on the linux umem character device, but the driver-dependent code is split
> > > into a separate file.
> > > I tested only the host page size == guest page size case, but the implementation
> > > allows the host page size != guest page size case.
> > >  
> > > The following options are added with this patch series.
> > > - incoming part
> > >   command line options
> > >   -postcopy [-postcopy-flags <flags>]
> > >   where flags is for changing behavior for benchmark/debugging
> > >   Currently the following flags are available
> > >   0: default
> > >   1: enable touching page request
> > >  
> > >   example:
> > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > >  
> > > - outgoing part
> > >   options for the migrate command
> > >   migrate [-p [-n]] URI
> > >   -p: indicate postcopy migration
> > >   -n: disable background transfer of pages: this is for benchmark/debugging
> > >  
> > >   example:
> > >   migrate -p -n tcp:<dest ip address>:4444
> > >  
> > >  
> > > TODO
> > > ====
> > > - benchmark/evaluation. Especially how async page fault affects the result.
> > > - improve/optimization
> > >   At the moment at least what I'm aware of is
> > >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> > >     creating dedicated thread?
> > >   - making incoming socket non-blocking
> > >   - outgoing handler seems suboptimal causing latency.
> > > - catch up memory API change
> > > - consider on FUSE/CUSE possibility
> > > - and more...
> > >  
> > > basic postcopy work flow
> > > ========================
> > >         qemu on the destination
> > >               |
> > >               V
> > >         open(/dev/umem)
> > >               |
> > >               V
> > >         UMEM_DEV_CREATE_UMEM
> > >               |
> > >               V
> > >         Here we have two file descriptors to
> > >         umem device and shmem file
> > >               |
> > >               |                                  umemd
> > >               |                                  daemon on the destination
> > >               |
> > >               V    create pipe to communicate
> > >         fork()---------------------------------------,
> > >               |                                      |
> > >               V                                      |
> > >         close(socket)                                V
> > >         close(shmem)                              mmap(shmem file)
> > >               |                                      |
> > >               V                                      V
> > >         mmap(umem device) for guest RAM           close(shmem file)
> > >               |                                      |
> > >         close(umem device)                           |
> > >               |                                      |
> > >               V                                      |
> > >         wait for ready from daemon <----pipe-----send ready message
> > >               |                                      |
> > >               |                                 Here the daemon takes over 
> > >         send ok------------pipe---------------> the owner of the socket    
> > >               |         to the source              
> > >               V                                      |
> > >         entering post copy stage                     |
> > >         start guest execution                        |
> > >               |                                      |
> > >               V                                      V
> > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > >               |                                      |
> > >               V                                      V
> > >         page fault ------------------------------>page offset is returned
> > >         block                                        |
> > >                                                      V
> > >                                                   pull page from the source
> > >                                                   write the page contents
> > >                                                   to the shmem.
> > >                                                      |
> > >                                                      V
> > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > >         the fault handler returns the page
> > >         page fault is resolved
> > >               |
> > >               |                                   pages can be sent
> > >               |                                   in the background
> > >               |                                      |
> > >               |                                      V
> > >               |                                   UMEM_MARK_PAGE_CACHED
> > >               |                                      |
> > >               V                                      V
> > >         The specified pages<-----pipe------------request to touch pages
> > >         are made present by                          |
> > >         touching guest RAM.                          |
> > >               |                                      |
> > >               V                                      V
> > >              reply-------------pipe-------------> release the cached page
> > >               |                                   madvise(MADV_REMOVE)
> > >               |                                      |
> > >               V                                      V
> > >  
> > >                  all the pages are pulled from the source
> > >  
> > >               |                                      |
> > >               V                                      V
> > >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> > >        (note: I'm not sure if this can be implemented or not)
> > >               |                                      |
> > >               V                                      V
> > >         migration completes                        exit()
> > >  
> > >  
> > >  
> > > Isaku Yamahata (21):
> > >   arch_init: export sort_ram_list() and ram_save_block()
> > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > >   arch_init/ram_save: introduce constant for ram save version = 4
> > >   arch_init: refactor host_from_stream_offset()
> > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > >   arch_init: refactor ram_save_block()
> > >   arch_init/ram_save_live: factor out ram_save_limit
> > >   arch_init/ram_load: refactor ram_load
> > >   exec.c: factor out qemu_get_ram_ptr()
> > >   exec.c: export last_ram_offset()
> > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > >   savevm: qemu_pending_size() to return pending buffered size
> > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > >     file
> > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > >   migration: factor out parameters into MigrationParams
> > >   umem.h: import Linux umem.h
> > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > >   configure: add CONFIG_POSTCOPY option
> > >   postcopy: introduce -postcopy and -postcopy-flags option
> > >   postcopy outgoing: add -p and -n option to migrate command
> > >   postcopy: implement postcopy livemigration
> > >  
> > >  Makefile.target                 |    4 +
> > >  arch_init.c                     |  260 ++++---
> > >  arch_init.h                     |   20 +
> > >  block-migration.c               |    8 +-
> > >  buffered_file.c                 |   20 +-
> > >  buffered_file.h                 |    1 +
> > >  configure                       |   12 +
> > >  cpu-all.h                       |    9 +
> > >  exec-obsolete.h                 |    1 +
> > >  exec.c                          |   75 +-
> > >  hmp-commands.hx                 |   12 +-
> > >  hw/hw.h                         |    7 +-
> > >  linux-headers/linux/umem.h      |   83 ++
> > >  migration-exec.c                |    8 +
> > >  migration-fd.c                  |   30 +
> > >  migration-postcopy-stub.c       |   77 ++
> > >  migration-postcopy.c            |
> >  1891 +++++++++++++++++++++++++++++++++++++++
> > >  migration-tcp.c                 |   37 +-
> > >  migration-unix.c                |   32 +-
> > >  migration.c                     |   53 +-
> > >  migration.h                     |   49 +-
> > >  qemu-common.h                   |    2 +
> > >  qemu-options.hx                 |   25 +
> > >  qmp-commands.hx                 |   10 +-
> > >  savevm.c                        |   31 +-
> > >  scripts/update-linux-headers.sh |    2 +-
> > >  sysemu.h                        |    4 +-
> > >  umem.c                          |  379 ++++++++
> > >  umem.h                          |  105 +++
> > >  vl.c                            |   20 +-
> > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > >  create mode 100644 linux-headers/linux/umem.h
> > >  create mode 100644 migration-postcopy-stub.c
> > >  create mode 100644 migration-postcopy.c
> > >  create mode 100644 umem.c
> > >  create mode 100644 umem.h
> > >  
> > >  
> > >  
> >  
> > -- 
> > yamahata
> >  
> >  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] 回??: [PATCH 00/21][RFC] postcopy live?migration
@ 2012-01-16  6:51             ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-16  6:51 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

Thank you for your info.
I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
Your kernel enables KVM paravirt_ops, right?

Although I'm preparing the next path series including the fixes,
you can also try postcopy by disabling paravirt_ops or disabling kvm
(use tcg i.e. -machine accel:tcg).

thanks,


On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
>  
> Do you know what wchan the process was blocked at?
> kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
>  
> It's
> WCHAN              COMMAND
> umem_fault------qemu-system-x86
>  
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-12 16:54
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live?migration
> On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > Hi , I've dug more thess days
> >  
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> >  
> > There should be only single EOS line. Just copy & past miss?
> >  
> > There must be two EOS for one is coming from postcopy_outgoing_ram_save_live
> > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other is
> > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END)
> > I think in postcopy the ram_save_live in the iterate part can be ignore
> > so why there still have the qemu_put_byte(f, QEMU_VM_SECTON_PART) and
> > qemu_put_byte(f, QEMU_VM_SECTON_END) in the procedure? Is it essential?
>  
> Not so essential.
>  
> > Can you please track it down one more step?
> > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > block.(backtrace by the debugger would be best.)
> >
> > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never return
> > so it gets stuck
>  
> Do you know what wchan the process was blocked at?
> kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
>  
>  
> > when I check the EOS problem
> > I just annotated the qemu_put_byte(f, QEMU_VM_SECTION_PART);
>  and qemu_put_be32
> > (f, se->section_id)
> >  (I think this is a wrong way to fix it and I don't know how it get through)
> > and leave just the se->save_live_state in the qemu_savevm_state_iterate
> > it didn't get stuck at kvm_put_msrs()
> > but it has some other error
> > (qemu) migration-tcp: Attempting to start an incoming migration
> > migration-tcp: accepted migration
> > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > 2126:2126 postcopy_incoming_ram_load:1057: done
> > migration: successfully loaded vm state
> > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > Can't find block !
> > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > and at the same time , the destination node didn't show the EOS
> >  
> > so I still can't solve the stuck problem
> > Thanks for your help~!
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-11 10:45
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live migration
> > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > Hello all!
> >  
> > Hi, thank you for detailed report. The procedure you've tried looks
> > good basically. Some comments below.
> >  
> > > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > patched it correctly
> > > but it still didn't make sense and I got the same scenario as before
> > > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> > >  
> > >
>  I think I should show what I do more clearly and hope somebody can figure out
> > > the problem
> > > 
> > >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > > 
> > >        ./configure --target-list=
> > x86_64-softmmu --enable-kvm --enable-postcopy
> > > --enable-debug
> > >        make
> > >        make install
> > > 
> > >  ・ 2, outgoing qemu:
> > > 
> > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > > -machine accel=kvm
> > > incoming qemu:
> > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > > 
> > >  ・ 3, outgoing node:
> > > 
> > > migrate -d -p -n tcp:(incoming node ip):8888
> > >  
> > > result:
> > > 
> > >  ・ outgoing qemu:
> > > 
> > > info status: VM-status: paused (finish-migrate);
> > > 
> > >  ・ incoming qemu:
> > > 
> > > can't type any more and can't kill the process(qemu-system-x86)
> > >  
> > > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > 
> > >  ・ outgoing qemu:
> > > 
> > > (qemu) migration-tcp: connect completed
> > > migration: beginning savevm
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > migration: iterate
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > migration: done iterating
> > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > 
> > >  ・ incoming qemu:
> > > 
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> >  
> > There should be only single EOS line. Just copy & past miss?
> >  
> >  
> > > from the result:
> > > It didn't get to the "successfully loaded vm state"
> > > So it still in the qemu_loadvm_state, and I found it's in
> > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> > > stuck
> >  
> > Can you please track it down one more step?
> > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > block.(backtrace by the debugger would be best.)
> >  
> > If possible, can you please test with more simplified configuration.
> > i.e. drop device as much as possible i.e. no usbdevice, no disk...
> > So the debug will be simplified.
> >  
> > thanks,
> >  
> > > Does anyone give some advises on the problem?
> > > Thanks very much~
> > >  
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━
> > ━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2011-12-29 09:25
> > > To: kvm; qemu-devel
> > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > Intro
> > > =====
> > > This patch series implements postcopy live migration.[1]
> > > As discussed at KVM forum 2011, dedicated character device is used for
> > > distributed shared memory between migration source and destination.
> > > Now we can discuss/benchmark/compare with precopy. I believe there are
> > > much rooms for improvement.
> > >  
> > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > >  
> > >  
> > > Usage
> > > =====
> > > You need load umem character device on the host before starting migration.
> > > Postcopy can be used for tcg and kvm accelarator. The implementation depend
> > > on only linux umem character device. But the driver dependent code is split
> > > into a file.
> > > I tested only host page size ==
>  guest page size case, but the implementation
> > > allows host page size != guest page size case.
> > >  
> > > The following options are added with this patch series.
> > > - incoming part
> > >   command line options
> > >   -postcopy [-postcopy-flags <flags>]
> > >   where flags is for changing behavior for benchmark/debugging
> > >   Currently the following flags are available
> > >   0: default
> > >   1: enable touching page request
> > >  
> > >   example:
> > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > >  
> > > - outging part
> > >   options for migrate command 
> > >   migrate [-p [-n]] URI
> > >   -p: indicate postcopy migration
> > >   -n: disable background transferring pages: This is for benchmark/
> debugging
> > >  
> > >   example:
> > >   migrate -p -n tcp:<dest ip address>:4444
> > >  
> > >  
> > > TODO
> > > ====
> > > - benchmark/evaluation. Especially how async page fault affects the result.
> > > - improve/optimization
> > >   At the moment at least what I'm aware of is
> > >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> > >     creating dedicated thread?
> > >   - making incoming socket non-blocking
> > >   - outgoing handler seems suboptimal causing latency.
> > > - catch up memory API change
> > > - consider on FUSE/CUSE possibility
> > > - and more...
> > >  
> > > basic postcopy work flow
> > > ========================
> > >         qemu on the destination
> > >               |
> > >               V
> > >         open(/dev/umem)
> > >               |
> > >               V
> > >         UMEM_DEV_CREATE_UMEM
> > >               |
> > >               V
> > >         Here we have two file descriptors to
> > >         umem device and shmem file
> > >               |
> > >               |                                  umemd
> > >               |                                  daemon on the destination
> > >               |
> > >               V    create pipe to communicate
> > >         fork()---------------------------------------,
> > >               |                                      |
> > >               V                                      |
> > >         close(socket)                                V
> > >         close(shmem)                              mmap(shmem file)
> > >               |                                      |
> > >               V                                      V
> > >         mmap(umem device) for guest RAM           close(shmem file)
> > >               |                                      |
> > >         close(umem device)                           |
> > >               |                                      |
> > >               V                                      |
> > >         wait for ready from daemon <----pipe-----send ready message
> > >               |                                      |
> > >               |                                 Here the daemon takes over
> > >         send ok------------pipe---------------> ownership of the socket
> > >               |         to the source
> > >               V                                      |
> > >         entering post copy stage                     |
> > >         start guest execution                        |
> > >               |                                      |
> > >               V                                      V
> > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > >               |                                      |
> > >               V                                      V
> > >         page fault ------------------------------>page offset is returned
> > >         block                                        |
> > >                                                      V
> > >                                                   pull page from the source
> > >                                                   write the page contents
> > >                                                   to the shmem.
> > >                                                      |
> > >                                                      V
> > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > >         the fault handler returns the page
> > >         page fault is resolved
> > >               |
> > >               |                                   pages can be sent
> > >               |                                   in the background
> > >               |                                      |
> > >               |                                      V
> > >               |                                   UMEM_MARK_PAGE_CACHED
> > >               |                                      |
> > >               V                                      V
> > >         The specified pages<-----pipe------------request to touch pages
> > >         are made present by                          |
> > >         touching guest RAM.                          |
> > >               |                                      |
> > >               V                                      V
> > >              reply-------------pipe-------------> release the cached page
> > >               |                                   madvise(MADV_REMOVE)
> > >               |                                      |
> > >               V                                      V
> > >  
> > >                  all the pages are pulled from the source
> > >  
> > >               |                                      |
> > >               V                                      V
> > >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> > >        (note: I'm not sure if this can be implemented or not)
> > >               |                                      |
> > >               V                                      V
> > >         migration completes                        exit()
> > >  
> > >  
> > >  
> > > Isaku Yamahata (21):
> > >   arch_init: export sort_ram_list() and ram_save_block()
> > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > >   arch_init/ram_save: introduce constant for ram save version = 4
> > >   arch_init: refactor host_from_stream_offset()
> > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > >   arch_init: refactor ram_save_block()
> > >   arch_init/ram_save_live: factor out ram_save_limit
> > >   arch_init/ram_load: refactor ram_load
> > >   exec.c: factor out qemu_get_ram_ptr()
> > >   exec.c: export last_ram_offset()
> > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > >   savevm: qemu_pending_size() to return pending buffered size
> > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > >     file
> > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > >   migration: factor out parameters into MigrationParams
> > >   umem.h: import Linux umem.h
> > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > >   configure: add CONFIG_POSTCOPY option
> > >   postcopy: introduce -postcopy and -postcopy-flags option
> > >   postcopy outgoing: add -p and -n option to migrate command
> > >   postcopy: implement postcopy livemigration
> > >  
> > >  Makefile.target                 |    4 +
> > >  arch_init.c                     |  260 ++++---
> > >  arch_init.h                     |   20 +
> > >  block-migration.c               |    8 +-
> > >  buffered_file.c                 |   20 +-
> > >  buffered_file.h                 |    1 +
> > >  configure                       |   12 +
> > >  cpu-all.h                       |    9 +
> > >  exec-obsolete.h                 |    1 +
> > >  exec.c                          |   75 +-
> > >  hmp-commands.hx                 |   12 +-
> > >  hw/hw.h                         |    7 +-
> > >  linux-headers/linux/umem.h      |   83 ++
> > >  migration-exec.c                |    8 +
> > >  migration-fd.c                  |   30 +
> > >  migration-postcopy-stub.c       |   77 ++
> > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > >  migration-tcp.c                 |   37 +-
> > >  migration-unix.c                |   32 +-
> > >  migration.c                     |   53 +-
> > >  migration.h                     |   49 +-
> > >  qemu-common.h                   |    2 +
> > >  qemu-options.hx                 |   25 +
> > >  qmp-commands.hx                 |   10 +-
> > >  savevm.c                        |   31 +-
> > >  scripts/update-linux-headers.sh |    2 +-
> > >  sysemu.h                        |    4 +-
> > >  umem.c                          |  379 ++++++++
> > >  umem.h                          |  105 +++
> > >  vl.c                            |   20 +-
> > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > >  create mode 100644 linux-headers/linux/umem.h
> > >  create mode 100644 migration-postcopy-stub.c
> > >  create mode 100644 migration-postcopy.c
> > >  create mode 100644 umem.c
> > >  create mode 100644 umem.h
> > >  
> > >  
> > >  

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-16  6:51             ` [Qemu-devel] " Isaku Yamahata
@ 2012-01-16 10:17               ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-16 10:17 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> Thank you for your info.
> I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> Your kernel enables KVM paravirt_ops, right?
> 
> Although I'm preparing the next patch series including the fixes,
> you can also try postcopy by disabling paravirt_ops or disabling kvm
> (use tcg, i.e. -machine accel=tcg).

Disabling the KVM pv clock would be OK.
Passing no-kvmclock on the guest kernel command line disables it.
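Concretely (a sketch, not from the patch series): the parameter goes on the guest kernel command line, and the effect can be verified from inside the guest through sysfs.

```shell
# Append the parameter to the guest kernel command line, e.g. in grub:
#     linux /vmlinuz-3.1.7 root=/dev/sda1 ... no-kvmclock
# Afterwards, inside the guest, kvm-clock should no longer be the
# current clocksource:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```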

> thanks,
> 
> 
> On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> > It's
> > WCHAN              COMMAND
> > umem_fault------qemu-system-x86
> >  
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-12 16:54
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > Hi, I've dug more these days
> > >  
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only a single EOS line. Just a copy & paste mistake?
> > >  
> > > There must be two EOS lines: one comes from postcopy_outgoing_ram_save_live
> > > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> > > I think in postcopy the ram_save_live in the iterate part can be ignored,
> > > so why are there still the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > > qemu_put_byte(f, QEMU_VM_SECTION_END) in the procedure? Is it essential?
> >  
> > Not so essential.
> >  
> > > Can you please track it down one more step?
> > > Which line did it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't
> > > seem to block. (A backtrace from the debugger would be best.)
> > >
> > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never
> > > returns, so it gets stuck
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> >  
> > > when I checked the EOS problem
> > > I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART); and
> > > qemu_put_be32(f, se->section_id)
> > > (I think this is a wrong way to fix it and I don't know how it gets through)
> > > and left just the se->save_live_state in qemu_savevm_state_iterate
> > > it didn't get stuck at kvm_put_msrs()
> > > but it had some other error
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > migration: successfully loaded vm state
> > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > Can't find block !
> > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > and at the same time, the destination node didn't show the EOS
> > >  
> > > so I still can't solve the stuck problem
> > > Thanks for your help~!
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-11 10:45
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > Hello all!
> > >  
> > > Hi, thank you for the detailed report. The procedure you've tried basically
> > > looks good. Some comments below.
> > >  
> > > > I got the qemu base version (03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > > patched it correctly
> > > > but it still didn't work and I got the same scenario as before
> > > > outgoing node: intel x86_64; incoming node: amd x86_64. guest image is on nfs
> > > >  
> > > > I think I should show what I do more clearly and hope somebody can figure
> > > > out the problem
> > > > 
> > > >  ・ 1, both in/out nodes: patch qemu and start on a 3.1.7 kernel with umem
> > > > 
> > > >        ./configure --target-list=x86_64-softmmu --enable-kvm \
> > > >            --enable-postcopy --enable-debug
> > > >        make
> > > >        make install
> > > > 
> > > >  ・ 2, outgoing qemu:
> > > > 
> > > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc :2 -usbdevice tablet
> > > > -machine accel=kvm
> > > > incoming qemu:
> > > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > > stdio -vnc :2 -usbdevice tablet -machine accel=kvm
> > > > 
> > > >  ・ 3, outgoing node:
> > > > 
> > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > >  
> > > > result:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > info status: VM-status: paused (finish-migrate);
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > I can't type anything any more and can't kill the process (qemu-system-x86)
> > > >  
> > > > I enabled the debug flags in migration.c, migration-tcp.c, migration-postcopy.c:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > (qemu) migration-tcp: connect completed
> > > > migration: beginning savevm
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > migration: iterate
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > migration: done iterating
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only a single EOS line. Just a copy & paste mistake?
> > >  
> > >  
> > > > from the result:
> > > > It didn't get to the "successfully loaded vm state"
> > > > So it's still in qemu_loadvm_state, and I found it got stuck in
> > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs
> > >  
> > > Can you please track it down one more step?
> > > Which line did it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't
> > > seem to block. (A backtrace from the debugger would be best.)
> > >  
> > > If possible, can you please test with a more simplified configuration,
> > > i.e. drop devices as much as possible (no usbdevice, no disk...),
> > > so the debugging will be simpler.
> > >  
> > > thanks,
> > >  
> > > > Can anyone give some advice on the problem?
> > > > Thanks very much~
> > > >  
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > > Tommy
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2011-12-29 09:25
> > > > To: kvm; qemu-devel
> > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > [...]

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] 回??: [PATCH 00/21][RFC] postcopy live?migration
@ 2012-01-16 10:17               ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-01-16 10:17 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> Thank you for your info.
> I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> Your kernel enables KVM paravirt_ops, right?
> 
> Although I'm preparing the next path series including the fixes,
> you can also try postcopy by disabling paravirt_ops or disabling kvm
> (use tcg i.e. -machine accel:tcg).

Disabling KVM pv clock would be ok.
Passing no-kvmclock to guest kernel disables it.

> thanks,
> 
> 
> On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> > It's
> > WCHAN              COMMAND
> > umem_fault------qemu-system-x86
> >  
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-12 16:54
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live?migration
> > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > Hi , I've dug more thess days
> > >  
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only single EOS line. Just copy & past miss?
> > >  
> > > There must be two EOS for one is coming from postcopy_outgoing_ram_save_live
> > > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other is
> > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END)
> > > I think in postcopy the ram_save_live in the iterate part can be ignore
> > > so why there still have the qemu_put_byte(f, QEMU_VM_SECTON_PART) and
> > > qemu_put_byte(f, QEMU_VM_SECTON_END) in the procedure? Is it essential?
> >  
> > Not so essential.
> >  
> > > Can you please track it down one more step?
> > > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > > block.(backtrace by the debugger would be best.)
> > >
> > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never return
> > > so it gets stuck
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> >  
> > > when I check the EOS problem
> > > I just annotated the qemu_put_byte(f, QEMU_VM_SECTION_PART);
> >  and qemu_put_be32
> > > (f, se->section_id)
> > >  (I think this is a wrong way to fix it and I don't know how it get through)
> > > and leave just the se->save_live_state in the qemu_savevm_state_iterate
> > > it didn't get stuck at kvm_put_msrs()
> > > but it has some other error
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > migration: successfully loaded vm state
> > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > Can't find block !
> > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > and at the same time , the destination node didn't show the EOS
> > >  
> > > so I still can't solve the stuck problem
> > > Thanks for your help~!
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > ━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-11 10:45
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live migration
> > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > Hello all!
> > >  
> > > Hi, thank you for detailed report. The procedure you've tried looks
> > > good basically. Some comments below.
> > >  
> > > > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > > patched it correctly
> > > > but it still didn't make sense and I got the same scenario as before
> > > > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> > > >  
> > > >
> >  I think I should show what I do more clearly and hope somebody can figure out
> > > > the problem
> > > > 
> > > >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > > > 
> > > >        ./configure --target-list=
> > > x86_64-softmmu --enable-kvm --enable-postcopy
> > > > --enable-debug
> > > >        make
> > > >        make install
> > > > 
> > > >  ・ 2, outgoing qemu:
> > > > 
> > > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > > > -machine accel=kvm
> > > > incoming qemu:
> > > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > > > 
> > > >  ・ 3, outgoing node:
> > > > 
> > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > >  
> > > > result:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > info status: VM-status: paused (finish-migrate);
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > can't type any more and can't kill the process(qemu-system-x86)
> > > >  
> > > > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > (qemu) migration-tcp: connect completed
> > > > migration: beginning savevm
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > migration: iterate
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > migration: done iterating
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only a single EOS line. Is the duplicate just a copy & paste mistake?
> > >  
> > >  
> > > > from the result:
> > > > It didn't get to the "successfully loaded vm state"
> > > > So it still in the qemu_loadvm_state, and I found it's in
> > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> > > > stuck
> > >  
> > > Can you please track it down one more step?
> > > Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem
> > > to block. (A backtrace from the debugger would be best.)
> > >  
> > > If possible, can you please test with a more simplified configuration,
> > > i.e. drop devices as much as possible (no usbdevice, no disk...),
> > > so the debugging will be simpler.
> > >  
> > > thanks,
> > >  
> > > > Does anyone give some advises on the problem?
> > > > Thanks very much~
> > > >  
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > > Tommy
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2011-12-29 09:25
> > > > To: kvm; qemu-devel
> > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > Intro
> > > > =====
> > > > This patch series implements postcopy live migration.[1]
> > > > As discussed at KVM Forum 2011, a dedicated character device is used for
> > > > distributed shared memory between the migration source and destination.
> > > > Now we can discuss/benchmark/compare with precopy. I believe there is
> > > > much room for improvement.
> > > >  
> > > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > > >  
> > > >  
> > > > Usage
> > > > =====
> > > > You need to load the umem character device on the host before starting migration.
> > > > Postcopy can be used with the tcg and kvm accelerators. The implementation
> > > > depends only on the Linux umem character device, but the driver-dependent code
> > > > is split into a separate file.
> > > > I tested only the host page size == guest page size case, but the implementation
> > > > allows the host page size != guest page size case.
> > > >  
> > > > The following options are added with this patch series.
> > > > - incoming part
> > > >   command line options
> > > >   -postcopy [-postcopy-flags <flags>]
> > > >   where flags is for changing behavior for benchmark/debugging
> > > >   Currently the following flags are available
> > > >   0: default
> > > >   1: enable touching page request
> > > >  
> > > >   example:
> > > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > > >  
> > > > - outgoing part
> > > >   options for migrate command 
> > > >   migrate [-p [-n]] URI
> > > >   -p: indicate postcopy migration
> > > >   -n: disable transferring pages in the background (for benchmarking/debugging)
> > > >  
> > > >   example:
> > > >   migrate -p -n tcp:<dest ip address>:4444
> > > >  
> > > >  
> > > > TODO
> > > > ====
> > > > - benchmark/evaluation. Especially how async page fault affects the result.
> > > > - improve/optimization
> > > >   At the moment at least what I'm aware of is
> > > >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> > > >     creating dedicated thread?
> > > >   - making incoming socket non-blocking
> > > >   - outgoing handler seems suboptimal causing latency.
> > > > - catch up memory API change
> > > > - consider on FUSE/CUSE possibility
> > > > - and more...
> > > >  
> > > > basic postcopy work flow
> > > > ========================
> > > >         qemu on the destination
> > > >               |
> > > >               V
> > > >         open(/dev/umem)
> > > >               |
> > > >               V
> > > >         UMEM_DEV_CREATE_UMEM
> > > >               |
> > > >               V
> > > >         Here we have two file descriptors to
> > > >         umem device and shmem file
> > > >               |
> > > >               |                                  umemd
> > > >               |                                  daemon on the destination
> > > >               |
> > > >               V    create pipe to communicate
> > > >         fork()---------------------------------------,
> > > >               |                                      |
> > > >               V                                      |
> > > >         close(socket)                                V
> > > >         close(shmem)                              mmap(shmem file)
> > > >               |                                      |
> > > >               V                                      V
> > > >         mmap(umem device) for guest RAM           close(shmem file)
> > > >               |                                      |
> > > >         close(umem device)                           |
> > > >               |                                      |
> > > >               V                                      |
> > > >         wait for ready from daemon <----pipe-----send ready message
> > > >               |                                      |
> > > >               |                                 Here the daemon takes over 
> > > >         send ok------------pipe---------------> the owner of the socket    
> > > >               |         to the source              
> > > >               V                                      |
> > > >         entering post copy stage                     |
> > > >         start guest execution                        |
> > > >               |                                      |
> > > >               V                                      V
> > > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > > >               |                                      |
> > > >               V                                      V
> > > >         page fault ------------------------------>page offset is returned
> > > >         block                                        |
> > > >                                                      V
> > > >                                                   pull page from the source
> > > >                                                   write the page contents
> > > >                                                   to the shmem.
> > > >                                                      |
> > > >                                                      V
> > > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > > >         the fault handler returns the page
> > > >         page fault is resolved
> > > >               |
> > > >               |                                   pages can be sent
> > > >               |                                   in the background
> > > >               |                                      |
> > > >               |                                      V
> > > >               |                                   UMEM_MARK_PAGE_CACHED
> > > >               |                                      |
> > > >               V                                      V
> > > >         The specified pages<-----pipe------------request to touch pages
> > > >         are made present by                          |
> > > >         touching guest RAM.                          |
> > > >               |                                      |
> > > >               V                                      V
> > > >              reply-------------pipe-------------> release the cached page
> > > >               |                                   madvise(MADV_REMOVE)
> > > >               |                                      |
> > > >               V                                      V
> > > >  
> > > >                  all the pages are pulled from the source
> > > >  
> > > >               |                                      |
> > > >               V                                      V
> > > >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> > > >        (note: I'm not sure if this can be implemented or not)
> > > >               |                                      |
> > > >               V                                      V
> > > >         migration completes                        exit()
> > > >  
> > > >  
> > > >  
> > > > Isaku Yamahata (21):
> > > >   arch_init: export sort_ram_list() and ram_save_block()
> > > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > > >   arch_init/ram_save: introduce constant for ram save version = 4
> > > >   arch_init: refactor host_from_stream_offset()
> > > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > > >   arch_init: refactor ram_save_block()
> > > >   arch_init/ram_save_live: factor out ram_save_limit
> > > >   arch_init/ram_load: refactor ram_load
> > > >   exec.c: factor out qemu_get_ram_ptr()
> > > >   exec.c: export last_ram_offset()
> > > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > > >   savevm: qemu_pending_size() to return pending buffered size
> > > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > > >     file
> > > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > > >   migration: factor out parameters into MigrationParams
> > > >   umem.h: import Linux umem.h
> > > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > > >   configure: add CONFIG_POSTCOPY option
> > > >   postcopy: introduce -postcopy and -postcopy-flags option
> > > >   postcopy outgoing: add -p and -n option to migrate command
> > > >   postcopy: implement postcopy livemigration
> > > >  
> > > >  Makefile.target                 |    4 +
> > > >  arch_init.c                     |  260 ++++---
> > > >  arch_init.h                     |   20 +
> > > >  block-migration.c               |    8 +-
> > > >  buffered_file.c                 |   20 +-
> > > >  buffered_file.h                 |    1 +
> > > >  configure                       |   12 +
> > > >  cpu-all.h                       |    9 +
> > > >  exec-obsolete.h                 |    1 +
> > > >  exec.c                          |   75 +-
> > > >  hmp-commands.hx                 |   12 +-
> > > >  hw/hw.h                         |    7 +-
> > > >  linux-headers/linux/umem.h      |   83 ++
> > > >  migration-exec.c                |    8 +
> > > >  migration-fd.c                  |   30 +
> > > >  migration-postcopy-stub.c       |   77 ++
> > > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > > >  migration-tcp.c                 |   37 +-
> > > >  migration-unix.c                |   32 +-
> > > >  migration.c                     |   53 +-
> > > >  migration.h                     |   49 +-
> > > >  qemu-common.h                   |    2 +
> > > >  qemu-options.hx                 |   25 +
> > > >  qmp-commands.hx                 |   10 +-
> > > >  savevm.c                        |   31 +-
> > > >  scripts/update-linux-headers.sh |    2 +-
> > > >  sysemu.h                        |    4 +-
> > > >  umem.c                          |  379 ++++++++
> > > >  umem.h                          |  105 +++
> > > >  vl.c                            |   20 +-
> > > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > > >  create mode 100644 linux-headers/linux/umem.h
> > > >  create mode 100644 migration-postcopy-stub.c
> > > >  create mode 100644 migration-postcopy.c
> > > >  create mode 100644 umem.c
> > > >  create mode 100644 umem.h
> > > >  
> > > >  
> > > >  
> > >  
> > > -- 
> > > yamahata
> > >  
> > >  
> >  
> > -- 
> > yamahata
> >  
> >  
> 
> -- 
> yamahata
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 00/21][RFC] postcopy live migration
  2012-01-16 10:17               ` [Qemu-devel] " Isaku Yamahata
@ 2012-03-12  8:36                 ` thfbjyddx
  -1 siblings, 0 replies; 88+ messages in thread
From: thfbjyddx @ 2012-03-12  8:36 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


[-- Attachment #1.1: Type: text/plain, Size: 20280 bytes --]

Hi,
Thank you for your reply!
I've tried with -machine accel=tcg and got the results below:

source node: (screenshot attached)

destination node: (screenshot attached)

and sometimes the last line shows: (screenshot attached)


I'm wondering why umemd.mig_read_fd can become readable again without the destination node sending a page_req.
From the last line on the source node, we can see it doesn't get a page request, but the destination node reads umemd.mig_read_fd again.
I think umemd.mig_read_fd should only become readable when the source node sends something on the socket. Is there any other situation?

BTW, can you tell me how the KVM pv clock makes the patches work incorrectly?

Thanks




Tommy

From: Isaku Yamahata
Date: 2012-01-16 18:17
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> Thank you for your info.
> I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> Your kernel enables KVM paravirt_ops, right?
> 
> Although I'm preparing the next patch series including the fixes,
> you can also try postcopy by disabling paravirt_ops or disabling kvm
> (use tcg, i.e. -machine accel=tcg).

Disabling the KVM pv clock would be OK.
Passing no-kvmclock on the guest kernel command line disables it.
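[Editor's note: for reference, the no-kvmclock parameter mentioned above goes on the guest kernel's boot command line, e.g. via the guest's GRUB configuration; the other flags shown here are purely illustrative.]

```shell
# Guest /etc/default/grub -- append no-kvmclock to the kernel command line,
# then regenerate grub.cfg (update-grub or grub2-mkconfig) and reboot.
GRUB_CMDLINE_LINUX="console=ttyS0 no-kvmclock"
```

Inside the guest, `cat /proc/cmdline` confirms the parameter took effect.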

> thanks,
> 
> 
> On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> > It's
> > WCHAN              COMMAND
> > umem_fault------qemu-system-x86
> >  
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-12 16:54
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > Hi, I've dug in more these days
> > >  
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only a single EOS line. Is the duplicate just a copy & paste mistake?
> > >  
> > > There must be two EOS lines: one comes from postcopy_outgoing_ram_save_live
> > > (stage == QEMU_SAVE_LIVE_STAGE_PART) and the other from
> > > postcopy_outgoing_ram_save_live (stage == QEMU_SAVE_LIVE_STAGE_END).
> > > I think in postcopy the ram_save_live in the iterate part can be ignored,
> > > so why are the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > > qemu_put_byte(f, QEMU_VM_SECTION_END) calls still in the procedure? Are they essential?
> >  
> > Not so essential.
> >  
> > > Can you please track it down one more step?
> > > Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem
> > > to block. (A backtrace from the debugger would be best.)
> > >
> > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never returns,
> > > so it gets stuck
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> >  
> > > When I checked the EOS problem,
> > > I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > > qemu_put_be32(f, se->section_id) calls
> > > (I think this is a wrong way to fix it and I don't know how it gets through)
> > > and left just the se->save_live_state in qemu_savevm_state_iterate.
> > > It didn't get stuck at kvm_put_msrs(),
> > > but it hit some other error:
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > migration: successfully loaded vm state
> > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > Can't find block !
> > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > and at the same time, the destination node didn't show the EOS
> > >  
> > > so I still can't solve the stuck problem
> > > Thanks for your help~!
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-11 10:45
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel] Re: [PATCH 00/21][RFC] postcopy live migration
> > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > Hello all!
> > >  
> > > Hi, thank you for the detailed report. The procedure you've tried basically
> > > looks good. Some comments below.
> > >  
> > > > I got the qemu base version (03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > > patched it correctly,
> > > > but it still didn't work and I got the same scenario as before.
> > > > outgoing node: intel x86_64; incoming node: amd x86_64. guest image is on nfs
> > > >  
> > > > I think I should show what I do more clearly and hope somebody can figure out
> > > > the problem
> > > > 
> > > >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > > > 
> > > >        ./configure --target-list=x86_64-softmmu --enable-kvm \
> > > >            --enable-postcopy --enable-debug
> > > >        make
> > > >        make install
> > > > 
> > > >  ・ 2, outgoing qemu:
> > > > 
> > > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc :2 -usbdevice tablet
> > > > -machine accel=kvm
> > > > incoming qemu:
> > > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > > stdio -vnc :2 -usbdevice tablet -machine accel=kvm
> > > > 
> > > >  ・ 3, outgoing node:
> > > > 
> > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > >  
> > > > result:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > info status: VM-status: paused (finish-migrate);
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > can't type any more and can't kill the process(qemu-system-x86)
> > > >  
> > > > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > (qemu) migration-tcp: connect completed
> > > > migration: beginning savevm
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > migration: iterate
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > migration: done iterating
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only a single EOS line. Is the duplicate just a copy & paste mistake?
> > >  
> > >  
> > > > from the result:
> > > > It didn't get to "successfully loaded vm state",
> > > > so it is still in qemu_loadvm_state, and I found it's in
> > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs where it
> > > > got stuck
> > >  
> > > Can you please track it down one more step?
> > > Which line does it get stuck at in kvm_put_msrs()? kvm_put_msrs() doesn't seem
> > > to block. (A backtrace from the debugger would be best.)
> > >  
> > > If possible, can you please test with a more simplified configuration,
> > > i.e. drop devices as much as possible (no usbdevice, no disk...),
> > > so the debugging will be simpler.
> > >  
> > > thanks,
> > >  
> > > > Does anyone give some advises on the problem?
> > > > Thanks very much~
> > > >  
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > > Tommy
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2011-12-29 09:25
> > > > To: kvm; qemu-devel
> > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > Intro
> > > > =====
> > > > This patch series implements postcopy live migration.[1]
> > > > As discussed at KVM Forum 2011, a dedicated character device is used for
> > > > distributed shared memory between the migration source and destination.
> > > > Now we can discuss/benchmark/compare with precopy. I believe there is
> > > > much room for improvement.
> > > >  
> > > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > > >  
> > > >  
> > > > Usage
> > > > =====
> > > > You need to load the umem character device on the host before starting migration.
> > > > Postcopy can be used with the tcg and kvm accelerators. The implementation
> > > > depends only on the Linux umem character device, but the driver-dependent code
> > > > is split into a separate file.
> > > > I tested only the host page size == guest page size case, but the implementation
> > > > allows the host page size != guest page size case.
> > > >  
> > > > The following options are added with this patch series.
> > > > - incoming part
> > > >   command line options
> > > >   -postcopy [-postcopy-flags <flags>]
> > > >   where flags is for changing behavior for benchmark/debugging
> > > >   Currently the following flags are available
> > > >   0: default
> > > >   1: enable touching page request
> > > >  
> > > >   example:
> > > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > > >  
> > > > - outgoing part
> > > >   options for migrate command 
> > > >   migrate [-p [-n]] URI
> > > >   -p: indicate postcopy migration
> > > >   -n: disable transferring pages in the background (for benchmarking/debugging)
> > > >  
> > > >   example:
> > > >   migrate -p -n tcp:<dest ip address>:4444
> > > >  
> > > >  
> > > > TODO
> > > > ====
> > > > - benchmark/evaluation. Especially how async page fault affects the result.
> > > > - improve/optimization
> > > >   At the moment at least what I'm aware of is
> > > >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> > > >     creating dedicated thread?
> > > >   - making incoming socket non-blocking
> > > >   - outgoing handler seems suboptimal causing latency.
> > > > - catch up memory API change
> > > > - consider on FUSE/CUSE possibility
> > > > - and more...
> > > >  
> > > > basic postcopy work flow
> > > > ========================
> > > >         qemu on the destination
> > > >               |
> > > >               V
> > > >         open(/dev/umem)
> > > >               |
> > > >               V
> > > >         UMEM_DEV_CREATE_UMEM
> > > >               |
> > > >               V
> > > >         Here we have two file descriptors to
> > > >         umem device and shmem file
> > > >               |
> > > >               |                                  umemd
> > > >               |                                  daemon on the destination
> > > >               |
> > > >               V    create pipe to communicate
> > > >         fork()---------------------------------------,
> > > >               |                                      |
> > > >               V                                      |
> > > >         close(socket)                                V
> > > >         close(shmem)                              mmap(shmem file)
> > > >               |                                      |
> > > >               V                                      V
> > > >         mmap(umem device) for guest RAM           close(shmem file)
> > > >               |                                      |
> > > >         close(umem device)                           |
> > > >               |                                      |
> > > >               V                                      |
> > > >         wait for ready from daemon <----pipe-----send ready message
> > > >               |                                      |
> > > >               |                                 Here the daemon takes over 
> > > >         send ok------------pipe---------------> the owner of the socket    
> > > >               |         to the source              
> > > >               V                                      |
> > > >         entering post copy stage                     |
> > > >         start guest execution                        |
> > > >               |                                      |
> > > >               V                                      V
> > > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > > >               |                                      |
> > > >               V                                      V
> > > >         page fault ------------------------------>page offset is returned
> > > >         block                                        |
> > > >                                                      V
> > > >                                                   pull page from the source
> > > >                                                   write the page contents
> > > >                                                   to the shmem.
> > > >                                                      |
> > > >                                                      V
> > > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > > >         the fault handler returns the page
> > > >         page fault is resolved
> > > >               |
> > > >               |                                   pages can be sent
> > > >               |                                   in the background
> > > >               |                                      |
> > > >               |                                      V
> > > >               |                                   UMEM_MARK_PAGE_CACHED
> > > >               |                                      |
> > > >               V                                      V
> > > >         The specified pages<-----pipe------------request to touch pages
> > > >         are made present by                          |
> > > >         touching guest RAM.                          |
> > > >               |                                      |
> > > >               V                                      V
> > > >              reply-------------pipe-------------> release the cached page
> > > >               |                                   madvise(MADV_REMOVE)
> > > >               |                                      |
> > > >               V                                      V
> > > >  
> > > >                  all the pages are pulled from the source
> > > >  
> > > >               |                                      |
> > > >               V                                      V
> > > >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> > > >        (note: I'm not sure if this can be implemented or not)
> > > >               |                                      |
> > > >               V                                      V
> > > >         migration completes                        exit()
> > > >  
> > > >  
> > > >  
> > > > Isaku Yamahata (21):
> > > >   arch_init: export sort_ram_list() and ram_save_block()
> > > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > > >   arch_init/ram_save: introduce constant for ram save version = 4
> > > >   arch_init: refactor host_from_stream_offset()
> > > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > > >   arch_init: refactor ram_save_block()
> > > >   arch_init/ram_save_live: factor out ram_save_limit
> > > >   arch_init/ram_load: refactor ram_load
> > > >   exec.c: factor out qemu_get_ram_ptr()
> > > >   exec.c: export last_ram_offset()
> > > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > > >   savevm: qemu_pending_size() to return pending buffered size
> > > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > > >     file
> > > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > > >   migration: factor out parameters into MigrationParams
> > > >   umem.h: import Linux umem.h
> > > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > > >   configure: add CONFIG_POSTCOPY option
> > > >   postcopy: introduce -postcopy and -postcopy-flags option
> > > >   postcopy outgoing: add -p and -n option to migrate command
> > > >   postcopy: implement postcopy livemigration
> > > >  
> > > >  Makefile.target                 |    4 +
> > > >  arch_init.c                     |  260 ++++---
> > > >  arch_init.h                     |   20 +
> > > >  block-migration.c               |    8 +-
> > > >  buffered_file.c                 |   20 +-
> > > >  buffered_file.h                 |    1 +
> > > >  configure                       |   12 +
> > > >  cpu-all.h                       |    9 +
> > > >  exec-obsolete.h                 |    1 +
> > > >  exec.c                          |   75 +-
> > > >  hmp-commands.hx                 |   12 +-
> > > >  hw/hw.h                         |    7 +-
> > > >  linux-headers/linux/umem.h      |   83 ++
> > > >  migration-exec.c                |    8 +
> > > >  migration-fd.c                  |   30 +
> > > >  migration-postcopy-stub.c       |   77 ++
> > > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > > >  migration-tcp.c                 |   37 +-
> > > >  migration-unix.c                |   32 +-
> > > >  migration.c                     |   53 +-
> > > >  migration.h                     |   49 +-
> > > >  qemu-common.h                   |    2 +
> > > >  qemu-options.hx                 |   25 +
> > > >  qmp-commands.hx                 |   10 +-
> > > >  savevm.c                        |   31 +-
> > > >  scripts/update-linux-headers.sh |    2 +-
> > > >  sysemu.h                        |    4 +-
> > > >  umem.c                          |  379 ++++++++
> > > >  umem.h                          |  105 +++
> > > >  vl.c                            |   20 +-
> > > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > > >  create mode 100644 linux-headers/linux/umem.h
> > > >  create mode 100644 migration-postcopy-stub.c
> > > >  create mode 100644 migration-postcopy.c
> > > >  create mode 100644 umem.c
> > > >  create mode 100644 umem.h
> > > >  
> > > >  
> > > >  
> > >  
> > > -- 
> > > yamahata
> > >  
> > >  
> >  
> > -- 
> > yamahata
> >  
> >  
> 
> -- 
> yamahata
> 

-- 
yamahata

[-- Attachment #1.2: Type: text/html, Size: 62659 bytes --]

[-- Attachment #2: Catch.jpg --]
[-- Type: image/jpeg, Size: 169899 bytes --]

[-- Attachment #3: CatchE90F.jpg --]
[-- Type: image/jpeg, Size: 256595 bytes --]

[-- Attachment #4: Catch89D3.jpg --]
[-- Type: image/jpeg, Size: 60718 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
@ 2012-03-12  8:36                 ` thfbjyddx
  0 siblings, 0 replies; 88+ messages in thread
From: thfbjyddx @ 2012-03-12  8:36 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


[-- Attachment #1.1: Type: text/plain, Size: 20280 bytes --]

Hi,
Thank you for your reply!
I've tried with -machine accel=tcg and I got the output below:

src node:
(see attached screenshot)

des node:
(see attached screenshot)

and sometimes the last line got:
(see attached screenshot)
I'm wondering why umemd.mig_read_fd can become readable again without the des node's page_req.
From the last line on the src node, we can see it doesn't get a page request, but the des node reads umemd.mig_read_fd again.
I think umemd.mig_read_fd should only become readable when the src node sends something to the socket. Is there any other situation?
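As a general POSIX point (independent of the umem patches themselves), a connected socket fd polls as readable in more cases than "the peer sent data": a peer close (EOF) and a pending socket error also wake select(), with the subsequent read() returning 0 or -1. A minimal sketch using a socketpair:

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Poll fd with a zero timeout; returns 1 if a read would not block. */
static int fd_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return select(fd + 1, &rfds, NULL, NULL, &tv) == 1;
}
```

So besides the source sending data, a half-close or error on the migration socket would also make the fd "readable" again.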

BTW, can you tell me how the KVM pv clock makes the patches work incorrectly?

Thanks




Tommy

From: Isaku Yamahata
Date: 2012-01-16 18:17
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> Thank you for your info.
> I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> Your kernel enables KVM paravirt_ops, right?
> 
> Although I'm preparing the next path series including the fixes,
> you can also try postcopy by disabling paravirt_ops or disabling kvm
> (use tcg i.e. -machine accel:tcg).

Disabling the KVM pv clock would be OK.
Passing no-kvmclock on the guest kernel command line disables it.

> thanks,
> 
> 
> On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> > It's
> > WCHAN              COMMAND
> > umem_fault------qemu-system-x86
> >  
> >  
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-12 16:54
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
> > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > Hi , I've dug more thess days
> > >  
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only single EOS line. Just copy & past miss?
> > >  
> > > There must be two EOS for one is coming from postcopy_outgoing_ram_save_live
> > > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other is
> > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END)
> > > I think in postcopy the ram_save_live in the iterate part can be ignore
> > > so why there still have the qemu_put_byte(f, QEMU_VM_SECTON_PART) and
> > > qemu_put_byte(f, QEMU_VM_SECTON_END) in the procedure? Is it essential?
> >  
> > Not so essential.
> >  
> > > Can you please track it down one more step?
> > > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > > block.(backtrace by the debugger would be best.)
> > >
> > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) and never return
> > > so it gets stuck
> >  
> > Do you know what wchan the process was blocked at?
> > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> >  
> >  
> > > when I check the EOS problem
> > > I just annotated the qemu_put_byte(f, QEMU_VM_SECTION_PART);
> >  and qemu_put_be32
> > > (f, se->section_id)
> > >  (I think this is a wrong way to fix it and I don't know how it get through)
> > > and leave just the se->save_live_state in the qemu_savevm_state_iterate
> > > it didn't get stuck at kvm_put_msrs()
> > > but it has some other error
> > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > migration-tcp: accepted migration
> > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > migration: successfully loaded vm state
> > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > Can't find block !
> > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > and at the same time , the destination node didn't show the EOS
> > >  
> > > so I still can't solve the stuck problem
> > > Thanks for your help~!
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > ━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-11 10:45
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
> > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > Hello all!
> > >  
> > > Hi, thank you for detailed report. The procedure you've tried looks
> > > good basically. Some comments below.
> > >  
> > > > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211) and
> > > > patched it correctly
> > > > but it still didn't make sense and I got the same scenario as before
> > > > outgoing node intel x86_64; incoming node amd x86_64. guest image is on nfs
> > > >  
> > > >
> >  I think I should show what I do more clearly and hope somebody can figure out
> > > > the problem
> > > > 
> > > >  ・ 1, both in/out node patch the qemu and start on 3.1.7 kernel with umem
> > > > 
> > > >        ./configure --target-list=
> > > x86_64-softmmu --enable-kvm --enable-postcopy
> > > > --enable-debug
> > > >        make
> > > >        make install
> > > > 
> > > >  ・ 2, outgoing qemu:
> > > > 
> > > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > > > -machine accel=kvm
> > > > incoming qemu:
> > > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > > > 
> > > >  ・ 3, outgoing node:
> > > > 
> > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > >  
> > > > result:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > info status: VM-status: paused (finish-migrate);
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > can't type any more and can't kill the process(qemu-system-x86)
> > > >  
> > > > I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > > 
> > > >  ・ outgoing qemu:
> > > > 
> > > > (qemu) migration-tcp: connect completed
> > > > migration: beginning savevm
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > migration: iterate
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > migration: done iterating
> > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > 
> > > >  ・ incoming qemu:
> > > > 
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > >  
> > > There should be only single EOS line. Just copy & past miss?
> > >  
> > >  
> > > > from the result:
> > > > It didn't get to the "successfully loaded vm state"
> > > > So it still in the qemu_loadvm_state, and I found it's in
> > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs and got
> > > > stuck
> > >  
> > > Can you please track it down one more step?
> > > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > > block.(backtrace by the debugger would be best.)
> > >  
> > > If possible, can you please test with more simplified configuration.
> > > i.e. drop device as much as possible i.e. no usbdevice, no disk...
> > > So the debug will be simplified.
> > >  
> > > thanks,
> > >  
> > > > Does anyone give some advises on the problem?
> > > > Thanks very much~
> > > >  
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > ━
> > > ━
> > > > Tommy
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2011-12-29 09:25
> > > > To: kvm; qemu-devel
> > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > Intro
> > > > =====
> > > > This patch series implements postcopy live migration.[1]
> > > > As discussed at KVM forum 2011, dedicated character device is used for
> > > > distributed shared memory between migration source and destination.
> > > > Now we can discuss/benchmark/compare with precopy. I believe there are
> > > > much rooms for improvement.
> > > >  
> > > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > > >  
> > > >  
> > > > Usage
> > > > =====
> > > > You need load umem character device on the host before starting migration.
> > > > Postcopy can be used for tcg and kvm accelarator. The implementation depend
> > > > on only linux umem character device. But the driver dependent code is split
> > > > into a file.
> > > > I tested only host page size ==
> >  guest page size case, but the implementation
> > > > allows host page size != guest page size case.
> > > >  
> > > > The following options are added with this patch series.
> > > > - incoming part
> > > >   command line options
> > > >   -postcopy [-postcopy-flags <flags>]
> > > >   where flags is for changing behavior for benchmark/debugging
> > > >   Currently the following flags are available
> > > >   0: default
> > > >   1: enable touching page request
> > > >  
> > > >   example:
> > > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > > >  
> > > > - outging part
> > > >   options for migrate command 
> > > >   migrate [-p [-n]] URI
> > > >   -p: indicate postcopy migration
> > > >   -n: disable background transferring pages: This is for benchmark/
> > debugging
> > > >  
> > > >   example:
> > > >   migrate -p -n tcp:<dest ip address>:4444
> > > >  
> > > >  
> > > > TODO
> > > > ====
> > > > - benchmark/evaluation. Especially how async page fault affects the result.
> > > > - improve/optimization
> > > >   At the moment at least what I'm aware of is
> > > >   - touching pages in incoming qemu process by fd handler seems suboptimal.
> > > >     creating dedicated thread?
> > > >   - making incoming socket non-blocking
> > > >   - outgoing handler seems suboptimal causing latency.
> > > > - catch up memory API change
> > > > - consider on FUSE/CUSE possibility
> > > > - and more...
> > > >  
> > > > basic postcopy work flow
> > > > ========================
> > > >         qemu on the destination
> > > >               |
> > > >               V
> > > >         open(/dev/umem)
> > > >               |
> > > >               V
> > > >         UMEM_DEV_CREATE_UMEM
> > > >               |
> > > >               V
> > > >         Here we have two file descriptors to
> > > >         umem device and shmem file
> > > >               |
> > > >               |                                  umemd
> > > >               |                                  daemon on the destination
> > > >               |
> > > >               V    create pipe to communicate
> > > >         fork()---------------------------------------,
> > > >               |                                      |
> > > >               V                                      |
> > > >         close(socket)                                V
> > > >         close(shmem)                              mmap(shmem file)
> > > >               |                                      |
> > > >               V                                      V
> > > >         mmap(umem device) for guest RAM           close(shmem file)
> > > >               |                                      |
> > > >         close(umem device)                           |
> > > >               |                                      |
> > > >               V                                      |
> > > >         wait for ready from daemon <----pipe-----send ready message
> > > >               |                                      |
> > > >               |                                 Here the daemon takes over 
> > > >         send ok------------pipe---------------> the owner of the socket    
> > > >               |         to the source              
> > > >               V                                      |
> > > >         entering post copy stage                     |
> > > >         start guest execution                        |
> > > >               |                                      |
> > > >               V                                      V
> > > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > > >               |                                      |
> > > >               V                                      V
> > > >         page fault ------------------------------>page offset is returned
> > > >         block                                        |
> > > >                                                      V
> > > >                                                   pull page from the source
> > > >                                                   write the page contents
> > > >                                                   to the shmem.
> > > >                                                      |
> > > >                                                      V
> > > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > > >         the fault handler returns the page
> > > >         page fault is resolved
> > > >               |
> > > >               |                                   pages can be sent
> > > >               |                                   backgroundly
> > > >               |                                      |
> > > >               |                                      V
> > > >               |                                   UMEM_MARK_PAGE_CACHED
> > > >               |                                      |
> > > >               V                                      V
> > > >         The specified pages<-----pipe------------request to touch pages
> > > >         are made present by                          |
> > > >         touching guest RAM.                          |
> > > >               |                                      |
> > > >               V                                      V
> > > >              reply-------------pipe-------------> release the cached page
> > > >               |                                   madvise(MADV_REMOVE)
> > > >               |                                      |
> > > >               V                                      V
> > > >  
> > > >                  all the pages are pulled from the source
> > > >  
> > > >               |                                      |
> > > >               V                                      V
> > > >         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
> > > >        (note: I'm not sure if this can be implemented or not)
> > > >               |                                      |
> > > >               V                                      V
> > > >         migration completes                        exit()
> > > >  
> > > >  
> > > >  
> > > > Isaku Yamahata (21):
> > > >   arch_init: export sort_ram_list() and ram_save_block()
> > > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > > >   arch_init/ram_save: introduce constant for ram save version = 4
> > > >   arch_init: refactor host_from_stream_offset()
> > > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > > >   arch_init: refactor ram_save_block()
> > > >   arch_init/ram_save_live: factor out ram_save_limit
> > > >   arch_init/ram_load: refactor ram_load
> > > >   exec.c: factor out qemu_get_ram_ptr()
> > > >   exec.c: export last_ram_offset()
> > > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > > >   savevm: qemu_pending_size() to return pending buffered size
> > > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > > >     file
> > > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > > >   migration: factor out parameters into MigrationParams
> > > >   umem.h: import Linux umem.h
> > > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > > >   configure: add CONFIG_POSTCOPY option
> > > >   postcopy: introduce -postcopy and -postcopy-flags option
> > > >   postcopy outgoing: add -p and -n option to migrate command
> > > >   postcopy: implement postcopy livemigration
> > > >  
> > > >  Makefile.target                 |    4 +
> > > >  arch_init.c                     |  260 ++++---
> > > >  arch_init.h                     |   20 +
> > > >  block-migration.c               |    8 +-
> > > >  buffered_file.c                 |   20 +-
> > > >  buffered_file.h                 |    1 +
> > > >  configure                       |   12 +
> > > >  cpu-all.h                       |    9 +
> > > >  exec-obsolete.h                 |    1 +
> > > >  exec.c                          |   75 +-
> > > >  hmp-commands.hx                 |   12 +-
> > > >  hw/hw.h                         |    7 +-
> > > >  linux-headers/linux/umem.h      |   83 ++
> > > >  migration-exec.c                |    8 +
> > > >  migration-fd.c                  |   30 +
> > > >  migration-postcopy-stub.c       |   77 ++
> > > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > > >  migration-tcp.c                 |   37 +-
> > > >  migration-unix.c                |   32 +-
> > > >  migration.c                     |   53 +-
> > > >  migration.h                     |   49 +-
> > > >  qemu-common.h                   |    2 +
> > > >  qemu-options.hx                 |   25 +
> > > >  qmp-commands.hx                 |   10 +-
> > > >  savevm.c                        |   31 +-
> > > >  scripts/update-linux-headers.sh |    2 +-
> > > >  sysemu.h                        |    4 +-
> > > >  umem.c                          |  379 ++++++++
> > > >  umem.h                          |  105 +++
> > > >  vl.c                            |   20 +-
> > > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > > >  create mode 100644 linux-headers/linux/umem.h
> > > >  create mode 100644 migration-postcopy-stub.c
> > > >  create mode 100644 migration-postcopy.c
> > > >  create mode 100644 umem.c
> > > >  create mode 100644 umem.h
> > > >  
> > > >  
> > > >  
> > >  
> > > -- 
> > > yamahata
> > >  
> > >  
> >  
> > -- 
> > yamahata
> >  
> >  
> 
> -- 
> yamahata
> 

-- 
yamahata

[-- Attachment #1.2: Type: text/html, Size: 62659 bytes --]

[-- Attachment #2: Catch.jpg --]
[-- Type: image/jpeg, Size: 169899 bytes --]

[-- Attachment #3: CatchE90F.jpg --]
[-- Type: image/jpeg, Size: 256595 bytes --]

[-- Attachment #4: Catch89D3.jpg --]
[-- Type: image/jpeg, Size: 60718 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Reply: [PATCH 00/21][RFC] postcopy live migration
  2012-03-12  8:36                 ` [Qemu-devel] " thfbjyddx
@ 2012-03-13  3:21                   ` Isaku Yamahata
  -1 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-03-13  3:21 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

I fixed several issues locally, and am planning to post v2 of the patches
soon.

On Mon, Mar 12, 2012 at 04:36:53PM +0800, thfbjyddx wrote:
> Hi
> Thank you for your repley!
> I've tried with -machine accel:tcg and I got this below:
>  
> src node:
> [cid]
>  
> des node:
> [cid]
> and sometimes the last line got
> [cid]
>  
> I'm wondering that why the umemd.mig_read_fd can be read again without the des
> node's page_req
> from the last line on src node, we can see it doesn't get page req. But the des
> node read the umemd.mig_read_fd again.
> I think umemd.mig_read_fd can be read only when the src node send sth to the
> socket. Is there any other situation?

This will be addressed by the v2 patches. The reading part is made fully
non-blocking.

But I think the select multiplexing + non-blocking IO should eventually be
replaced with a thread + blocking IO.
Given that smarter page compression (e.g. XBRLE) is coming, it won't be
practical to keep those paths non-blocking.
Threading will cope with those patches and can take advantage of the new
compression.


> BTW can you tell me how does the KVM pv clock make the patches work
> incorrectly?

PV clock touches guest pages during load, before umem is enabled.
I also found that many other devices touch guest pages: virtio-balloon,
audio devices, and so on.
I addressed it locally by enabling umem before device loading; the fix will
be included in v2.
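The ordering bug can be shown with a toy model (purely illustrative: `tracking`, `device_touch`, and the bitmap are invented names, not QEMU code). A guest-page access made by a device-load handler before the fault backing is armed bypasses the fault path entirely, which is why umem has to be enabled first:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

static bool tracking;            /* stands in for "umem enabled"         */
static bool faulted[NPAGES];     /* accesses the fault handler observed  */
static char guest_ram[NPAGES];   /* toy guest RAM                        */

/* A device-load handler writing guest RAM (as kvmclock does). */
static void device_touch(int page)
{
    if (tracking)
        faulted[page] = true;    /* only armed tracking sees the access  */
    guest_ram[page] = 1;         /* the write happens either way         */
}

static void reset(void)
{
    tracking = false;
    memset(faulted, 0, sizeof(faulted));
    memset(guest_ram, 0, sizeof(guest_ram));
}
```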

thanks,

> Thanks
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-16 18:17
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
> On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> > Thank you for your info.
> > I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> > Your kernel enables KVM paravirt_ops, right?
> > 
> > Although I'm preparing the next path series including the fixes,
> > you can also try postcopy by disabling paravirt_ops or disabling kvm
> > (use tcg i.e. -machine accel:tcg).
>  
> Disabling KVM pv clock would be ok.
> Passing no-kvmclock to guest kernel disables it.
>  
> > thanks,
> > 
> > 
> > On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> > >  
> > > Do you know what wchan the process was blocked at?
> > > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> > >  
> > > It's
> > > WCHAN              COMMAND
> > > umem_fault------qemu-system-x86
> > >  
> > >  
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━━
> > > Tommy
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-12 16:54
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
> > > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > > Hi , I've dug more thess days
> > > >  
> > > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > > migration-tcp: accepted migration
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > >  
> > > > There should be only single EOS line. Just copy & past miss?
> > > >  
> > > >
>  There must be two EOS for one is coming from postcopy_outgoing_ram_save_live
> > > > (...stage == QEMU_SAVE_LIVE_STAGE_PART) and the other is
> > > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END)
> > > > I think in postcopy the ram_save_live in the iterate part can be ignore
> > > > so why there still have the qemu_put_byte(f, QEMU_VM_SECTON_PART) and
> > > > qemu_put_byte(f, QEMU_VM_SECTON_END) in the procedure? Is it essential?
> > >  
> > > Not so essential.
> > >  
> > > > Can you please track it down one more step?
> > > > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > > > block.(backtrace by the debugger would be best.)
> > > >
> > > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data)
>  and never return
> > > > so it gets stuck
> > >  
> > > Do you know what wchan the process was blocked at?
> > > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> > >  
> > >  
> > > > when I check the EOS problem
> > > > I just annotated the qemu_put_byte(f, QEMU_VM_SECTION_PART);
> > >  and qemu_put_be32
> > > > (f, se->section_id)
> > > >  
> (I think this is a wrong way to fix it and I don't know how it get through)
> > > > and leave just the se->save_live_state in the qemu_savevm_state_iterate
> > > > it didn't get stuck at kvm_put_msrs()
> > > > but it has some other error
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > > migration: successfully loaded vm state
> > > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > >
>  2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > > Can't find block !
> > > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > > and at the same time , the destination node didn't show the EOS
> > > >  
> > > > so I still can't solve the stuck problem
> > > > Thanks for your help~!
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━━
> > > ━
> > > > Tommy
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2012-01-11 10:45
> > > > To: thfbjyddx
> > > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > > Subject: Re: [Qemu-devel] Reply: [PATCH 00/21][RFC] postcopy live migration
> > > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > > Hello all!
> > > >  
> > > > Hi, thank you for detailed report. The procedure you've tried looks
> > > > good basically. Some comments below.
> > > >  
> > > > > I got the qemu basic version(03ecd2c80a64d030a22fe67cc7a60f24e17ff211)
>  and
> > > > > patched it correctly
> > > > > but it still didn't make sense and I got the same scenario as before
> > > > > outgoing node intel x86_64;
>  incoming node amd x86_64. guest image is on nfs
> > > > >  
> > > > >
> > >
>   I think I should show what I do more clearly and hope somebody can figure out
> > > > > the problem
> > > > > 
> > > > >  ・ 1, both in/
> out node patch the qemu and start on 3.1.7 kernel with umem
> > > > > 
> > > > >        ./configure --target-list=
> > > > x86_64-softmmu --enable-kvm --enable-postcopy
> > > > > --enable-debug
> > > > >        make
> > > > >        make install
> > > > > 
> > > > >  ・ 2, outgoing qemu:
> > > > > 
> > > > >
>  qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc: 2 -usbdevice tablet
> > > > > -machine accel=kvm
> > > > > incoming qemu:
> > > > >
>  qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 -monitor
> > > > > stdio -vnc: 2 -usbdevice tablet -machine accel=kvm
> > > > > 
> > > > >  ・ 3, outgoing node:
> > > > > 
> > > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > > >  
> > > > > result:
> > > > > 
> > > > >  ・ outgoing qemu:
> > > > > 
> > > > > info status: VM-status: paused (finish-migrate);
> > > > > 
> > > > >  ・ incoming qemu:
> > > > > 
> > > > > can't type any more and can't kill the process(qemu-system-x86)
> > > > >  
> > > > >
>  I open the debug flag in migration.c migration-tcp.c migration-postcopy.c:
> > > > > 
> > > > >  ・ outgoing qemu:
> > > > > 
> > > > > (qemu) migration-tcp: connect completed
> > > > > migration: beginning savevm
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > > migration: iterate
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > > migration: done iterating
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > > 
> > > > >  ・ incoming qemu:
> > > > > 
> > > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > > migration-tcp: accepted migration
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > >  
> > > > There should be only single EOS line. Just copy & past miss?
> > > >  
> > > >  
> > > > > from the result:
> > > > > It didn't get to the "successfully loaded vm state"
> > > > > So it still in the qemu_loadvm_state, and I found it's in
> > > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->
> kvm_put_msrs and got
> > > > > stuck
> > > >  
> > > > Can you please track it down one more step?
> > > > Which line did it stuck in kvm_put_msrs()? kvm_put_msrs() doesn't seem to
> > > > block.(backtrace by the debugger would be best.)
> > > >  
> > > > If possible, can you please test with more simplified configuration.
> > > > i.e. drop device as much as possible i.e. no usbdevice, no disk...
> > > > So the debug will be simplified.
> > > >  
> > > > thanks,
> > > >  
> > > > > Does anyone give some advises on the problem?
> > > > > Thanks very much~
> > > > >  
> > > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > > > Tommy
> > > > >  
> > > > > From: Isaku Yamahata
> > > > > Date: 2011-12-29 09:25
> > > > > To: kvm; qemu-devel
> > > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > > Intro
> > > > > =====
> > > > > This patch series implements postcopy live migration.[1]
> > > > > As discussed at KVM forum 2011, dedicated character device is used for
> > > > > distributed shared memory between migration source and destination.
> > > > > Now we can discuss/benchmark/compare it with precopy. I believe there is
> > > > > much room for improvement.
> > > > >  
> > > > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > > > >  
> > > > >  
> > > > > Usage
> > > > > =====
> > > > > You need to load the umem character device on the host before starting
> > > > > migration.
> > > > > Postcopy can be used with the tcg and kvm accelerators. The implementation
> > > > > depends only on the Linux umem character device, but the driver-dependent
> > > > > code is split into its own file.
> > > > > I tested only the host page size == guest page size case, but the
> > > > > implementation allows host page size != guest page size.
> > > > >  
> > > > > The following options are added with this patch series.
> > > > > - incoming part
> > > > >   command line options
> > > > >   -postcopy [-postcopy-flags <flags>]
> > > > >   where <flags> changes the behavior for benchmarking/debugging
> > > > >   Currently the following flags are available
> > > > >   0: default
> > > > >   1: enable touching page request
> > > > >  
> > > > >   example:
> > > > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > > > >  
> > > > > - outgoing part
> > > > >   options for migrate command 
> > > > >   migrate [-p [-n]] URI
> > > > >   -p: indicates postcopy migration
> > > > >   -n: disables background page transfer (for benchmarking/debugging)
> > > > >  
> > > > >   example:
> > > > >   migrate -p -n tcp:<dest ip address>:4444
> > > > >  
> > > > >  
> > > > > TODO
> > > > > ====
> > > > > - benchmark/evaluation. Especially how async page fault affects the
> > > > >   result.
> > > > > - improve/optimization
> > > > >   At the moment at least what I'm aware of is
> > > > >   - touching pages in the incoming qemu process from the fd handler seems
> > > > >     suboptimal; create a dedicated thread?
> > > > >   - making incoming socket non-blocking
> > > > >   - outgoing handler seems suboptimal causing latency.
> > > > > - catch up memory API change
> > > > > - consider on FUSE/CUSE possibility
> > > > > - and more...
> > > > >  
> > > > > basic postcopy work flow
> > > > > ========================
> > > > >         qemu on the destination
> > > > >               |
> > > > >               V
> > > > >         open(/dev/umem)
> > > > >               |
> > > > >               V
> > > > >         UMEM_DEV_CREATE_UMEM
> > > > >               |
> > > > >               V
> > > > >         Here we have two file descriptors to
> > > > >         umem device and shmem file
> > > > >               |
> > > > >               |                                  umemd
> > > > >               |                                  daemon on the destination
> > > > >               |
> > > > >               V    create pipe to communicate
> > > > >         fork()---------------------------------------,
> > > > >               |                                      |
> > > > >               V                                      |
> > > > >         close(socket)                                V
> > > > >         close(shmem)                              mmap(shmem file)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         mmap(umem device) for guest RAM           close(shmem file)
> > > > >               |                                      |
> > > > >         close(umem device)                           |
> > > > >               |                                      |
> > > > >               V                                      |
> > > > >         wait for ready from daemon <----pipe-----send ready message
> > > > >               |                                      |
> > > > >               |                                  Here the daemon takes over
> > > > >         send ok------------pipe--------------->  ownership of the socket
> > > > >               |                                  to the source
> > > > >               V                                      |
> > > > >         entering post copy stage                     |
> > > > >         start guest execution                        |
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         page fault ------------------------------> page offset is returned
> > > > >         block                                        |
> > > > >                                                      V
> > > > >                                                   pull page from the source
> > > > >                                                   write the page contents
> > > > >                                                   to the shmem.
> > > > >                                                      |
> > > > >                                                      V
> > > > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > > > >         the fault handler returns the page
> > > > >         page fault is resolved
> > > > >               |
> > > > >               |                                   pages can be sent
> > > > >               |                                   in the background
> > > > >               |                                      |
> > > > >               |                                      V
> > > > >               |                                   UMEM_MARK_PAGE_CACHED
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         The specified pages<-----pipe------------request to touch pages
> > > > >         are made present by                          |
> > > > >         touching guest RAM.                          |
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >              reply-------------pipe-------------> release the cached page
> > > > >               |                                   madvise(MADV_REMOVE)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >  
> > > > >                  all the pages are pulled from the source
> > > > >  
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         the vma becomes anonymous <----------------UMEM_MAKE_VMA_ANONYMOUS
> > > > >        (note: I'm not sure if this can be implemented or not)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         migration completes                        exit()
> > > > >  
> > > > >  
> > > > >  
> > > > > Isaku Yamahata (21):
> > > > >   arch_init: export sort_ram_list() and ram_save_block()
> > > > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > > > >   arch_init/ram_save: introduce constant for ram save version = 4
> > > > >   arch_init: refactor host_from_stream_offset()
> > > > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > > > >   arch_init: refactor ram_save_block()
> > > > >   arch_init/ram_save_live: factor out ram_save_limit
> > > > >   arch_init/ram_load: refactor ram_load
> > > > >   exec.c: factor out qemu_get_ram_ptr()
> > > > >   exec.c: export last_ram_offset()
> > > > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > > > >   savevm: qemu_pending_size() to return pending buffered size
> > > > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > > > >     file
> > > > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > > > >   migration: factor out parameters into MigrationParams
> > > > >   umem.h: import Linux umem.h
> > > > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > > > >   configure: add CONFIG_POSTCOPY option
> > > > >   postcopy: introduce -postcopy and -postcopy-flags option
> > > > >   postcopy outgoing: add -p and -n option to migrate command
> > > > >   postcopy: implement postcopy livemigration
> > > > >  
> > > > >  Makefile.target                 |    4 +
> > > > >  arch_init.c                     |  260 ++++---
> > > > >  arch_init.h                     |   20 +
> > > > >  block-migration.c               |    8 +-
> > > > >  buffered_file.c                 |   20 +-
> > > > >  buffered_file.h                 |    1 +
> > > > >  configure                       |   12 +
> > > > >  cpu-all.h                       |    9 +
> > > > >  exec-obsolete.h                 |    1 +
> > > > >  exec.c                          |   75 +-
> > > > >  hmp-commands.hx                 |   12 +-
> > > > >  hw/hw.h                         |    7 +-
> > > > >  linux-headers/linux/umem.h      |   83 ++
> > > > >  migration-exec.c                |    8 +
> > > > >  migration-fd.c                  |   30 +
> > > > >  migration-postcopy-stub.c       |   77 ++
> > > > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > > > >  migration-tcp.c                 |   37 +-
> > > > >  migration-unix.c                |   32 +-
> > > > >  migration.c                     |   53 +-
> > > > >  migration.h                     |   49 +-
> > > > >  qemu-common.h                   |    2 +
> > > > >  qemu-options.hx                 |   25 +
> > > > >  qmp-commands.hx                 |   10 +-
> > > > >  savevm.c                        |   31 +-
> > > > >  scripts/update-linux-headers.sh |    2 +-
> > > > >  sysemu.h                        |    4 +-
> > > > >  umem.c                          |  379 ++++++++
> > > > >  umem.h                          |  105 +++
> > > > >  vl.c                            |   20 +-
> > > > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > > > >  create mode 100644 linux-headers/linux/umem.h
> > > > >  create mode 100644 migration-postcopy-stub.c
> > > > >  create mode 100644 migration-postcopy.c
> > > > >  create mode 100644 umem.c
> > > > >  create mode 100644 umem.h
> > > > >  
> > > > >  
> > > > >  
-- 
yamahata

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [Qemu-devel] 回??: [PATCH 00/21][RFC] postcopy live?migration
@ 2012-03-13  3:21                   ` Isaku Yamahata
  0 siblings, 0 replies; 88+ messages in thread
From: Isaku Yamahata @ 2012-03-13  3:21 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

I fixed several issues locally, and am planning to post v2 of the patches
soon.

On Mon, Mar 12, 2012 at 04:36:53PM +0800, thfbjyddx wrote:
> Hi
> Thank you for your reply!
> I've tried with -machine accel:tcg and I got this below:
>  
> src node:
> [cid]
>  
> des node:
> [cid]
> and sometimes the last line got
> [cid]
>  
> I'm wondering why umemd.mig_read_fd can become readable again without the
> destination node's page_req.
> From the last line on the src node, we can see it doesn't get a page request,
> but the destination node reads umemd.mig_read_fd again.
> I think umemd.mig_read_fd can only become readable when the src node sends
> something to the socket. Is there any other situation?

This will be addressed in the v2 patches: the reading part is made fully
non-blocking.

But I think the select multiplexing + non-blocking IO should eventually be
replaced with a dedicated thread + blocking IO.
Given that smarter page compression (e.g. XBRLE) is coming, it won't be
practical to keep those code paths non-blocking.
Threading will cope with those patches and can take advantage of the new
compression.


> BTW, can you tell me how the KVM pv clock makes the patches work
> incorrectly?

The PV clock touches guest pages during load, before umem is enabled.
I also found that many other devices touch guest pages: virtio-balloon,
audio devices, and so on.
I addressed it locally by enabling umem before device loading; the fix will
be included in v2.

thanks,

> Thanks
>  
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-16 18:17
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live?migration
> On Mon, Jan 16, 2012 at 03:51:16PM +0900, Isaku Yamahata wrote:
> > Thank you for your info.
> > I suppose I found the cause, MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME.
> > Your kernel enables KVM paravirt_ops, right?
> > 
> > Although I'm preparing the next patch series including the fixes,
> > you can also try postcopy by disabling paravirt_ops or disabling kvm
> > (use tcg, i.e. -machine accel=tcg).
>  
> Disabling KVM pv clock would be ok.
> Passing no-kvmclock to guest kernel disables it.
>  
> > thanks,
> > 
> > 
> > On Thu, Jan 12, 2012 at 09:26:03PM +0800, thfbjyddx wrote:
> > >  
> > > Do you know what wchan the process was blocked at?
> > > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> > >  
> > > It's
> > > WCHAN              COMMAND
> > > umem_fault------qemu-system-x86
> > >  
> > >  
> > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > >  
> > > From: Isaku Yamahata
> > > Date: 2012-01-12 16:54
> > > To: thfbjyddx
> > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live?migration
> > > On Thu, Jan 12, 2012 at 04:29:44PM +0800, thfbjyddx wrote:
> > > > Hi, I've dug into this more these days
> > > >  
> > > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > > migration-tcp: accepted migration
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > >  
> > > > There should be only a single EOS line. Just a copy & paste mistake?
> > > >  
> > > > There must be two EOS markers: one comes from
> > > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_PART)
> > > > and the other from
> > > > postcopy_outgoing_ram_save_live(...stage == QEMU_SAVE_LIVE_STAGE_END).
> > > > I think in postcopy the ram_save_live in the iterate part can be ignored,
> > > > so why are qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > > > qemu_put_byte(f, QEMU_VM_SECTION_END) still in the procedure? Are they
> > > > essential?
> > >  
> > > Not so essential.
> > >  
> > > > Can you please track it down one more step?
> > > > At which line did it get stuck in kvm_put_msrs()? kvm_put_msrs() doesn't
> > > > seem to block. (A backtrace from the debugger would be best.)
> > > >
> > > > it gets to the kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) call and
> > > > never returns, so it gets stuck
> > >  
> > > Do you know what wchan the process was blocked at?
> > > kvm_vcpu_ioctl(env, KVM_SET_MSRS, &msr_data) doesn't seem to block.
> > >  
> > >  
> > > > When I checked the EOS problem,
> > > > I just commented out the qemu_put_byte(f, QEMU_VM_SECTION_PART) and
> > > > qemu_put_be32(f, se->section_id) calls
> > > > (I think this is the wrong way to fix it, and I don't know how it gets
> > > > through)
> > > > and leave just the se->save_live_state in the qemu_savevm_state_iterate
> > > > it didn't get stuck at kvm_put_msrs()
> > > > but it has some other error
> > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > migration-tcp: accepted migration
> > > > 2126:2126 postcopy_incoming_ram_load:1018: incoming ram load
> > > > 2126:2126 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > 2126:2126 postcopy_incoming_ram_load:1057: done
> > > > migration: successfully loaded vm state
> > > > 2126:2126 postcopy_incoming_fork_umemd:1069: fork
> > > > 2126:2126 postcopy_incoming_fork_umemd:1127: qemu pid: 2126 daemon pid: 2129
> > > > 2130:2130 postcopy_incoming_umemd:1840: daemon pid: 2130
> > > > 2130:2130 postcopy_incoming_umemd:1875: entering umemd main loop
> > > > Can't find block !
> > > > 2130:2130 postcopy_incoming_umem_ram_load:1526: shmem == NULL
> > > > 2130:2130 postcopy_incoming_umemd:1882: exiting umemd main loop
> > > > and at the same time , the destination node didn't show the EOS
> > > >  
> > > > so I still can't solve the stuck problem
> > > > Thanks for your help~!
> > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > >  
> > > > From: Isaku Yamahata
> > > > Date: 2012-01-11 10:45
> > > > To: thfbjyddx
> > > > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > > > Subject: Re: [Qemu-devel]回??: [PATCH 00/21][RFC] postcopy live migration
> > > > On Sat, Jan 07, 2012 at 06:29:14PM +0800, thfbjyddx wrote:
> > > > > Hello all!
> > > >  
> > > > Hi, thank you for detailed report. The procedure you've tried looks
> > > > good basically. Some comments below.
> > > >  
> > > > > I got the qemu base version (03ecd2c80a64d030a22fe67cc7a60f24e17ff211)
> > > > > and patched it correctly,
> > > > > but it still didn't work and I got the same scenario as before.
> > > > > Outgoing node: Intel x86_64; incoming node: AMD x86_64. The guest image
> > > > > is on NFS.
> > > > >  
> > > > > I think I should show what I do more clearly, and I hope somebody can
> > > > > figure out the problem
> > > > > 
> > > > >  ・ 1, both the in and out nodes patch qemu and start on a 3.1.7 kernel
> > > > >    with umem
> > > > > 
> > > > >        ./configure --target-list=x86_64-softmmu --enable-kvm \
> > > > >            --enable-postcopy --enable-debug
> > > > >        make
> > > > >        make install
> > > > > 
> > > > >  ・ 2, outgoing qemu:
> > > > > 
> > > > > qemu-system-x86_64 -m 256 -hda xxx -monitor stdio -vnc :2 \
> > > > >     -usbdevice tablet -machine accel=kvm
> > > > > incoming qemu:
> > > > > qemu-system-x86_64 -m 256 -hda xxx -postcopy -incoming tcp:0:8888 \
> > > > >     -monitor stdio -vnc :2 -usbdevice tablet -machine accel=kvm
> > > > > 
> > > > >  ・ 3, outgoing node:
> > > > > 
> > > > > migrate -d -p -n tcp:(incoming node ip):8888
> > > > >  
> > > > > result:
> > > > > 
> > > > >  ・ outgoing qemu:
> > > > > 
> > > > > info status: VM-status: paused (finish-migrate);
> > > > > 
> > > > >  ・ incoming qemu:
> > > > > 
> > > > > can't type any more and can't kill the process(qemu-system-x86)
> > > > >  
> > > > > I enabled the debug flags in migration.c, migration-tcp.c, and
> > > > > migration-postcopy.c:
> > > > > 
> > > > >  ・ outgoing qemu:
> > > > > 
> > > > > (qemu) migration-tcp: connect completed
> > > > > migration: beginning savevm
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 1
> > > > > migration: iterate
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 2
> > > > > migration: done iterating
> > > > > 4500:4500 postcopy_outgoing_ram_save_live:540: stage 3
> > > > > 4500:4500 postcopy_outgoing_begin:716: outgoing begin
> > > > > 
> > > > >  ・ incoming qemu:
> > > > > 
> > > > > (qemu) migration-tcp: Attempting to start an incoming migration
> > > > > migration-tcp: accepted migration
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x10870000 flags 0x4
> > > > > 4872:4872 postcopy_incoming_ram_load:1057: done
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > > > 4872:4872 postcopy_incoming_ram_load:1018: incoming ram load
> > > > > 4872:4872 postcopy_incoming_ram_load:1031: addr 0x0 flags 0x10
> > > > > 4872:4872 postcopy_incoming_ram_load:1037: EOS
> > > >  
> > > > There should be only a single EOS line. Just a copy & paste mistake?
> > > >  
> > > >  
> > > > > from the result:
> > > > > It didn't get to the "successfully loaded vm state"
> > > > > So it's still in qemu_loadvm_state, and I found it's in
> > > > > cpu_synchronize_all_post_init->kvm_arch_put_registers->kvm_put_msrs
> > > > > and got stuck
> > > >  
> > > > Can you please track it down one more step?
> > > > At which line did it get stuck in kvm_put_msrs()? kvm_put_msrs() doesn't
> > > > seem to block. (A backtrace from the debugger would be best.)
> > > >  
> > > > If possible, can you please test with a more simplified configuration,
> > > > i.e. dropping devices as much as possible (no usbdevice, no disk, ...),
> > > > so that the debugging is simplified.
> > > >  
> > > > thanks,
> > > >  
> > > > > Can anyone give some advice on this problem?
> > > > > Thanks very much~
> > > > >  
> > > > > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> > > > > Tommy
> > > > >  
> > > > > From: Isaku Yamahata
> > > > > Date: 2011-12-29 09:25
> > > > > To: kvm; qemu-devel
> > > > > CC: yamahata; t.hirofuchi; satoshi.itoh
> > > > > Subject: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
> > > > > Intro
> > > > > =====
> > > > > This patch series implements postcopy live migration.[1]
> > > > > As discussed at KVM forum 2011, dedicated character device is used for
> > > > > distributed shared memory between migration source and destination.
> > > > > Now we can discuss/benchmark/compare it with precopy. I believe there is
> > > > > much room for improvement.
> > > > >  
> > > > > [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
> > > > >  
> > > > >  
> > > > > Usage
> > > > > =====
> > > > > You need to load the umem character device on the host before starting
> > > > > migration.
> > > > > Postcopy can be used with the tcg and kvm accelerators. The implementation
> > > > > depends only on the Linux umem character device, but the driver-dependent
> > > > > code is split into its own file.
> > > > > I tested only the host page size == guest page size case, but the
> > > > > implementation allows host page size != guest page size.
> > > > >  
> > > > > The following options are added with this patch series.
> > > > > - incoming part
> > > > >   command line options
> > > > >   -postcopy [-postcopy-flags <flags>]
> > > > >   where <flags> changes the behavior for benchmarking/debugging
> > > > >   Currently the following flags are available
> > > > >   0: default
> > > > >   1: enable touching page request
> > > > >  
> > > > >   example:
> > > > >   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
> > > > >  
> > > > > - outgoing part
> > > > >   options for migrate command 
> > > > >   migrate [-p [-n]] URI
> > > > >   -p: indicates postcopy migration
> > > > >   -n: disables background page transfer (for benchmarking/debugging)
> > > > >  
> > > > >   example:
> > > > >   migrate -p -n tcp:<dest ip address>:4444
> > > > >  
> > > > >  
> > > > > TODO
> > > > > ====
> > > > > - benchmark/evaluation. Especially how async page fault affects the
> > > > >   result.
> > > > > - improve/optimization
> > > > >   At the moment at least what I'm aware of is
> > > > >   - touching pages in the incoming qemu process from the fd handler seems
> > > > >     suboptimal; create a dedicated thread?
> > > > >   - making incoming socket non-blocking
> > > > >   - outgoing handler seems suboptimal causing latency.
> > > > > - catch up memory API change
> > > > > - consider on FUSE/CUSE possibility
> > > > > - and more...
> > > > >  
> > > > > basic postcopy work flow
> > > > > ========================
> > > > >         qemu on the destination
> > > > >               |
> > > > >               V
> > > > >         open(/dev/umem)
> > > > >               |
> > > > >               V
> > > > >         UMEM_DEV_CREATE_UMEM
> > > > >               |
> > > > >               V
> > > > >         Here we have two file descriptors to
> > > > >         umem device and shmem file
> > > > >               |
> > > > >               |                                  umemd
> > > > >               |                                  daemon on the destination
> > > > >               |
> > > > >               V    create pipe to communicate
> > > > >         fork()---------------------------------------,
> > > > >               |                                      |
> > > > >               V                                      |
> > > > >         close(socket)                                V
> > > > >         close(shmem)                              mmap(shmem file)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         mmap(umem device) for guest RAM           close(shmem file)
> > > > >               |                                      |
> > > > >         close(umem device)                           |
> > > > >               |                                      |
> > > > >               V                                      |
> > > > >         wait for ready from daemon <----pipe-----send ready message
> > > > >               |                                      |
> > > > >               |                                  Here the daemon takes over
> > > > >         send ok------------pipe--------------->  ownership of the socket
> > > > >               |                                  to the source
> > > > >               V                                      |
> > > > >         entering post copy stage                     |
> > > > >         start guest execution                        |
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         access guest RAM                          UMEM_GET_PAGE_REQUEST
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         page fault ------------------------------> page offset is returned
> > > > >         block                                        |
> > > > >                                                      V
> > > > >                                                   pull page from the source
> > > > >                                                   write the page contents
> > > > >                                                   to the shmem.
> > > > >                                                      |
> > > > >                                                      V
> > > > >         unblock     <-----------------------------UMEM_MARK_PAGE_CACHED
> > > > >         the fault handler returns the page
> > > > >         page fault is resolved
> > > > >               |
> > > > >               |                                   pages can be sent
> > > > >               |                                   in the background
> > > > >               |                                      |
> > > > >               |                                      V
> > > > >               |                                   UMEM_MARK_PAGE_CACHED
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         The specified pages<-----pipe------------request to touch pages
> > > > >         are made present by                          |
> > > > >         touching guest RAM.                          |
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >              reply-------------pipe-------------> release the cached page
> > > > >               |                                   madvise(MADV_REMOVE)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >  
> > > > >                  all the pages are pulled from the source
> > > > >  
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         the vma becomes anonymous <----------------UMEM_MAKE_VMA_ANONYMOUS
> > > > >        (note: I'm not sure if this can be implemented or not)
> > > > >               |                                      |
> > > > >               V                                      V
> > > > >         migration completes                        exit()
> > > > >  
> > > > >  
> > > > >  
> > > > > Isaku Yamahata (21):
> > > > >   arch_init: export sort_ram_list() and ram_save_block()
> > > > >   arch_init: export RAM_SAVE_xxx flags for postcopy
> > > > >   arch_init/ram_save: introduce constant for ram save version = 4
> > > > >   arch_init: refactor host_from_stream_offset()
> > > > >   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
> > > > >   arch_init: refactor ram_save_block()
> > > > >   arch_init/ram_save_live: factor out ram_save_limit
> > > > >   arch_init/ram_load: refactor ram_load
> > > > >   exec.c: factor out qemu_get_ram_ptr()
> > > > >   exec.c: export last_ram_offset()
> > > > >   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
> > > > >   savevm: qemu_pending_size() to return pending buffered size
> > > > >   savevm, buffered_file: introduce method to drain buffer of buffered
> > > > >     file
> > > > >   migration: export migrate_fd_completed() and migrate_fd_cleanup()
> > > > >   migration: factor out parameters into MigrationParams
> > > > >   umem.h: import Linux umem.h
> > > > >   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
> > > > >   configure: add CONFIG_POSTCOPY option
> > > > >   postcopy: introduce -postcopy and -postcopy-flags option
> > > > >   postcopy outgoing: add -p and -n option to migrate command
> > > > >   postcopy: implement postcopy livemigration
> > > > >  
> > > > >  Makefile.target                 |    4 +
> > > > >  arch_init.c                     |  260 ++++---
> > > > >  arch_init.h                     |   20 +
> > > > >  block-migration.c               |    8 +-
> > > > >  buffered_file.c                 |   20 +-
> > > > >  buffered_file.h                 |    1 +
> > > > >  configure                       |   12 +
> > > > >  cpu-all.h                       |    9 +
> > > > >  exec-obsolete.h                 |    1 +
> > > > >  exec.c                          |   75 +-
> > > > >  hmp-commands.hx                 |   12 +-
> > > > >  hw/hw.h                         |    7 +-
> > > > >  linux-headers/linux/umem.h      |   83 ++
> > > > >  migration-exec.c                |    8 +
> > > > >  migration-fd.c                  |   30 +
> > > > >  migration-postcopy-stub.c       |   77 ++
> > > > >  migration-postcopy.c            | 1891 +++++++++++++++++++++++++++++++++++++++
> > > > >  migration-tcp.c                 |   37 +-
> > > > >  migration-unix.c                |   32 +-
> > > > >  migration.c                     |   53 +-
> > > > >  migration.h                     |   49 +-
> > > > >  qemu-common.h                   |    2 +
> > > > >  qemu-options.hx                 |   25 +
> > > > >  qmp-commands.hx                 |   10 +-
> > > > >  savevm.c                        |   31 +-
> > > > >  scripts/update-linux-headers.sh |    2 +-
> > > > >  sysemu.h                        |    4 +-
> > > > >  umem.c                          |  379 ++++++++
> > > > >  umem.h                          |  105 +++
> > > > >  vl.c                            |   20 +-
> > > > >  30 files changed, 3086 insertions(+), 181 deletions(-)
> > > > >  create mode 100644 linux-headers/linux/umem.h
> > > > >  create mode 100644 migration-postcopy-stub.c
> > > > >  create mode 100644 migration-postcopy.c
> > > > >  create mode 100644 umem.c
> > > > >  create mode 100644 umem.h
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > > -- 
> > > > yamahata
> > > >  
> > > >  

end of thread, other threads:[~2012-03-13  3:21 UTC | newest]

Thread overview: 88+ messages
-- links below jump to the message on this page --
2011-12-29  1:25 [PATCH 00/21][RFC] postcopy live migration Isaku Yamahata
2011-12-29  1:25 ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 01/21] arch_init: export sort_ram_list() and ram_save_block() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 02/21] arch_init: export RAM_SAVE_xxx flags for postcopy Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 03/21] arch_init/ram_save: introduce constant for ram save version = 4 Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 04/21] arch_init: refactor host_from_stream_offset() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 05/21] arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 06/21] arch_init: refactor ram_save_block() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 07/21] arch_init/ram_save_live: factor out ram_save_limit Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 08/21] arch_init/ram_load: refactor ram_load Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 09/21] exec.c: factor out qemu_get_ram_ptr() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 10/21] exec.c: export last_ram_offset() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 11/21] savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 12/21] savevm: qemu_pending_size() to return pending buffered size Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 13/21] savevm, buffered_file: introduce method to drain buffer of buffered file Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 14/21] migration: export migrate_fd_completed() and migrate_fd_cleanup() Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 15/21] migration: factor out parameters into MigrationParams Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 16/21] umem.h: import Linux umem.h Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 17/21] update-linux-headers.sh: teach umem.h to update-linux-headers.sh Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 18/21] configure: add CONFIG_POSTCOPY option Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 19/21] postcopy: introduce -postcopy and -postcopy-flags option Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:25 ` [PATCH 20/21] postcopy outgoing: add -p and -n option to migrate command Isaku Yamahata
2011-12-29  1:25   ` [Qemu-devel] " Isaku Yamahata
2011-12-29  1:26 ` [PATCH 21/21] postcopy: implement postcopy livemigration Isaku Yamahata
2011-12-29  1:26   ` [Qemu-devel] " Isaku Yamahata
2011-12-29 15:51   ` Orit Wasserman
2011-12-29 15:51     ` Orit Wasserman
2012-01-04  3:34     ` Isaku Yamahata
2012-01-04  3:34       ` [Qemu-devel] " Isaku Yamahata
2011-12-29 16:06   ` Avi Kivity
2011-12-29 16:06     ` [Qemu-devel] " Avi Kivity
2012-01-04  3:29     ` Isaku Yamahata
2012-01-04  3:29       ` [Qemu-devel] " Isaku Yamahata
2012-01-12 14:15       ` Avi Kivity
2012-01-12 14:15         ` [Qemu-devel] " Avi Kivity
2011-12-29 22:39 ` [PATCH 00/21][RFC] postcopy live migration Anthony Liguori
2011-12-29 22:39   ` [Qemu-devel] " Anthony Liguori
2012-01-01  9:43   ` Orit Wasserman
2012-01-01  9:43     ` [Qemu-devel] " Orit Wasserman
2012-01-01 16:27     ` Stefan Hajnoczi
2012-01-01 16:27       ` Stefan Hajnoczi
2012-01-02  9:28       ` Dor Laor
2012-01-02  9:28         ` Dor Laor
2012-01-02 17:22         ` Stefan Hajnoczi
2012-01-02 17:22           ` [Qemu-devel] " Stefan Hajnoczi
2012-01-01  9:52   ` Dor Laor
2012-01-01  9:52     ` [Qemu-devel] " Dor Laor
2012-01-04  1:30     ` Takuya Yoshikawa
2012-01-04  1:30       ` [Qemu-devel] " Takuya Yoshikawa
2012-01-04  3:48     ` Michael Roth
2012-01-04  3:48       ` [Qemu-devel] " Michael Roth
2012-01-04  3:51   ` Isaku Yamahata
2012-01-04  3:51     ` Isaku Yamahata
     [not found] ` <BLU0-SMTP161AC380D472854F48E33A5BC9A0@phx.gbl>
2012-01-11  2:45   ` Re: " Isaku Yamahata
2012-01-11  2:45     ` [Qemu-devel] " Isaku Yamahata
2012-01-12  8:29     ` thfbjyddx
2012-01-12  8:29       ` [Qemu-devel] " thfbjyddx
2012-01-12  8:54       ` Re: [PATCH 00/21][RFC] postcopy live migration Isaku Yamahata
2012-01-12  8:54         ` [Qemu-devel] " Isaku Yamahata
2012-01-12 13:26         ` thfbjyddx
2012-01-12 13:26           ` [Qemu-devel] " thfbjyddx
2012-01-16  6:51           ` Isaku Yamahata
2012-01-16  6:51             ` [Qemu-devel] " Isaku Yamahata
2012-01-16 10:17             ` Isaku Yamahata
2012-01-16 10:17               ` [Qemu-devel] " Isaku Yamahata
2012-03-12  8:36               ` thfbjyddx
2012-03-12  8:36                 ` [Qemu-devel] " thfbjyddx
2012-03-13  3:21                 ` Isaku Yamahata
2012-03-13  3:21                   ` [Qemu-devel] " Isaku Yamahata
