* RFC: QEMU bumping memory limit and domain restore
@ 2015-06-02 14:05 Wei Liu
  2015-06-02 14:08 ` Wei Liu
  2015-06-03 13:22 ` George Dunlap
  0 siblings, 2 replies; 18+ messages in thread
From: Wei Liu @ 2015-06-02 14:05 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Stefano Stabellini, Ian Jackson, dslutz,
	andrew, yanghy

Previous discussion at [0].

For the benefit of discussion, we refer to max_memkb inside the
hypervisor as hv_max_memkb (name subject to improvement). That's the
maximum amount of memory a domain can use.

Libxl doesn't know what hv_max_memkb a domain needs prior to QEMU
start-up, because of optional ROMs etc.

Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
there is no mechanism to communicate between QEMU and libxl. This is an
area that needs improvement; we've encountered problems here before.

QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
pages are only accounted for in the hypervisor. During migration, libxl
(currently) doesn't extract that value from the hypervisor.
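
For illustration, a minimal sketch of the read-modify-write pattern
involved (this is not the actual QEMU code; the helper name and error
handling are simplified):

    /* Illustrative sketch only -- not QEMU's real code.  Read the
     * current limit as the hypervisor sees it, then raise it by
     * "extra_pages" 4KiB pages. */
    #include <xenctrl.h>

    static int bump_maxmem_by_pages(xc_interface *xch, uint32_t domid,
                                    unsigned long extra_pages)
    {
        xc_dominfo_t info;

        if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
            info.domid != domid)
            return -1;

        return xc_domain_setmaxmem(xch, domid,
                                   info.max_memkb + extra_pages * 4);
    }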

So now the problem is on the remote end:

1. Libxl indicates the domain needs X pages.
2. The domain actually needs X + N pages.
3. The remote end tries to write N more pages and fails.

This behaviour currently doesn't affect normal migration (where you
transfer the libxl JSON to the remote end, construct a domain, then
start QEMU) because QEMU won't bump hv_max_memkb again. This is by
design and reflected in QEMU code.

This behaviour affects COLO and becomes a bug in that case, because the
secondary VM's QEMU doesn't go through the same start-of-day
initialisation (Hongyang, correct me if I'm wrong), i.e. there is no
bumping of hv_max_memkb inside QEMU.

Andrew plans to embed JSON inside migration v2 and COLO is based on
migration v2. The bug is fixed if JSON is correct in the first place.

As COLO is not yet upstream, this bug is not a blocker for 4.6. But it
should be fixed for the benefit of COLO.

So here is a proof of concept patch to record and honour that value
during migration.  A new field is added in the IDL. Note that we don't
provide an xl-level config option for it, and we mandate that it be the
default value during domain creation. This is to prevent libxl users
from setting it, so as to avoid unforeseen repercussions.

This patch is compile-tested only. If we agree this is the way to go I
will test and submit a proper patch.

Wei.

[0] <1428941353-18673-1-git-send-email-dslutz@verizon.com>

---8<---
From ab9dc179ea4ee26eb88f61f8dad36dd01b63bb6b Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Tue, 2 Jun 2015 14:53:20 +0100
Subject: [PATCH] libxl: record and honour hv_max_memkb

The new field hv_max_memkb in the IDL is used to record "max_memkb"
inside the hypervisor. That reflects the maximum memory a guest can ask
for.

This field is mandated to be the default value during guest creation to
avoid unforeseen repercussions. It's only honoured when restoring a
guest.

(XXX compiled test only at this stage)

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxl/libxl.c         | 17 +++++++++--------
 tools/libxl/libxl_create.c  |  6 ++++++
 tools/libxl/libxl_dom.c     |  9 +++++++--
 tools/libxl/libxl_types.idl |  1 +
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 9117b01..72fec8b 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -6614,6 +6614,7 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
     GC_INIT(ctx);
     int rc;
     libxl__domain_userdata_lock *lock = NULL;
+    uint64_t hv_max_memkb;
 
     CTX_LOCK;
 
@@ -6654,6 +6655,7 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
         }
         libxl_uuid_copy(ctx, &d_config->c_info.uuid, &info.uuid);
         libxl_dominfo_dispose(&info);
+        hv_max_memkb = info.max_memkb; /* store and use later */
     }
 
     /* Memory limits:
@@ -6661,17 +6663,15 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
      * Currently there are three memory limits:
      *  1. "target" in xenstore (originally memory= in config file)
      *  2. "static-max" in xenstore (originally maxmem= in config file)
-     *  3. "max_memkb" in hypervisor
-     *
-     * The third one is not visible and currently managed by
-     * toolstack. In order to rebuild a domain we only need to have
-     * "target" and "static-max".
+     *  3. "max_memkb" in hypervisor (corresponds to hv_max_memkb in
+     *     idl, not visible to xl level)
      */
     {
-        uint32_t target_memkb = 0, max_memkb = 0;
+        uint32_t target_memkb = 0, static_max_memkb = 0;
 
         /* "target" */
-        rc = libxl__get_memory_target(gc, domid, &target_memkb, &max_memkb);
+        rc = libxl__get_memory_target(gc, domid, &target_memkb,
+                                      &static_max_memkb);
         if (rc) {
             LOG(ERROR, "fail to get memory target for domain %d", domid);
             goto out;
@@ -6683,7 +6683,8 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
         d_config->b_info.target_memkb = target_memkb +
             d_config->b_info.video_memkb;
 
-        d_config->b_info.max_memkb = max_memkb;
+        d_config->b_info.max_memkb = static_max_memkb;
+        d_config->b_info.hv_max_memkb = hv_max_memkb;
     }
 
     /* Devices: disk, nic, vtpm, pcidev etc. */
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 86384d2..d35a393 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -813,6 +813,12 @@ static void initiate_domain_create(libxl__egc *egc,
 
     domid = 0;
 
+    if (restore_fd <= 0 &&
+        d_config->b_info.hv_max_memkb != LIBXL_MEMKB_DEFAULT) {
+        LOG(ERROR, "Setting hv_max_memkb when creating domain");
+        goto error_out;
+    }
+
     if (d_config->c_info.ssid_label) {
         char *s = d_config->c_info.ssid_label;
         ret = libxl_flask_context_to_sid(ctx, s, strlen(s),
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index a0c9850..5e33611 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -408,8 +408,13 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         }
     }
 
-    if (xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
-        LIBXL_MAXMEM_CONSTANT) < 0) {
+    /* hv_max_memkb is set in restore path, honour it. */
+    if (info->hv_max_memkb == LIBXL_MEMKB_DEFAULT)
+        rc = xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
+                                 LIBXL_MAXMEM_CONSTANT);
+    else
+        rc = xc_domain_setmaxmem(ctx->xch, domid, info->hv_max_memkb);
+    if (rc < 0) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Couldn't set max memory");
         return ERROR_FAIL;
     }
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 23f27d4..3c306c0 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -377,6 +377,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),
     ("target_memkb",    MemKB),
+    ("hv_max_memkb",    MemKB),
     ("video_memkb",     MemKB),
     ("shadow_memkb",    MemKB),
     ("rtc_timeoffset",  uint32),
-- 
1.9.1


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 14:05 RFC: QEMU bumping memory limit and domain restore Wei Liu
@ 2015-06-02 14:08 ` Wei Liu
  2015-06-02 15:49   ` Ian Campbell
  2015-06-03 13:22 ` George Dunlap
  1 sibling, 1 reply; 18+ messages in thread
From: Wei Liu @ 2015-06-02 14:08 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, dslutz, yanghy

I fat-fingered Andrew's email address. Really CC him this time.

On Tue, Jun 02, 2015 at 03:05:07PM +0100, Wei Liu wrote:
> Previous discussion at [0].
> 
> For the benefit of discussion, we refer to max_memkb inside hypervisor
> as hv_max_memkb (name subject to improvement). That's the maximum number
> of memory a domain can use.
> 
> Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
> because of optional ROMs etc.
> 
> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
> there is no mechanism to community between QEMU and libxl. This is an
> area that needs improvement, we've encountered problems in this area
> before.
> 
> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
> pages are only accounted in hypervisor. During migration, libxl
> (currently) doesn't extract that value from hypervisor.
> 
> So now the problem is on the remote end:
> 
> 1. Libxl indicates domain needs X pages.
> 2. Domain actually needs X + N pages.
> 3. Remote end tries to write N more pages and fail.
> 
> This behaviour currently doesn't affect normal migration (that you
> transfer libxl JSON to remote, construct a domain, then start QEMU)
> because QEMU won't bump hv_max_memkb again. This is by design and
> reflected in QEMU code.
> 
> This behaviour affects COLO and becomes a bug in that case, because
> secondary VM's QEMU doesn't go through the same start-of-day
> initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
> hv_max_memkb inside QEMU.
> 
> Andrew plans to embed JSON inside migration v2 and COLO is based on
> migration v2. The bug is fixed if JSON is correct in the first place.
> 
> As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
> it should be fixed for the benefit of COLO.
> 
> So here is a proof of concept patch to record and honour that value
> during migration.  A new field is added in IDL. Note that we don't
> provide xl level config option for it and mandate it to be default value
> during domain creation. This is to prevent libxl user from using it to
> avoid unforeseen repercussions.
> 
> This patch is compiled test only. If we agree this is the way to go I
> will test and submit a proper patch.
> 
> Wei.
> 
> [0] <1428941353-18673-1-git-send-email-dslutz@verizon.com>
> 
> ---8<---
> >From ab9dc179ea4ee26eb88f61f8dad36dd01b63bb6b Mon Sep 17 00:00:00 2001
> From: Wei Liu <wei.liu2@citrix.com>
> Date: Tue, 2 Jun 2015 14:53:20 +0100
> Subject: [PATCH] libxl: record and honour hv_max_memkb
> 
> The new field hv_max_memkb in IDL is used to record "max_memkb" inside
> hypervisor. That reflects the maximum memory a guest can ask for.
> 
> This field is mandated to be default value during guest creation to
> avoid unforeseen repercussions. It's only honour when restoring a guest.
> 
> (XXX compiled test only at this stage)
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> ---
>  tools/libxl/libxl.c         | 17 +++++++++--------
>  tools/libxl/libxl_create.c  |  6 ++++++
>  tools/libxl/libxl_dom.c     |  9 +++++++--
>  tools/libxl/libxl_types.idl |  1 +
>  4 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 9117b01..72fec8b 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -6614,6 +6614,7 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
>      GC_INIT(ctx);
>      int rc;
>      libxl__domain_userdata_lock *lock = NULL;
> +    uint64_t hv_max_memkb;
>  
>      CTX_LOCK;
>  
> @@ -6654,6 +6655,7 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
>          }
>          libxl_uuid_copy(ctx, &d_config->c_info.uuid, &info.uuid);
>          libxl_dominfo_dispose(&info);
> +        hv_max_memkb = info.max_memkb; /* store and use later */
>      }
>  
>      /* Memory limits:
> @@ -6661,17 +6663,15 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
>       * Currently there are three memory limits:
>       *  1. "target" in xenstore (originally memory= in config file)
>       *  2. "static-max" in xenstore (originally maxmem= in config file)
> -     *  3. "max_memkb" in hypervisor
> -     *
> -     * The third one is not visible and currently managed by
> -     * toolstack. In order to rebuild a domain we only need to have
> -     * "target" and "static-max".
> +     *  3. "max_memkb" in hypervisor (corresponds to hv_max_memkb in
> +     *     idl, not visible to xl level)
>       */
>      {
> -        uint32_t target_memkb = 0, max_memkb = 0;
> +        uint32_t target_memkb = 0, static_max_memkb = 0;
>  
>          /* "target" */
> -        rc = libxl__get_memory_target(gc, domid, &target_memkb, &max_memkb);
> +        rc = libxl__get_memory_target(gc, domid, &target_memkb,
> +                                      &static_max_memkb);
>          if (rc) {
>              LOG(ERROR, "fail to get memory target for domain %d", domid);
>              goto out;
> @@ -6683,7 +6683,8 @@ int libxl_retrieve_domain_configuration(libxl_ctx *ctx, uint32_t domid,
>          d_config->b_info.target_memkb = target_memkb +
>              d_config->b_info.video_memkb;
>  
> -        d_config->b_info.max_memkb = max_memkb;
> +        d_config->b_info.max_memkb = static_max_memkb;
> +        d_config->b_info.hv_max_memkb = hv_max_memkb;
>      }
>  
>      /* Devices: disk, nic, vtpm, pcidev etc. */
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 86384d2..d35a393 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -813,6 +813,12 @@ static void initiate_domain_create(libxl__egc *egc,
>  
>      domid = 0;
>  
> +    if (restore_fd <= 0 &&
> +        d_config->b_info.hv_max_memkb != LIBXL_MEMKB_DEFAULT) {
> +        LOG(ERROR, "Setting hv_max_memkb when creating domain");
> +        goto error_out;
> +    }
> +
>      if (d_config->c_info.ssid_label) {
>          char *s = d_config->c_info.ssid_label;
>          ret = libxl_flask_context_to_sid(ctx, s, strlen(s),
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a0c9850..5e33611 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -408,8 +408,13 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>          }
>      }
>  
> -    if (xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
> -        LIBXL_MAXMEM_CONSTANT) < 0) {
> +    /* hv_max_memkb is set in restore path, honour it. */
> +    if (info->hv_max_memkb == LIBXL_MEMKB_DEFAULT)
> +        rc = xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
> +                                 LIBXL_MAXMEM_CONSTANT);
> +    else
> +        rc = xc_domain_setmaxmem(ctx->xch, domid, info->hv_max_memkb);
> +    if (rc < 0) {
>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Couldn't set max memory");
>          return ERROR_FAIL;
>      }
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 23f27d4..3c306c0 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -377,6 +377,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>      ("tsc_mode",        libxl_tsc_mode),
>      ("max_memkb",       MemKB),
>      ("target_memkb",    MemKB),
> +    ("hv_max_memkb",    MemKB),
>      ("video_memkb",     MemKB),
>      ("shadow_memkb",    MemKB),
>      ("rtc_timeoffset",  uint32),
> -- 
> 1.9.1


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 14:08 ` Wei Liu
@ 2015-06-02 15:49   ` Ian Campbell
  2015-06-02 16:11     ` Andrew Cooper
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ian Campbell @ 2015-06-02 15:49 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel, yanghy

On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
[...]
> > So here is a proof of concept patch to record and honour that value
> > during migration.  A new field is added in IDL. Note that we don't
> > provide xl level config option for it and mandate it to be default value
> > during domain creation. This is to prevent libxl user from using it to
> > avoid unforeseen repercussions.
> [...]
> > This field is mandated to be default value during guest creation to
> > avoid unforeseen repercussions. It's only honour when restoring a guest.

IMHO this means that the libxl API/IDL is the wrong place for this
value. Only user and/or application serviceable parts belong in the API.

So while I agree that this value needs to be communicated across a
migration, the JSON blob is not the right mechanism for doing so. IOW if
you go down this general path I think you need a new
field/record/whatever in the migration protocol at some layer or other
(if not libxc then at the libxl layer).

To my mind this "actual state" vs "user configured state" is more akin
to the sorts of things which are in the hypervisor save blob or something
like that (nb: This is not a suggestion that it should go there).

IIRC Don also outlined another case, which is
    xl create -p
    xl migrate
    xl unpause

Which might need more thought if any bumping can happen after the
migrate i.e. on unpause?


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 15:49   ` Ian Campbell
@ 2015-06-02 16:11     ` Andrew Cooper
  2015-06-02 17:40       ` Wei Liu
  2015-06-02 17:32     ` Wei Liu
  2015-06-03  1:05     ` Yang Hongyang
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2015-06-02 16:11 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: xen-devel, yanghy, Ian Jackson, dslutz, Stefano Stabellini

On 02/06/15 16:49, Ian Campbell wrote:
> On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
> [...]
>>> So here is a proof of concept patch to record and honour that value
>>> during migration.  A new field is added in IDL. Note that we don't
>>> provide xl level config option for it and mandate it to be default value
>>> during domain creation. This is to prevent libxl user from using it to
>>> avoid unforeseen repercussions.
>> [...]
>>> This field is mandated to be default value during guest creation to
>>> avoid unforeseen repercussions. It's only honour when restoring a guest.
> IMHO this means that the libxl API/IDL is the wrong place for this
> value. Only user and/or application serviceable parts belong in the API.
>
> So while I agree that this value need to be communicated across a
> migration, the JSON blob is not the right mechanism for doing so. IOW if
> you go down this general path I think you need a new
> field/record/whatever in the migration protocol at some layer or other
> (if not libxc then at the libxl layer).
>
> To my mind this "actual state" vs "user configured state" is more akin
> to the sorts of things which is in the hypervisor save blob or something
> like that (nb: This is not a suggestion that it should go there).
>
> IIRC Don also outlined another case, which is
>     xl create -p
>     xl migrate
>     xl unpause
>
> Which might need more thought if any bumping can happen after the
> migrate i.e. on unpause?
>
>

The problem is qemu using set_max_mem.  It should never have done so.

Nothing other than libxl should be using such hypercalls, at which point
libxl's idea of guest memory is accurate and the bug ceases to exist.

~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 15:49   ` Ian Campbell
  2015-06-02 16:11     ` Andrew Cooper
@ 2015-06-02 17:32     ` Wei Liu
  2015-06-03  1:05     ` Yang Hongyang
  2 siblings, 0 replies; 18+ messages in thread
From: Wei Liu @ 2015-06-02 17:32 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel, yanghy

On Tue, Jun 02, 2015 at 04:49:09PM +0100, Ian Campbell wrote:
> On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
> [...]
> > > So here is a proof of concept patch to record and honour that value
> > > during migration.  A new field is added in IDL. Note that we don't
> > > provide xl level config option for it and mandate it to be default value
> > > during domain creation. This is to prevent libxl user from using it to
> > > avoid unforeseen repercussions.
> > [...]
> > > This field is mandated to be default value during guest creation to
> > > avoid unforeseen repercussions. It's only honour when restoring a guest.
> 
> IMHO this means that the libxl API/IDL is the wrong place for this
> value. Only user and/or application serviceable parts belong in the API.
> 

I feared the same when I wrote the patch.

> So while I agree that this value need to be communicated across a
> migration, the JSON blob is not the right mechanism for doing so. IOW if
> you go down this general path I think you need a new
> field/record/whatever in the migration protocol at some layer or other
> (if not libxc then at the libxl layer).
> 

There isn't a libxl layer for migration at the moment. The stream is xl
+ libxc. Maybe we should wait till migration v2 is in place.

> To my mind this "actual state" vs "user configured state" is more akin
> to the sorts of things which is in the hypervisor save blob or something
> like that (nb: This is not a suggestion that it should go there).
> 
> IIRC Don also outlined another case, which is
>     xl create -p
>     xl migrate
>     xl unpause
> 
> Which might need more thought if any bumping can happen after the
> migrate i.e. on unpause?
> 

I just tried. It failed at "xl migrate".

Wei.


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 16:11     ` Andrew Cooper
@ 2015-06-02 17:40       ` Wei Liu
  2015-06-04 15:28         ` Wei Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Liu @ 2015-06-02 17:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Ian Jackson, dslutz,
	xen-devel, yanghy

On Tue, Jun 02, 2015 at 05:11:02PM +0100, Andrew Cooper wrote:
> On 02/06/15 16:49, Ian Campbell wrote:
> > On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
> > [...]
> >>> So here is a proof of concept patch to record and honour that value
> >>> during migration.  A new field is added in IDL. Note that we don't
> >>> provide xl level config option for it and mandate it to be default value
> >>> during domain creation. This is to prevent libxl user from using it to
> >>> avoid unforeseen repercussions.
> >> [...]
> >>> This field is mandated to be default value during guest creation to
> >>> avoid unforeseen repercussions. It's only honour when restoring a guest.
> > IMHO this means that the libxl API/IDL is the wrong place for this
> > value. Only user and/or application serviceable parts belong in the API.
> >
> > So while I agree that this value need to be communicated across a
> > migration, the JSON blob is not the right mechanism for doing so. IOW if
> > you go down this general path I think you need a new
> > field/record/whatever in the migration protocol at some layer or other
> > (if not libxc then at the libxl layer).
> >
> > To my mind this "actual state" vs "user configured state" is more akin
> > to the sorts of things which is in the hypervisor save blob or something
> > like that (nb: This is not a suggestion that it should go there).
> >
> > IIRC Don also outlined another case, which is
> >     xl create -p
> >     xl migrate
> >     xl unpause
> >
> > Which might need more thought if any bumping can happen after the
> > migrate i.e. on unpause?
> >
> >
> 
> The problem is qemu using set_max_mem.  It should never have done so.
> 
> Nothing other than libxl should be using such hypercalls, at which point
> libxls idea of guest memory is accurate and the bug ceases to exist.
> 

That would be an ideal solution, but it's not going to materialise
any time soon because:

1. Libxl cannot know how much more memory the guest needs prior to QEMU
   start-up.
2. QEMU cannot communicate back the value to libxl because
 2.1 there is no communication channel;
 2.2 there is no libxld to update the config (maybe we can work around
     this by waiting for QEMU during guest creation).

While we work towards that end, a short term solution might be:

1. Make QEMU not call xc_domain_setmaxmem.
2. Provide a field in the IDL and a config option in xl to indicate
   how much more memory is needed. Users can fill in that option as
   they see fit (a rough sketch follows below).

After we get to the ideal solution we can then deprecate that field /
option.
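
As a rough sketch of option 2 (not an implementation; "device_memkb" is
an invented IDL field name), the xc_domain_setmaxmem call in
libxl__build_pre shown in the patch above might then account for the
extra memory itself, so QEMU would no longer need to touch the limit:

    /* Sketch only, assuming a hypothetical b_info field "device_memkb"
     * holding the extra memory the device model is expected to need. */
    uint64_t maxmem_kb = info->target_memkb + LIBXL_MAXMEM_CONSTANT;

    if (info->device_memkb != LIBXL_MEMKB_DEFAULT)
        maxmem_kb += info->device_memkb;

    if (xc_domain_setmaxmem(ctx->xch, domid, maxmem_kb) < 0) {
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Couldn't set max memory");
        return ERROR_FAIL;
    }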

Wei.

> ~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 15:49   ` Ian Campbell
  2015-06-02 16:11     ` Andrew Cooper
  2015-06-02 17:32     ` Wei Liu
@ 2015-06-03  1:05     ` Yang Hongyang
  2 siblings, 0 replies; 18+ messages in thread
From: Yang Hongyang @ 2015-06-03  1:05 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: xen-devel, Andrew Cooper, Ian Jackson, dslutz, Stefano Stabellini



On 06/02/2015 11:49 PM, Ian Campbell wrote:
> On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
> [...]
>>> So here is a proof of concept patch to record and honour that value
>>> during migration.  A new field is added in IDL. Note that we don't
>>> provide xl level config option for it and mandate it to be default value
>>> during domain creation. This is to prevent libxl user from using it to
>>> avoid unforeseen repercussions.
>> [...]
>>> This field is mandated to be default value during guest creation to
>>> avoid unforeseen repercussions. It's only honour when restoring a guest.
>
> IMHO this means that the libxl API/IDL is the wrong place for this
> value. Only user and/or application serviceable parts belong in the API.
>
> So while I agree that this value need to be communicated across a
> migration, the JSON blob is not the right mechanism for doing so. IOW if
> you go down this general path I think you need a new
> field/record/whatever in the migration protocol at some layer or other
> (if not libxc then at the libxl layer).
>
> To my mind this "actual state" vs "user configured state" is more akin
> to the sorts of things which is in the hypervisor save blob or something
> like that (nb: This is not a suggestion that it should go there).
>
> IIRC Don also outlined another case, which is
>      xl create -p
>      xl migrate
>      xl unpause

Actually this is what COLO does. On the primary, we must start with -p
and then migrate to ensure the disk is consistent.

>
> Which might need more thought if any bumping can happen after the
> migrate i.e. on unpause?
>
>
> .
>

-- 
Thanks,
Yang.


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 14:05 RFC: QEMU bumping memory limit and domain restore Wei Liu
  2015-06-02 14:08 ` Wei Liu
@ 2015-06-03 13:22 ` George Dunlap
  2015-06-03 13:32   ` Andrew Cooper
  2015-06-04  9:14   ` Wei Liu
  1 sibling, 2 replies; 18+ messages in thread
From: George Dunlap @ 2015-06-03 13:22 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Ian Jackson, Don Slutz, andrew,
	xen-devel, Yang Hongyang

On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> Previous discussion at [0].
>
> For the benefit of discussion, we refer to max_memkb inside hypervisor
> as hv_max_memkb (name subject to improvement). That's the maximum number
> of memory a domain can use.

Why don't we try to use "memory" for virtual RAM that we report to the
guest, and "pages" for what exists inside the hypervisor?  "Pages" is
the term the hypervisor itself uses internally (i.e., set_max_mem()
actually changes a domain's max_pages value).

So in this case both guest memory and option roms are created using
hypervisor pages.

> Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
> because of optional ROMs etc.

So a translation of this using "memory/pages" terminology would be:

QEMU may need extra pages from Xen to implement option ROMS, and so at
the moment it calls set_max_mem() to increase max_pages so that it can
allocate more pages to the guest.  libxl doesn't know what max_pages a
domain needs prior to qemu start-up.

> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
> there is no mechanism to community between QEMU and libxl. This is an
> area that needs improvement, we've encountered problems in this area
> before.

[translating]
Libxl doesn't know max_pages  even after qemu start-up, because there
is no mechanism to communicate between qemu and libxl.

> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
> pages are only accounted in hypervisor. During migration, libxl
> (currently) doesn't extract that value from hypervisor.

[translating]
qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
Those pages are only accounted for in the hypervisor.  libxl
(currently) does not extract that value from the hypervisor.

> So now the problem is on the remote end:
>
> 1. Libxl indicates domain needs X pages.
> 2. Domain actually needs X + N pages.
> 3. Remote end tries to write N more pages and fail.
>
> This behaviour currently doesn't affect normal migration (that you
> transfer libxl JSON to remote, construct a domain, then start QEMU)
> because QEMU won't bump hv_max_memkb again. This is by design and
> reflected in QEMU code.

I don't understand this paragraph -- does the remote domain actually
need X+N pages or not?  If it does, in what way does this behavior
"not affect normal migration"?

> This behaviour affects COLO and becomes a bug in that case, because
> secondary VM's QEMU doesn't go through the same start-of-day
> initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
> hv_max_memkb inside QEMU.
>
> Andrew plans to embed JSON inside migration v2 and COLO is based on
> migration v2. The bug is fixed if JSON is correct in the first place.
>
> As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
> it should be fixed for the benefit of COLO.
>
> So here is a proof of concept patch to record and honour that value
> during migration.  A new field is added in IDL. Note that we don't
> provide xl level config option for it and mandate it to be default value
> during domain creation. This is to prevent libxl user from using it to
> avoid unforeseen repercussions.
>
> This patch is compiled test only. If we agree this is the way to go I
> will test and submit a proper patch.

Reading max_pages from Xen and setting it on the far side seems like a
reasonable option.  Is there a reason we can't add a magic XC_SAVE_ID
for v1, like we do for other parameters?
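
For illustration only (the chunk ID value, variable names and placement
are invented here; this is a sketch of the suggestion, not a tested
patch), such a v1 record could look roughly like the other metadata
chunks:

    /* Hypothetical sketch; XC_SAVE_ID_HVM_MAXMEM is not a real ID. */
    #define XC_SAVE_ID_HVM_MAXMEM  -20  /* invented value */

    /* Save side (xc_domain_save): emit the chunk alongside the others. */
    int chunk = XC_SAVE_ID_HVM_MAXMEM;
    uint64_t max_memkb = info.max_memkb;  /* from xc_domain_getinfo() */
    if (write_exact(io_fd, &chunk, sizeof(chunk)) ||
        write_exact(io_fd, &max_memkb, sizeof(max_memkb)))
        goto out;

    /* Restore side (xc_domain_restore), in the chunk-ID switch: */
    case XC_SAVE_ID_HVM_MAXMEM:
        if (read_exact(io_fd, &max_memkb, sizeof(max_memkb)))
            goto out;
        if (xc_domain_setmaxmem(xch, dom, max_memkb) < 0)
            goto out;
        break;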

 -George


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-03 13:22 ` George Dunlap
@ 2015-06-03 13:32   ` Andrew Cooper
  2015-06-03 13:53     ` George Dunlap
  2015-06-04  9:14   ` Wei Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2015-06-03 13:32 UTC (permalink / raw)
  To: George Dunlap, Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Ian Jackson, Don Slutz, andrew,
	xen-devel, Yang Hongyang

On 03/06/15 14:22, George Dunlap wrote:
> On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@citrix.com> wrote:
>> Previous discussion at [0].
>>
>> For the benefit of discussion, we refer to max_memkb inside hypervisor
>> as hv_max_memkb (name subject to improvement). That's the maximum number
>> of memory a domain can use.
> Why don't we try to use "memory" for virtual RAM that we report to the
> guest, and "pages" for what exists inside the hypervisor?  "Pages" is
> the term the hypervisor itself uses internally (i.e., set_max_mem()
> actually changes a domain's max_pages value).
>
> So in this case both guest memory and option roms are created using
> hypervisor pages.
>
>> Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
>> because of optional ROMs etc.
> So a translation of this using "memory/pages" terminology would be:
>
> QEMU may need extra pages from Xen to implement option ROMS, and so at
> the moment it calls set_max_mem() to increase max_pages so that it can
> allocate more pages to the guest.  libxl doesn't know what max_pages a
> domain needs prior to qemu start-up.
>
>> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
>> there is no mechanism to community between QEMU and libxl. This is an
>> area that needs improvement, we've encountered problems in this area
>> before.
> [translating]
> Libxl doesn't know max_pages  even after qemu start-up, because there
> is no mechanism to communicate between qemu and libxl.
>
>> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
>> pages are only accounted in hypervisor. During migration, libxl
>> (currently) doesn't extract that value from hypervisor.
> [translating]
> qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
> Those pages are only accounted for in the hypervisor.  libxl
> (currently) does not extract that value from the hypervisor.
>
>> So now the problem is on the remote end:
>>
>> 1. Libxl indicates domain needs X pages.
>> 2. Domain actually needs X + N pages.
>> 3. Remote end tries to write N more pages and fail.
>>
>> This behaviour currently doesn't affect normal migration (that you
>> transfer libxl JSON to remote, construct a domain, then start QEMU)
>> because QEMU won't bump hv_max_memkb again. This is by design and
>> reflected in QEMU code.
> I don't understand this paragraph -- does the remote domain actually
> need X+N pages or not?  If it does, in what way does this behavior
> "not affect normal migration"?
>
>> This behaviour affects COLO and becomes a bug in that case, because
>> secondary VM's QEMU doesn't go through the same start-of-day
>> initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
>> hv_max_memkb inside QEMU.
>>
>> Andrew plans to embed JSON inside migration v2 and COLO is based on
>> migration v2. The bug is fixed if JSON is correct in the first place.
>>
>> As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
>> it should be fixed for the benefit of COLO.
>>
>> So here is a proof of concept patch to record and honour that value
>> during migration.  A new field is added in IDL. Note that we don't
>> provide xl level config option for it and mandate it to be default value
>> during domain creation. This is to prevent libxl user from using it to
>> avoid unforeseen repercussions.
>>
>> This patch is compiled test only. If we agree this is the way to go I
>> will test and submit a proper patch.
> Reading max_pages from Xen and setting it on the far side seems like a
> reasonable option.

It is the wrong level to fix the bug.  Yes - it will (and does) fix one
manifestation of the bug, but does not solve the problem.

>   Is there a reason we can't add a magic XC_SAVE_ID
> for v1, like we do for other parameters?

Amongst other things, playing with xc_domain_setmaxmem() is liable to
cause a PoD domain to be shot by Xen because the PoD cache was not
adjusted at the same time that maxmem was.

Only libxl is in a position to safely adjust domain memory.

~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-03 13:32   ` Andrew Cooper
@ 2015-06-03 13:53     ` George Dunlap
  2015-06-03 16:11       ` Don Slutz
  0 siblings, 1 reply; 18+ messages in thread
From: George Dunlap @ 2015-06-03 13:53 UTC (permalink / raw)
  To: Andrew Cooper, Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Ian Jackson, Don Slutz, andrew,
	xen-devel, Yang Hongyang

On 06/03/2015 02:32 PM, Andrew Cooper wrote:
> On 03/06/15 14:22, George Dunlap wrote:
>> On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@citrix.com> wrote:
>>> Previous discussion at [0].
>>>
>>> For the benefit of discussion, we refer to max_memkb inside hypervisor
>>> as hv_max_memkb (name subject to improvement). That's the maximum number
>>> of memory a domain can use.
>> Why don't we try to use "memory" for virtual RAM that we report to the
>> guest, and "pages" for what exists inside the hypervisor?  "Pages" is
>> the term the hypervisor itself uses internally (i.e., set_max_mem()
>> actually changes a domain's max_pages value).
>>
>> So in this case both guest memory and option roms are created using
>> hypervisor pages.
>>
>>> Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
>>> because of optional ROMs etc.
>> So a translation of this using "memory/pages" terminology would be:
>>
>> QEMU may need extra pages from Xen to implement option ROMS, and so at
>> the moment it calls set_max_mem() to increase max_pages so that it can
>> allocate more pages to the guest.  libxl doesn't know what max_pages a
>> domain needs prior to qemu start-up.
>>
>>> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
>>> there is no mechanism to community between QEMU and libxl. This is an
>>> area that needs improvement, we've encountered problems in this area
>>> before.
>> [translating]
>> Libxl doesn't know max_pages  even after qemu start-up, because there
>> is no mechanism to communicate between qemu and libxl.
>>
>>> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
>>> pages are only accounted in hypervisor. During migration, libxl
>>> (currently) doesn't extract that value from hypervisor.
>> [translating]
>> qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
>> Those pages are only accounted for in the hypervisor.  libxl
>> (currently) does not extract that value from the hypervisor.
>>
>>> So now the problem is on the remote end:
>>>
>>> 1. Libxl indicates domain needs X pages.
>>> 2. Domain actually needs X + N pages.
>>> 3. Remote end tries to write N more pages and fail.
>>>
>>> This behaviour currently doesn't affect normal migration (that you
>>> transfer libxl JSON to remote, construct a domain, then start QEMU)
>>> because QEMU won't bump hv_max_memkb again. This is by design and
>>> reflected in QEMU code.
>> I don't understand this paragraph -- does the remote domain actually
>> need X+N pages or not?  If it does, in what way does this behavior
>> "not affect normal migration"?
>>
>>> This behaviour affects COLO and becomes a bug in that case, because
>>> secondary VM's QEMU doesn't go through the same start-of-day
>>> initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
>>> hv_max_memkb inside QEMU.
>>>
>>> Andrew plans to embed JSON inside migration v2 and COLO is based on
>>> migration v2. The bug is fixed if JSON is correct in the first place.
>>>
>>> As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
>>> it should be fixed for the benefit of COLO.
>>>
>>> So here is a proof of concept patch to record and honour that value
>>> during migration.  A new field is added in IDL. Note that we don't
>>> provide xl level config option for it and mandate it to be default value
>>> during domain creation. This is to prevent libxl user from using it to
>>> avoid unforeseen repercussions.
>>>
>>> This patch is compiled test only. If we agree this is the way to go I
>>> will test and submit a proper patch.
>> Reading max_pages from Xen and setting it on the far side seems like a
>> reasonable option.
> 
> It is the wrong level to fix the bug.  Yes - it will (and does) fix one
> manifestation of the bug, but does not solve the problem.
> 
>>   Is there a reason we can't add a magic XC_SAVE_ID
>> for v1, like we do for other parameters?
> 
> Amongst other things, playing with xc_domain_setmaxmem() is liable to
> cause a PoD domain to be shot by Xen because the PoD cache was not
> adjusted at the same time that maxmem was.

As far as I can tell that's completely unrelated.

The PoD target needs to be updated when you change the *balloon driver*
target in xenstore.  libxl_domain_setmaxmem() for instance won't update
the PoD target either.
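
A rough sketch of the distinction, using the public libxc/xenstore
calls (exact libxl internals differ; names below are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* Raising the cap alone does not touch the PoD target/cache. */
    static void raise_cap(xc_interface *xch, uint32_t domid, uint64_t max_kb)
    {
        xc_domain_setmaxmem(xch, domid, max_kb);
    }

    /* A balloon-target change is what must also move the PoD target
     * (roughly what libxl's memory-target path does for PoD guests). */
    static void set_balloon_target(xc_interface *xch, struct xs_handle *xsh,
                                   uint32_t domid, uint64_t target_kb)
    {
        char path[64], val[32];

        snprintf(path, sizeof(path), "/local/domain/%u/memory/target", domid);
        snprintf(val, sizeof(val), "%llu", (unsigned long long)target_kb);
        xs_write(xsh, XBT_NULL, path, val, strlen(val));
        xc_domain_set_pod_target(xch, domid, target_kb >> 2 /* KiB->pages */,
                                 NULL, NULL, NULL);
    }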

> Only libxl is in a position to safely adjust domain memory.

I was initially of the same mind as you in this matter.  But I think if
you stop thinking about this as "the domain's memory", and start instead
thinking about it as qemu modifying the state of the machine (by, say,
adding option roms), then I think it can be less strict.  The *memory*
(i.e., what is seen by the guest as virtual RAM) is controlled by libxl,
but the *pages* may be manipulated by others (as they are by the balloon
driver, for instance).

Although actually -- I guess that exposes another issue with this: what
if someone calls setmaxmem in libxl?  libxl doesn't know about the extra
N pages that qemu wanted, so it won't factor that into its setmaxmem
calculation.  In particular, qemu's read-modify-write of setmaxmem isn't
idempotent -- there's almost certainly a race if someone were to call
libxl_domain_setmaxmem at the same time qemu was trying to add a few pages.

In any case, as Wei said, you have the problem that by the time qemu
figures out that it wants some extra pages to implement an option rom,
libxl has already determined that its job was complete and walked off.
Changing that requires a more architectural change.

Perhaps joining this with the security deprivileging thing would make
some sense?  I.e., have qemu do no hypercalls at all, but have them done
instead by a "babysitter" process from libxl, which could then update
the domain config state as well.

 -George


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-03 13:53     ` George Dunlap
@ 2015-06-03 16:11       ` Don Slutz
  0 siblings, 0 replies; 18+ messages in thread
From: Don Slutz @ 2015-06-03 16:11 UTC (permalink / raw)
  To: George Dunlap, Andrew Cooper, Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Ian Jackson, Don Slutz, andrew,
	xen-devel, Yang Hongyang

On 06/03/15 09:53, George Dunlap wrote:
> On 06/03/2015 02:32 PM, Andrew Cooper wrote:
>> On 03/06/15 14:22, George Dunlap wrote:
>>> On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@citrix.com> wrote:
>>>> Previous discussion at [0].
>>>>
>>>> For the benefit of discussion, we refer to max_memkb inside hypervisor
>>>> as hv_max_memkb (name subject to improvement). That's the maximum number
>>>> of memory a domain can use.
>>> Why don't we try to use "memory" for virtual RAM that we report to the
>>> guest, and "pages" for what exists inside the hypervisor?  "Pages" is
>>> the term the hypervisor itself uses internally (i.e., set_max_mem()
>>> actually changes a domain's max_pages value).
>>>
>>> So in this case both guest memory and option roms are created using
>>> hypervisor pages.
>>>
>>>> Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
>>>> because of optional ROMs etc.
>>> So a translation of this using "memory/pages" terminology would be:
>>>
>>> QEMU may need extra pages from Xen to implement option ROMS, and so at
>>> the moment it calls set_max_mem() to increase max_pages so that it can
>>> allocate more pages to the guest.  libxl doesn't know what max_pages a
>>> domain needs prior to qemu start-up.
>>>
>>>> Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
>>>> there is no mechanism to community between QEMU and libxl. This is an
>>>> area that needs improvement, we've encountered problems in this area
>>>> before.
>>> [translating]
>>> Libxl doesn't know max_pages  even after qemu start-up, because there
>>> is no mechanism to communicate between qemu and libxl.

This is not 100% true.  libxl and QEMU do communicate.  Currently QEMU
does not record its changes, but I have not searched all the info you
can get.  Also libxl "knows" what it set maxmem to before QEMU starts,
and it can read (via xc) what it is after QEMU has made all its
changes.  QEMU only responds to commands from libxl at this time.
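
As a sketch of that read-back (the helper name is invented; libxl does
not currently do this), libxl could record the device model's
adjustment once QEMU has finished starting up:

    /* Sketch with an invented helper name: compare the hypervisor's
     * view after QEMU start-up with what libxl set before spawning it. */
    static int record_dm_maxmem_bump(libxl__gc *gc, uint32_t domid,
                                     uint64_t maxmem_before_kb)
    {
        libxl_dominfo info;
        int rc = 0;

        libxl_dominfo_init(&info);
        if (libxl_domain_info(CTX, &info, domid)) {
            rc = ERROR_FAIL;
            goto out;
        }
        LOG(DEBUG, "device model bumped maxmem by %"PRIu64" KiB",
            info.max_memkb - maxmem_before_kb);
        /* A real implementation would store this value, e.g. in the
         * saved libxl_domain_config, under the domain userdata lock. */
    out:
        libxl_dominfo_dispose(&info);
        return rc;
    }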

>>>> QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
>>>> pages are only accounted in hypervisor. During migration, libxl
>>>> (currently) doesn't extract that value from hypervisor.
>>> [translating]
>>> qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
>>> Those pages are only accounted for in the hypervisor.  libxl
>>> (currently) does not extract that value from the hypervisor.

Yes.

>>>> So now the problem is on the remote end:
>>>>
>>>> 1. Libxl indicates domain needs X pages.
>>>> 2. Domain actually needs X + N pages.
>>>> 3. Remote end tries to write N more pages and fail.
>>>>
>>>> This behaviour currently doesn't affect normal migration (that you
>>>> transfer libxl JSON to remote, construct a domain, then start QEMU)
>>>> because QEMU won't bump hv_max_memkb again. This is by design and
>>>> reflected in QEMU code.
>>> I don't understand this paragraph -- does the remote domain actually
>>> need X+N pages or not?  If it does, in what way does this behavior
>>> "not affect normal migration"?

As far as I know it does.  And it affects "xl save" and "xl restore".

However you need more than 4 e1000 NICs (for example) to see this.

>>>> This behaviour affects COLO and becomes a bug in that case, because
>>>> secondary VM's QEMU doesn't go through the same start-of-day
>>>> initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
>>>> hv_max_memkb inside QEMU.
>>>>
>>>> Andrew plans to embed JSON inside migration v2 and COLO is based on
>>>> migration v2. The bug is fixed if JSON is correct in the first place.
>>>>
>>>> As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
>>>> it should be fixed for the benefit of COLO.
>>>>
>>>> So here is a proof of concept patch to record and honour that value
>>>> during migration.  A new field is added in IDL. Note that we don't
>>>> provide xl level config option for it and mandate it to be default value
>>>> during domain creation. This is to prevent libxl user from using it to
>>>> avoid unforeseen repercussions.
>>>>
>>>> This patch is compiled test only. If we agree this is the way to go I
>>>> will test and submit a proper patch.
>>> Reading max_pages from Xen and setting it on the far side seems like a
>>> reasonable option.
>> It is the wrong level to fix the bug.  Yes - it will (and does) fix one
>> manifestation of the bug, but does not solve the problem.
>>
>>>   Is there a reason we can't add a magic XC_SAVE_ID
>>> for v1, like we do for other parameters?

Wei Liu failed to refer to:

Date: Tue, 14 Apr 2015 18:06:26 -0400
Subject: [PATCH v2 1/1] tools: Handle xc_maxmem adjustments
Thread-Topic: [PATCH v2 1/1] tools: Handle xc_maxmem adjustments
Thread-Index: AdB2/0pJoAZieMWqTomVEBNbV7iiEA==
Message-ID: <1429049186-27343-1-git-send-email-dslutz@verizon.com>
References: <20150414175437.GB11717@zion.uk.xensource.com>
In-Reply-To: <20150414175437.GB11717@zion.uk.xensource.com>

which is where this was coded up.

>> Amongst other things, playing with xc_domain_setmaxmem() is liable to
>> cause a PoD domain to be shot by Xen because the PoD cache was not
>> adjusted at the same time that maxmem was.
> As far as I can tell that's completely unrelated.
>
> The PoD target needs to be updated when you change the *balloon driver*
> target in xenstore.  libxl_domain_setmaxmem() for instance won't update
> the PoD target either.
>

And xc_maxmem does not change xenstore.

>> Only libxl is in a position to safely adjust domain memory.
> I was initially of the same mind as you in this matter.  But I think if
> you stop thinking about this as "the domain's memory", and start instead
> thinking about it as qemu modifying the state of the machine (by, say,
> adding option roms), then I think it can be less strict.  The *memory*
> (i.e., what is seen by the guest as virtual RAM) is controlled by libxl,
> but the *pages* may be manipulated by others (as they are by the balloon
> driver, for instance).
>
> Although actually -- I guess that exposes another issue with this: what
> if someone calls setmaxmem in libxl?  libxl doesn't know about the extra
> N pages that qemu wanted, so it won't factor that in to its setmaxmem
> calculation.  In particular, qemu's read-modify-write of setmaxmem isn't
> idempotent -- there's almost certainly a race if someone were to call
> libxl_domain_setmaxmem at the same time qemu was trying to add a few pages.

There have been changes in this area.  libxl_setmem now reads,
modifies and writes.

> In any case, as Wei said, you have the problem that by the time qemu
> figures out that it wants some extra pages to implement an option rom,
> libxl has already determined that its job was complete and walked off.

Not in my testing.  libxl wants to get the VNC "display" that QEMU
configured, which happens after QEMU has changed maxmem.

> Changing that requires a more architectural change.
>
> Perhaps joining this with the security deprivileging thing would make
> some sense?  I.e., have qemu do no hypercalls at all, but have them done
> instead by a "babysitter" process from libxl, which could then update
> the domain config state as well.

Not sure.
    -Don Slutz

>  -George
>


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-03 13:22 ` George Dunlap
  2015-06-03 13:32   ` Andrew Cooper
@ 2015-06-04  9:14   ` Wei Liu
  2015-06-04  9:26     ` Ian Campbell
  1 sibling, 1 reply; 18+ messages in thread
From: Wei Liu @ 2015-06-04  9:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Ian Jackson,
	Don Slutz, andrew, xen-devel, Yang Hongyang

On Wed, Jun 03, 2015 at 02:22:25PM +0100, George Dunlap wrote:
> On Tue, Jun 2, 2015 at 3:05 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> > Previous discussion at [0].
> >
> > For the benefit of discussion, we refer to max_memkb inside hypervisor
> > as hv_max_memkb (name subject to improvement). That's the maximum number
> > of memory a domain can use.
> 
> Why don't we try to use "memory" for virtual RAM that we report to the
> guest, and "pages" for what exists inside the hypervisor?  "Pages" is
> the term the hypervisor itself uses internally (i.e., set_max_mem()
> actually changes a domain's max_pages value).
> 
> So in this case both guest memory and option roms are created using
> hypervisor pages.
> 
> > Libxl doesn't know hv_max_memkb for a domain needs prior to QEMU start-up
> > because of optional ROMs etc.
> 
> So a translation of this using "memory/pages" terminology would be:
> 
> QEMU may need extra pages from Xen to implement option ROMS, and so at
> the moment it calls set_max_mem() to increase max_pages so that it can
> allocate more pages to the guest.  libxl doesn't know what max_pages a
> domain needs prior to qemu start-up.
> 
> > Libxl doesn't know the hv_max_memkb even after QEMU start-up, because
> > there is no mechanism to community between QEMU and libxl. This is an
> > area that needs improvement, we've encountered problems in this area
> > before.
> 
> [translating]
> Libxl doesn't know max_pages  even after qemu start-up, because there
> is no mechanism to communicate between qemu and libxl.
> 
> > QEMU calls xc_domain_setmaxmem to increase hv_max_memkb by N pages. Those
> > pages are only accounted in hypervisor. During migration, libxl
> > (currently) doesn't extract that value from hypervisor.
> 
> [translating]
> qemu calls xc_domain_setmaxmem to increase max_pages by N pages.
> Those pages are only accounted for in the hypervisor.  libxl
> (currently) does not extract that value from the hypervisor.
> 
> > So now the problem is on the remote end:
> >
> > 1. Libxl indicates domain needs X pages.
> > 2. Domain actually needs X + N pages.
> > 3. Remote end tries to write N more pages and fail.
> >
> > This behaviour currently doesn't affect normal migration (that you
> > transfer libxl JSON to remote, construct a domain, then start QEMU)
> > because QEMU won't bump hv_max_memkb again. This is by design and
> > reflected in QEMU code.
> 
> I don't understand this paragraph -- does the remote domain actually
> need X+N pages or not?  If it does, in what way does this behavior
> "not affect normal migration"?
> 

I was wrong. I don't recollect how I came to that conclusion. It does
affect normal migration.

> > This behaviour affects COLO and becomes a bug in that case, because
> > secondary VM's QEMU doesn't go through the same start-of-day
> > initialisation (Hongyang, correct me if I'm wrong), i.e. no bumping
> > hv_max_memkb inside QEMU.
> >
> > Andrew plans to embed JSON inside migration v2 and COLO is based on
> > migration v2. The bug is fixed if JSON is correct in the first place.
> >
> > As COLO is not yet upstream, so this bug is not a blocker for 4.6. But
> > it should be fixed for the benefit of COLO.
> >
> > So here is a proof of concept patch to record and honour that value
> > during migration.  A new field is added in IDL. Note that we don't
> > provide xl level config option for it and mandate it to be default value
> > during domain creation. This is to prevent libxl user from using it to
> > avoid unforeseen repercussions.
> >
> > This patch is compiled test only. If we agree this is the way to go I
> > will test and submit a proper patch.
> 
> Reading max_pages from Xen and setting it on the far side seems like a
> reasonable option.  Is there a reason we can't add a magic XC_SAVE_ID

Yes. That's the correct behaviour we want to have. The question is
where we should put that value and when to set it.

> for v1, like we do for other parameters?
> 

The main objection is that we shouldn't call xc_domain_setmaxmem in the
middle of a migration stream.

Wei.

>  -George


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-04  9:14   ` Wei Liu
@ 2015-06-04  9:26     ` Ian Campbell
  2015-06-04  9:32       ` Andrew Cooper
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2015-06-04  9:26 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Ian Jackson, Don Slutz,
	andrew, xen-devel, Yang Hongyang

On Thu, 2015-06-04 at 10:14 +0100, Wei Liu wrote:
> The main objection is that we shouldn't call xc_domain_setmaxmem in the
> middle of a migration stream.

In the middle of an _xc_ migration stream.

This seems like the sort of thing it would be OK to have in a (to be
introduced) libxl stream (which would itself contain the xc stream as a
data item).

I think we are expecting such a thing to be introduced as part of the
libxl side of migration v2, aren't we?

Ian.


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-04  9:26     ` Ian Campbell
@ 2015-06-04  9:32       ` Andrew Cooper
  2015-06-04  9:46         ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Cooper @ 2015-06-04  9:32 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Ian Jackson, Don Slutz,
	andrew, xen-devel, Yang Hongyang

On 04/06/15 10:26, Ian Campbell wrote:
> On Thu, 2015-06-04 at 10:14 +0100, Wei Liu wrote:
>> The main objection is that we shouldn't call xc_domain_setmaxmem in the
>> middle of a migration stream.
> In the middle of an _xc_ migration stream.
>
> This seems like the sort of thing it would be OK to have in a (to be
> introduced) libxl stream (which would itself contain the xc stream as a
> data item).
>
> I think we are expecting such a thing to be introduced as part of the
> libxl side of migration v2, aren't we?

No.  libxl migration v2 will not be affecting the behaviour here.

The only reasonable way I see this being fixed is for the libxl json
blob to contain the correct size of the VM, and for
libxl_domain_create() to make an appropriately sized VM in the first place.

One extension which libxl migration v2 will bring is the ability to send
a new json blob later in the stream, but such a fix still requires the
json blob to have the correct size in it.

~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-04  9:32       ` Andrew Cooper
@ 2015-06-04  9:46         ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2015-06-04  9:46 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Ian Jackson,
	Don Slutz, andrew, xen-devel, Yang Hongyang

On Thu, 2015-06-04 at 10:32 +0100, Andrew Cooper wrote:
> On 04/06/15 10:26, Ian Campbell wrote:
> > On Thu, 2015-06-04 at 10:14 +0100, Wei Liu wrote:
> >> The main objection is that we shouldn't call xc_domain_setmaxmem in the
> >> middle of a migration stream.
> > In the middle of an _xc_ migration stream.
> >
> > This seems like the sort of thing it would be OK to have in a (to be
> > introduced) libxl stream (which would itself contain the xc stream as a
> > data item).
> >
> > I think we are expecting such a thing to be introduced as part of the
> > libxl side of migration v2, aren't we?
> 
> No.  libxl migration v2 will not be affecting the behaviour here.

By "such a thing" I was referring to "a libxl stream format", not this
piece of data specifically.

This stream format however will be extensible (I hope) and therefore
could accommodate state information which differs from the domain
configuration.

> The only reasonable way I see this being fixed is for the libxl json
> blob

For _a_ libxl json blob, not necessarily the existing one, which is the
user's domain configuration information.

IOW it doesn't have to be the existing libxl_domain_config blob. (Nor
does it really need to be json, but whatever)

>  to contain the correct size of the VM, and for
> libxl_domain_create() to make an appropriately sized VM in the first place.
> 
> One extension which libxl migration v2 will bring is the ability to send
> a new json blob later in the stream, but such a fix still requires the
> json blob to have the correct size in it.
> 
> ~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-02 17:40       ` Wei Liu
@ 2015-06-04 15:28         ` Wei Liu
  2015-06-04 15:49           ` Ian Campbell
  2015-06-05 13:09           ` Wei Liu
  0 siblings, 2 replies; 18+ messages in thread
From: Wei Liu @ 2015-06-04 15:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Ian Jackson, dslutz,
	xen-devel, yanghy

On Tue, Jun 02, 2015 at 06:40:56PM +0100, Wei Liu wrote:
> On Tue, Jun 02, 2015 at 05:11:02PM +0100, Andrew Cooper wrote:
> > On 02/06/15 16:49, Ian Campbell wrote:
> > > On Tue, 2015-06-02 at 15:08 +0100, Wei Liu wrote:
> > > [...]
> > >>> So here is a proof of concept patch to record and honour that value
> > >>> during migration.  A new field is added in IDL. Note that we don't
> > >>> provide xl level config option for it and mandate it to be default value
> > >>> during domain creation. This is to prevent libxl user from using it to
> > >>> avoid unforeseen repercussions.
> > >> [...]
> > >>> This field is mandated to be default value during guest creation to
> > >>> avoid unforeseen repercussions. It's only honour when restoring a guest.
> > > IMHO this means that the libxl API/IDL is the wrong place for this
> > > value. Only user and/or application serviceable parts belong in the API.
> > >
> > > So while I agree that this value need to be communicated across a
> > > migration, the JSON blob is not the right mechanism for doing so. IOW if
> > > you go down this general path I think you need a new
> > > field/record/whatever in the migration protocol at some layer or other
> > > (if not libxc then at the libxl layer).
> > >
> > > To my mind this "actual state" vs "user configured state" is more akin
> > > to the sorts of things which is in the hypervisor save blob or something
> > > like that (nb: This is not a suggestion that it should go there).
> > >
> > > IIRC Don also outlined another case, which is
> > >     xl create -p
> > >     xl migrate
> > >     xl unpause
> > >
> > > Which might need more thought if any bumping can happen after the
> > > migrate i.e. on unpause?
> > >
> > >
> > 
> > The problem is qemu using set_max_mem.  It should never have done so.
> > 
> > Nothing other than libxl should be using such hypercalls, at which point
> > libxl's idea of guest memory is accurate and the bug ceases to exist.
> > 
> 
> That would be an ideal solution, but it's not going to materialise
> any time soon because:
> 
> 1. Libxl cannot know how much more memory the guest needs prior to QEMU
>    start-up.
> 2. QEMU cannot communicate back the value to libxl because
>  2.1 there is no communication channel;
>  2.2 there is no libxld to update the config (maybe we can work around
>      this by waiting for QEMU during guest creation).
> 
> While we work towards that end, a short term solution might be:
> 
> 1. Make QEMU not call xc_domain_setmaxmem.
> 2. Provide a field in IDL and a config option in xl to indicate how
>    much more memory is needed. User can fill in that option as he / she
>    sees fit.
> 
> After we get to the ideal solution we can then deprecate that field /
> option.
> 

Here is an alternative we came up with during our IRL discussion:

1. QEMU writes the size of the additional memory to xenstore.
2. Libxl records that in the toolstack save path.
3. The remote end calls xc_domain_setmaxmem in the toolstack restore path
   (see the sketch below).
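
Roughly, in code (the xenstore node, handle names and exact hook points
are assumptions, not an agreed interface):

    /* (1) In QEMU, after its xc_domain_setmaxmem() calls, record the
     *     delta it added (xsh being QEMU's xenstore handle). */
    char path[80], val[32];
    snprintf(path, sizeof(path),
             "/local/domain/0/device-model/%u/qemu-extra-memkb", domid);
    snprintf(val, sizeof(val), "%"PRIu64, extra_memkb);
    xs_write(xsh, XBT_NULL, path, val, strlen(val));

    /* (2) In libxl's toolstack save path, read the node and put the
     *     value into the toolstack record that already travels with
     *     the stream. */
    const char *s = libxl__xs_read(gc, XBT_NULL,
            GCSPRINTF("/local/domain/0/device-model/%u/qemu-extra-memkb",
                      domid));
    uint64_t extra = s ? strtoull(s, NULL, 10) : 0;

    /* (3) In libxl's toolstack restore path on the remote end, bump
     *     maxmem before the guest runs. */
    if (extra)
        xc_domain_setmaxmem(CTX->xch, domid,
                            d_config->b_info.max_memkb
                            + LIBXL_MAXMEM_CONSTANT + extra);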

Wei.

> Wei.
> 
> > ~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-04 15:28         ` Wei Liu
@ 2015-06-04 15:49           ` Ian Campbell
  2015-06-05 13:09           ` Wei Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2015-06-04 15:49 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel, yanghy

On Thu, 2015-06-04 at 16:28 +0100, Wei Liu wrote:
> An alternative we came up with during our IRL discussion.
> 
> 1. QEMU writes the size of additional memory in xenstore.

The above is orthogonal to the below, I think. Your existing patch could be
reimplemented in terms of the below, while the above is potentially a
separate piece of work.

> 2. Libxl record that in toolstack save path.
> 3. Remote end calls xc_domain_setmaxmem in toolstack restore path.
> 
> Wei.
> 
> > Wei.
> > 
> > > ~Andrew


* Re: RFC: QEMU bumping memory limit and domain restore
  2015-06-04 15:28         ` Wei Liu
  2015-06-04 15:49           ` Ian Campbell
@ 2015-06-05 13:09           ` Wei Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Wei Liu @ 2015-06-05 13:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Ian Jackson, dslutz,
	xen-devel, yanghy

On Thu, Jun 04, 2015 at 04:28:00PM +0100, Wei Liu wrote:
[...]
> 2. Libxl record that in toolstack save path.
> 3. Remote end calls xc_domain_setmaxmem in toolstack restore path.

Unfortunately, toolstack restore is called after
libxl__xc_domain_restore, so it cannot be used to solve our problem.

Wei.

> 
> Wei.
> 
> > Wei.
> > 
> > > ~Andrew


end of thread

Thread overview: 18+ messages
2015-06-02 14:05 RFC: QEMU bumping memory limit and domain restore Wei Liu
2015-06-02 14:08 ` Wei Liu
2015-06-02 15:49   ` Ian Campbell
2015-06-02 16:11     ` Andrew Cooper
2015-06-02 17:40       ` Wei Liu
2015-06-04 15:28         ` Wei Liu
2015-06-04 15:49           ` Ian Campbell
2015-06-05 13:09           ` Wei Liu
2015-06-02 17:32     ` Wei Liu
2015-06-03  1:05     ` Yang Hongyang
2015-06-03 13:22 ` George Dunlap
2015-06-03 13:32   ` Andrew Cooper
2015-06-03 13:53     ` George Dunlap
2015-06-03 16:11       ` Don Slutz
2015-06-04  9:14   ` Wei Liu
2015-06-04  9:26     ` Ian Campbell
2015-06-04  9:32       ` Andrew Cooper
2015-06-04  9:46         ` Ian Campbell
