* [PATCH 0 of 2 RFC] xl: move domains among NUMA nodes
@ 2012-07-06  9:54 Dario Faggioli
  2012-07-06  9:54 ` [PATCH 1 of 2 RFC] xl: parse extra_config options even when restoring Dario Faggioli
  2012-07-06  9:54 ` [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity Dario Faggioli
  0 siblings, 2 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06  9:54 UTC
  To: xen-devel
  Cc: George Dunlap, Zhang, Yang Z, Ian Jackson, Ian Campbell, Andre Przywara

Hello Everyone,

FIRST OF ALL, this is an RFC and, more importantly, it implements a new feature
(although it's a small one), so it is not meant for 4.2. Of course, comments
are more than welcome, if you happen to find the time to look at it (after all,
that's why I'm releasing it)... Just don't rush if you're busy with some
pressing release-related stuff! :-P

The above being stated, this series tries to tackle the problem of moving an
already existing domain from one (or more) NUMA node(s) to (some) other(s). In
fact, changing vcpu-affinity affects the node affinity of the domain itself,
but this is only effective wrt the memory that will be allocated in the future,
while all the existing pages are left untouched. This of course makes a lot of
sense, but there might be the need to _fully_ move a domain, both vcpu- and
memory-wise, and that's what this series tries to achieve.
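
With both patches applied, the intended usage looks like the sketch below
(the domain name and the cpu list are made up):

    # pin all of the domain's vcpus to pcpus 0-3, and also move its
    # memory to the NUMA node(s) those pcpus belong to
    xl vcpu-pin -M mydomain all 0-3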

Just let me add that the whole process looks a lot like a suspend/resume (or
like a local migration) to me, and that's why I implemented it like that for
now. I went for the suspend/resume approach, but I can turn it into a more
migration-oriented solution if wanted, as that would reduce the downtime
(although, in that case, we'd need to have twice the memory of the domain
available during the process).
Of course, I can also add the proper logic and interface to the hypervisor, but
I'm not sure it buys us that much, as there would still be something very
similar to suspension or migration, just invoked and used a little bit
differently (but of course I'm open to changing my mind :-)).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH 1 of 2 RFC] xl: parse extra_config options even when restoring
  2012-07-06  9:54 [PATCH 0 of 2 RFC] xl: move domains among NUMA nodes Dario Faggioli
@ 2012-07-06  9:54 ` Dario Faggioli
  2012-07-06  9:54 ` [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity Dario Faggioli
  1 sibling, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06  9:54 UTC
  To: xen-devel
  Cc: George Dunlap, Zhang, Yang Z, Ian Jackson, Ian Campbell, Andre Przywara

In preparation for the next patch. Of course, if no extra_config options
are provided by the caller, this is harmless and causes no functional
changes.
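
For illustration, a minimal sketch of a caller that relies on this (the
field values below are made up; patch 2 of this series is the real user):

    struct domain_create dom_info;

    memset(&dom_info, 0, sizeof(dom_info));
    dom_info.restore_file = savefile;          /* restoring, not creating */
    dom_info.extra_config = "cpus=\"0-3\"\n";  /* appended after the saved
                                                * config by this patch    */
    create_domain(&dom_info);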

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1706,21 +1706,6 @@ static int create_domain(struct domain_c
             if (ret) { fprintf(stderr, "Failed to read config file: %s: %s\n",
                                config_file, strerror(errno)); return ERROR_FAIL; }
         }
-        if (!restoring && extra_config && strlen(extra_config)) {
-            if (config_len > INT_MAX - (strlen(extra_config) + 2 + 1)) {
-                fprintf(stderr, "Failed to attach extra configration\n");
-                return ERROR_FAIL;
-            }
-            /* allocate space for the extra config plus two EOLs plus \0 */
-            config_data = realloc(config_data, config_len
-                + strlen(extra_config) + 2 + 1);
-            if (!config_data) {
-                fprintf(stderr, "Failed to realloc config_data\n");
-                return ERROR_FAIL;
-            }
-            config_len += sprintf(config_data + config_len, "\n%s\n",
-                extra_config);
-        }
         config_source=config_file;
     } else {
         if (!config_data) {
@@ -1731,6 +1716,22 @@ static int create_domain(struct domain_c
         config_source = "<saved>";
     }
 
+    if (extra_config && strlen(extra_config)) {
+        if (config_len > INT_MAX - (strlen(extra_config) + 2 + 1)) {
+            fprintf(stderr, "Failed to attach extra configration\n");
+            return ERROR_FAIL;
+        }
+        /* allocate space for the extra config plus two EOLs plus \0 */
+        config_data = realloc(config_data, config_len
+            + strlen(extra_config) + 2 + 1);
+        if (!config_data) {
+            fprintf(stderr, "Failed to realloc config_data\n");
+            return ERROR_FAIL;
+        }
+        config_len += sprintf(config_data + config_len, "\n%s\n",
+            extra_config);
+    }
+
     if (!dom_info->quiet)
         printf("Parsing config from %s\n", config_source);


* [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06  9:54 [PATCH 0 of 2 RFC] xl: move domains among NUMA nodes Dario Faggioli
  2012-07-06  9:54 ` [PATCH 1 of 2 RFC] xl: parse extra_config options even when restoring Dario Faggioli
@ 2012-07-06  9:54 ` Dario Faggioli
  2012-07-06 12:53   ` George Dunlap
  1 sibling, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06  9:54 UTC
  To: xen-devel
  Cc: George Dunlap, Zhang, Yang Z, Ian Jackson, Ian Campbell, Andre Przywara

By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
memory "movement" is achieved by suspending the domain to a temporary file
and resuming it with the new vcpu-affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -606,7 +606,7 @@ after B<vcpu-set>, go to B<SEE ALSO> sec
 Lists VCPU information for a specific domain.  If no domain is
 specified, VCPU information for all domains will be provided.
 
-=item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus>
+=item B<vcpu-pin> [I<OPTIONS>] I<domain-id> I<vcpu> I<cpus>
 
 Pins the VCPU to only run on the specific CPUs.  The keyword
 B<all> can be used to apply the I<cpus> list to all VCPUs in the
@@ -617,6 +617,22 @@ different run state is appropriate.  Pin
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+B<OPTIONS>
+
+=over 4
+
+=item I<-M>
+
+Valid only when I<vcpu> is B<all>. Besides pinning all the VCPUs to
+I<cpus>, this also tries to make sure that all the memory for the domain
+comes from the NUMA nodes I<cpus> belong to. This involves suspending
+and resuming the domain, so some (hopefully small) downtime should be
+expected, as well as a change of the domain id.
+
+=back
+
+=back
+
 =item B<vm-list>
 
 Prints information about guests. This list excludes information about
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4061,7 +4061,52 @@ int main_vcpulist(int argc, char **argv)
     return 0;
 }
 
-static void vcpupin(const char *d, const char *vcpu, char *cpu)
+/*
+ * The easiest way to have all the memory allocated so that it reflects
+ * the vcpu affinity is to save and re-create the domain (as happens
+ * when migrating or suspending/resuming).
+ */
+static int vcpupin_move(const char *d, char *cpu)
+{
+    struct domain_create dom_info;
+    char savefile[] = "/tmp/vcpupin-autosave.XXXXXX";
+    char extra_config[1024];
+    uint8_t *config_data;
+    int config_len;
+    int save_fd;
+
+    /* Setup a temporary core file for the domain */
+    save_fd = mkostemp(savefile, 0); /* mkostemp() already opens O_RDWR|O_CREAT|O_EXCL */
+    if (save_fd < 0)
+        return -1;
+
+    /* Save and destroy the current incarnation of the domain */
+    MUST(libxl_userdata_retrieve(ctx, domid, "xl", &config_data, &config_len));
+    save_domain_core_writeconfig(save_fd, savefile, config_data, config_len);
+
+    MUST(libxl_domain_suspend(ctx, domid, save_fd, LIBXL_SUSPEND_LIVE, NULL));
+    MUST(libxl_domain_destroy(ctx, domid));
+    close(save_fd);
+    domid = INVALID_DOMID;
+
+    /* Default parameters plus some extra_config for domain re-creation */
+    memset(&dom_info, 0, sizeof(dom_info));
+    dom_info.quiet = 1;
+    dom_info.migrate_fd = -1;
+    dom_info.monitor = dom_info.daemonize = 1;
+    dom_info.incr_generationid = 1;
+    dom_info.restore_file = savefile;
+    snprintf(extra_config, sizeof(extra_config), "cpus=\"%s\"\n", cpu);
+    dom_info.extra_config = extra_config;
+
+    /* Finally, re-create the domain */
+    MUST(create_domain(&dom_info));
+    unlink(savefile);
+
+    return 0;
+}
+
+static void vcpupin(const char *d, const char *vcpu, char *cpu, int move)
 {
     libxl_vcpuinfo *vcpuinfo;
     libxl_cpumap cpumap;
@@ -4078,6 +4123,10 @@ static void vcpupin(const char *d, const
         }
         vcpuid = -1;
     }
+    if (vcpuid != -1 && move) {
+        fprintf(stderr, "Error: can only move when VCPU is \"all\"\n");
+        return;
+    }
 
     find_domain(d);
 
@@ -4094,18 +4143,22 @@ static void vcpupin(const char *d, const
         }
     }
     else {
-        if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) {
-            fprintf(stderr, "libxl_list_vcpu failed.\n");
-            goto vcpupin_out1;
-        }
-        for (i = 0; i < nb_vcpu; i++) {
-            if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid,
-                                       &cpumap) == -1) {
-                fprintf(stderr, "libxl_set_vcpuaffinity failed"
-                                " on vcpu `%u'.\n", vcpuinfo[i].vcpuid);
+        if (move && vcpupin_move(d, cpu))
+            fprintf(stderr, "Could not move the domain\n");
+        else {
+            if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) {
+                fprintf(stderr, "libxl_list_vcpu failed.\n");
+                goto vcpupin_out1;
             }
-        }
-        libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
+            for (i = 0; i < nb_vcpu; i++) {
+                if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid,
+                                           &cpumap) == -1) {
+                    fprintf(stderr, "libxl_set_vcpuaffinity failed"
+                                    " on vcpu `%u'.\n", vcpuinfo[i].vcpuid);
+                }
+            }
+            libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
+        }
     }
   vcpupin_out1:
     libxl_cpumap_dispose(&cpumap);
@@ -4115,12 +4168,20 @@ static void vcpupin(const char *d, const
 
 int main_vcpupin(int argc, char **argv)
 {
+    int move_memory = 0;
     int opt;
 
-    if ((opt = def_getopt(argc, argv, "", "vcpu-pin", 3)) != -1)
-        return opt;
-
-    vcpupin(argv[optind], argv[optind+1] , argv[optind+2]);
+    while ((opt = def_getopt(argc, argv, "M", "vcpu-pin", 3)) != -1) {
+        switch(opt) {
+        case 0: case 2:
+            return opt;
+        case 'M':
+            move_memory = 1;
+            break;
+        }
+    }
+
+    vcpupin(argv[optind], argv[optind+1], argv[optind+2], move_memory);
     return 0;
 }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -207,7 +207,10 @@ struct cmd_spec cmd_table[] = {
     { "vcpu-pin",
       &main_vcpupin, 0, 1,
       "Set which CPUs a VCPU can use",
-      "<Domain> <VCPU|all> <CPUs|all>",
+      "[options] <Domain> <VCPU|all> <CPUs|all>",
+      "-M    Move the memory to th nodes corresponding to CPUs\n"
+      "      (involves suspending/resuming the domain, so some\n"
+      "      downtime is to be expected)"
     },
     { "vcpu-set",
       &main_vcpuset, 0, 1,


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06  9:54 ` [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity Dario Faggioli
@ 2012-07-06 12:53   ` George Dunlap
  2012-07-06 13:25     ` Ian Campbell
  2012-07-06 13:57     ` Dario Faggioli
  0 siblings, 2 replies; 13+ messages in thread
From: George Dunlap @ 2012-07-06 12:53 UTC
  To: Dario Faggioli
  Cc: Zhang, Yang Z, Andre Przywara, Ian Jackson, Ian Campbell, xen-devel

On 06/07/12 10:54, Dario Faggioli wrote:
> By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
> memory "movement" is achieved by suspending the domain to a temporary file
> and resuming it with the new vcpu-affinity
Hmm... this will work and be reliable, but it seems a bit clunky.  Long 
term we want to be able to do node migration in the background without 
shutting down a VM, right?  If that can be done in the 4.3 timeframe, 
then it seems unnecessary to implement something like this.

  -George




* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 12:53   ` George Dunlap
@ 2012-07-06 13:25     ` Ian Campbell
  2012-07-06 13:30       ` George Dunlap
  2012-07-06 14:05       ` Dario Faggioli
  2012-07-06 13:57     ` Dario Faggioli
  1 sibling, 2 replies; 13+ messages in thread
From: Ian Campbell @ 2012-07-06 13:25 UTC
  To: George Dunlap
  Cc: Zhang, Yang Z, Ian Jackson, Dario Faggioli, Andre Przywara, xen-devel

On Fri, 2012-07-06 at 13:53 +0100, George Dunlap wrote:
> On 06/07/12 10:54, Dario Faggioli wrote:
> > By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
> > memory "movement" is achieved by suspending the domain to a temporary file
> > and resuming it with the new vcpu-affinity
> Hmm... this will work and be reliable, but it seems a bit clunky.  Long 
> term we want to be able to do node migration in the background without 
> shutting down a VM, right?  If that can be done in the 4.3 timeframe, 
> then it seems unnecessary to implement something like this.

We could do something cleverer for HVM (or hybrid) guests to migrate
pages while the guest is live but migrating a page under a PV guest's
feet requires quiescing it in the style of a migrate.

We could probably manage to make it such that you just need the
pause/frob/unpause phases without actually writing RAM to disk, and that
infrastructure might be useful for other reasons I suppose. (e.g. I
think bad page offlining currently uses a similar save/restore trick)
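
In (very rough) pseudo-C, something like the sketch below:
move_chunk_to_node() is made up, since no such libxc call exists today,
while the pause/unpause calls are the real libxc API:

    #include <xenctrl.h>

    /* Move a domain's memory to `node', one small chunk at a time. */
    static int move_domain_memory(xc_interface *xch, uint32_t domid,
                                  int node, unsigned long nr_pages)
    {
        unsigned long chunk = 256;  /* 1MB worth of 4k pages per step */
        unsigned long done;
        int rc = 0;

        for (done = 0; done < nr_pages && !rc; done += chunk) {
            xc_domain_pause(xch, domid);   /* quiesce (needed for PV) */
            /* the "frob" step: swap this chunk for pages on `node' */
            rc = move_chunk_to_node(xch, domid, done, chunk, node);
            xc_domain_unpause(xch, domid);
        }
        return rc;
    }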

Ian.


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 13:25     ` Ian Campbell
@ 2012-07-06 13:30       ` George Dunlap
  2012-07-06 13:38         ` Ian Campbell
  2012-07-06 14:05       ` Dario Faggioli
  1 sibling, 1 reply; 13+ messages in thread
From: George Dunlap @ 2012-07-06 13:30 UTC
  To: Ian Campbell
  Cc: Zhang, Yang Z, Ian Jackson, Dario Faggioli, Andre Przywara, xen-devel

On 06/07/12 14:25, Ian Campbell wrote:
> On Fri, 2012-07-06 at 13:53 +0100, George Dunlap wrote:
>> On 06/07/12 10:54, Dario Faggioli wrote:
>>> By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
>>> memory "movement" is achieved by suspending the domain to a temporary file
>>> and resuming it with the new vcpu-affinity
>> Hmm... this will work and be reliable, but it seems a bit clunky.  Long
>> term we want to be able to do node migration in the background without
>> shutting down a VM, right?  If that can be done in the 4.3 timeframe,
>> then it seems unnecessary to implement something like this.
> We could do something cleverer for HVM (or hybrid) guests to migrate
> pages while the guest is live but migrating a page under a PV guest's
> feet requires quiescing it in the style of a migrate.
>
> We could probably manage to make it such that you just need the
> pause/frob/unpause phases without actually writing RAM to disk, and that
> infrastructure might be useful for other reasons I suppose. (e.g. I
> think bad page offlining currently uses a similar save/restore trick)
Yes, I think doing this a chunk at a time is likely to be much
preferable to writing the whole thing to disk and reading it out
again.  Doing it "live" may end up with more total overhead, but it
won't require the VM actually being down for the suspend/resume cycle,
which on a large guest may be tens of seconds.

  -George


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 13:30       ` George Dunlap
@ 2012-07-06 13:38         ` Ian Campbell
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2012-07-06 13:38 UTC
  To: George Dunlap
  Cc: Zhang, Yang Z, Ian Jackson, Dario Faggioli, Andre Przywara, xen-devel

On Fri, 2012-07-06 at 14:30 +0100, George Dunlap wrote:
> On 06/07/12 14:25, Ian Campbell wrote:
> > On Fri, 2012-07-06 at 13:53 +0100, George Dunlap wrote:
> >> On 06/07/12 10:54, Dario Faggioli wrote:
> >>> By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
> >>> memory "movement" is achieved by suspending the domain to a temporary file
> >>> and resuming it with the new vcpu-affinity
> >> Hmm... this will work and be reliable, but it seems a bit clunky.  Long
> >> term we want to be able to do node migration in the background without
> >> shutting down a VM, right?  If that can be done in the 4.3 timeframe,
> >> then it seems unnecessary to implement something like this.
> > We could do something cleverer for HVM (or hybrid) guests to migrate
> > pages while the guest is live but migrating a page under a PV guest's
> > feet requires quiescing it in the style of a migrate.
> >
> > We could probably manage to make it such that you just need the
> > pause/frob/unpause phases without actually writing RAM to disk, and that
> > infrastructure might be useful for other reasons I suppose. (e.g. I
> > think bad page offlining currently uses a similar save/restore trick)
> Yes, I think doing this a chunk at a time is likely to be much
> preferable to writing the whole thing to disk and reading it out
> again.  Doing it "live" may end up with more total overhead, but it
> won't require the VM actually being down for the suspend/resume cycle,
> which on a large guest may be tens of seconds.

Tim also suggested a PV protocol whereby the tools hint to the guest "if
you XENMEM_exchange this MFN you will get back an MFN which is much
better for your performance".

(at least I think that's what Tim suggested, I may have misunderstood)

Ian.


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 12:53   ` George Dunlap
  2012-07-06 13:25     ` Ian Campbell
@ 2012-07-06 13:57     ` Dario Faggioli
  2012-07-06 14:04       ` George Dunlap
  1 sibling, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06 13:57 UTC
  To: George Dunlap
  Cc: Zhang, Yang Z, Andre Przywara, Ian Jackson, Ian Campbell, xen-devel


On Fri, 2012-07-06 at 13:53 +0100, George Dunlap wrote:
> On 06/07/12 10:54, Dario Faggioli wrote:
> > By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
> > memory "movement" is achieved by suspending the domain to a temporary file
> > and resuming it with the new vcpu-affinity
> Hmm... this will work and be reliable, but it seems a bit clunky.
>
If I can ask, the idea or the implementation? :-)

> Long 
> term we want to be able to do node migration in the background without 
> shutting down a VM, right?  
>
Definitely, and we also want to do that automatically, according to some
load/performance/whatever run-time measurement, without any user
intervention. However ...

> If that can be done in the 4.3 timeframe, 
> then it seems unnecessary to implement something like this.
>
... I think something like this could still be useful.

IOW, I think it would be worthwhile to have both the automatic memory
migration happening in the background and something like this explicit,
do-it-all-now mechanism, as they serve different purposes.

It's sort of like cpu scheduling, where you have free tasks/vcpus
that run wherever the scheduler wants, but you also have pinning, if at
some point you want to confine them to some subset of cpus for whatever
reason. Also, as it happens with pinning, you can do it at domain
creation time, but then you might change your mind later and you'd need
something to make the system reflect your actual needs.

So, if you allow me the comparison: initial vcpu pinning is like NUMA
automatic placement, automatic memory migration happening in the background
is like "free scheduling", and this feature is like what we get when
running `xl vcpu-pin'. (I hope the comparison helps in clarifying my
view rather than making it even more obscure :-)).

Also, as an example, what happens if you want to create a new VM and you
have enough memory, but not in a single node because of memory
fragmentation? Ideally, that won't happen thanks to the automatic memory
migration, but that's not guaranteed to be possible (there might be pinned
VMs, cpupools, and all sorts of stuff), or to happen effectively and
soon enough (it will be a heuristic after all). In such a situation, I
figure this feature could be a useful tool for a system administrator.

Then again, I'm talking about the feature. The implementation might well
change, and I really expect being able to migrate memory in the background
to help beautify it (the implementation) quite a bit.

Sorry for writing so much. :-P

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 13:57     ` Dario Faggioli
@ 2012-07-06 14:04       ` George Dunlap
  2012-07-06 14:14         ` Dario Faggioli
  0 siblings, 1 reply; 13+ messages in thread
From: George Dunlap @ 2012-07-06 14:04 UTC
  To: Dario Faggioli
  Cc: Zhang, Yang Z, Andre Przywara, Ian Jackson, Ian Campbell, xen-devel

On 06/07/12 14:57, Dario Faggioli wrote:
> On Fri, 2012-07-06 at 13:53 +0100, George Dunlap wrote:
>> On 06/07/12 10:54, Dario Faggioli wrote:
>>> By introducing a new '-M' option to the `xl vcpu-pin' command. The actual
>>> memory "movement" is achieved by suspending the domain to a temporary file
>>> and resuming it with the new vcpu-affinity
>> Hmm... this will work and be reliable, but it seems a bit clunky.
>>
> If I can ask, the idea or the implementation? :-)
>
>> Long
>> term we want to be able to do node migration in the background without
>> shutting down a VM, right?
>>
> Definitely, and we also want to do that automatically, according to some
> load/performance/whatever run-time measurement, without any user
> intervention. However ...
>
>> If that can be done in the 4.3 timeframe,
>> then it seems unnecessary to implement something like this.
>>
> ... I think something like this could still be useful.
>
> IOW, I think it would be worthwhile to have both the automatic memory
> migration happening in the background and something like this explicit,
> do-it-all-now mechanism, as they serve different purposes.
Well, I think you can have an explicit, do-it-now mechanism that doesn't 
require suspending the VM; and in fact, I think such a mechanism would 
be much preferable.

Compare it to vcpu pinning: if you decided you wanted to change the vcpu 
pinning, would you rather just be able to pin the maps while the VM was 
running, and have it take effect immediately (as it does now), or would 
you rather have to suspend and resume the VM, taking it out of 
commission for 10 seconds or so?

I'm not opposed to the -M option meaning, "...and please move the memory 
based on the new nodemask".  But I don't think implementing it as 
suspend/resume is really that useful, except as a stopgap measure until 
we can get "live" memory migration working.  If we can get that before 
4.3, there's no need for the stopgap measure.

  -George


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 13:25     ` Ian Campbell
  2012-07-06 13:30       ` George Dunlap
@ 2012-07-06 14:05       ` Dario Faggioli
  2012-07-06 14:07         ` George Dunlap
  2012-07-06 14:42         ` Ian Campbell
  1 sibling, 2 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06 14:05 UTC
  To: Ian Campbell
  Cc: George Dunlap, Zhang, Yang Z, Ian Jackson, Andre Przywara, xen-devel


On Fri, 2012-07-06 at 14:25 +0100, Ian Campbell wrote: 
> We could do something cleverer for HVM (or hybrid) guests to migrate
> pages while the guest is live but migrating a page under a PV guest's
> feet requires quiescing it in the style of a migrate.
> 
Which, as we're talking about the same host, means reaching a point
where I have two full copies of the guest memory in RAM, or not?

(Just trying to figure out if I understood things sensibly).

> We could probably manage to make it such that you just need the
> pause/frob/unpause phases without actually writing RAM to disk, 
>
Of course, that was in my e-mail too (it's about the above that I'm not
100% sure...).

> and that
> infrastructure might be useful for other reasons I suppose. 
>
Definitely.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)





* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 14:05       ` Dario Faggioli
@ 2012-07-06 14:07         ` George Dunlap
  2012-07-06 14:42         ` Ian Campbell
  1 sibling, 0 replies; 13+ messages in thread
From: George Dunlap @ 2012-07-06 14:07 UTC
  To: Dario Faggioli
  Cc: Zhang, Yang Z, Andre Przywara, Ian Jackson, Ian Campbell, xen-devel

On 06/07/12 15:05, Dario Faggioli wrote:
> On Fri, 2012-07-06 at 14:25 +0100, Ian Campbell wrote:
>> We could do something cleverer for HVM (or hybrid) guests to migrate
>> pages while the guest is live but migrating a page under a PV guest's
>> feet requires quiescing it in the style of a migrate.
>>
> Which, as we're talking about the same host, means reaching a point
> where I have two full copies of the guest memory in RAM, or not?
>
> (Just trying to figure out if I understood things sensibly).
No, you can do it a batch at a time, based on how much free memory you
have.  If you have enough free memory to keep two copies in RAM at a
time, you might as well.  But even as little as moving 1MB at a time
would probably be easily achievable and not too much overhead.
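
Rough numbers, just to illustrate (the chunk and domain sizes are made up,
and the cost of the pause/copy/unpause round-trips is ignored):

    extra memory needed = chunk size               = 1MB
    iterations          = domain size / chunk size = 4GB / 1MB = 4096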

  -George


* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 14:04       ` George Dunlap
@ 2012-07-06 14:14         ` Dario Faggioli
  0 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-06 14:14 UTC
  To: George Dunlap
  Cc: Zhang, Yang Z, Andre Przywara, Ian Jackson, Ian Campbell, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1283 bytes --]

On Fri, 2012-07-06 at 15:04 +0100, George Dunlap wrote:
> > IOW, I think it would be worthwhile to have both the automatic memory
> > migration happening in the background and something like this explicit,
> > do-it-all-now mechanism, as they serve different purposes.
> Well, I think you can have an explicit, do-it-now mechanism that doesn't 
> require suspending the VM; and in fact, I think such a mechanism would 
> be much preferable.
> 
I fully agree on this.

> I'm not opposed to the -M option meaning, "...and please move the memory 
> based on the new nodemask".  But I don't think implementing it as 
> suspend/resume is really that useful, except as a stopgap measure until 
> we can get "live" memory migration working.  
>
I see, then we agree on everything, basically. :-P

> If we can get that before 
> 4.3, there's no need for the stopgap measure.
> 
Sure, that was the reason why it was an RFC after all... And it at least
triggered this discussion. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity
  2012-07-06 14:05       ` Dario Faggioli
  2012-07-06 14:07         ` George Dunlap
@ 2012-07-06 14:42         ` Ian Campbell
  1 sibling, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2012-07-06 14:42 UTC
  To: Dario Faggioli
  Cc: George Dunlap, Zhang, Yang Z, Ian Jackson, Andre Przywara, xen-devel

On Fri, 2012-07-06 at 15:05 +0100, Dario Faggioli wrote:
> On Fri, 2012-07-06 at 14:25 +0100, Ian Campbell wrote: 
> > We could do something cleverer for HVM (or hybrid) guests to migrate
> > pages while the guest is live but migrating a page under a PV guest's
> > feet requires quiescing it in the style of a migrate.
> > 
> Which, as we're talking about the same host, means reaching a point
> where I have two full copies of the guest memory in RAM, or not?
> 
> (Just trying to figure out if I understood things sensibly).

You could move it over in chunks, freeing as you go, so if the chunk size
was 1M you would only need 1M of extra memory, not the full domain size.

Ian.


end of thread

Thread overview: 13+ messages
2012-07-06  9:54 [PATCH 0 of 2 RFC] xl: move domains among NUMA nodes Dario Faggioli
2012-07-06  9:54 ` [PATCH 1 of 2 RFC] xl: parse extra_config options even when restoring Dario Faggioli
2012-07-06  9:54 ` [PATCH 2 of 2 RFC] xl: allow for moving the domain's memory when changing vcpu affinity Dario Faggioli
2012-07-06 12:53   ` George Dunlap
2012-07-06 13:25     ` Ian Campbell
2012-07-06 13:30       ` George Dunlap
2012-07-06 13:38         ` Ian Campbell
2012-07-06 14:05       ` Dario Faggioli
2012-07-06 14:07         ` George Dunlap
2012-07-06 14:42         ` Ian Campbell
2012-07-06 13:57     ` Dario Faggioli
2012-07-06 14:04       ` George Dunlap
2012-07-06 14:14         ` Dario Faggioli
